Automating Lazy-Loaded Image Scraping and Screenshots with Selenium WebDriver
Handling dynamic web pages often requires interacting with JavaScript-driven elements that load content only when they enter the viewport. A plain HTTP request is not enough here, because it returns only the initial HTML and never runs the scripts that populate the DOM asynchronously. Selenium WebDriver solves this by controlling a real browser instance, which allows actions such as scrolling to trigger load events and capturing the resulting state in screenshots.
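To make the limitation concrete, here is a minimal sketch of what a static parse might see. The markup in `STATIC_HTML` is hypothetical, and the `data-src` attribute is only one common lazy-loading convention, assumed here for illustration; the target site may use a different mechanism. The helper names `ImageCounter` and `count_images` are also invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical markup: what a plain HTTP response might contain before any
# JavaScript has run. Only the first image has a real src; the others keep
# their URL in data-src until a script swaps it in (an assumed convention).
STATIC_HTML = """
<div class="gallery">
  <img src="/static/a.jpg">
  <img data-src="/static/b.jpg">
  <img data-src="/static/c.jpg">
</div>
"""


class ImageCounter(HTMLParser):
    """Counts <img> tags that have a src versus those still deferred."""

    def __init__(self):
        super().__init__()
        self.loaded = 0
        self.deferred = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        if dict(attrs).get("src"):
            self.loaded += 1
        else:
            self.deferred += 1


def count_images(html):
    """Return (loaded, deferred) image counts for a static HTML string."""
    parser = ImageCounter()
    parser.feed(html)
    return parser.loaded, parser.deferred


print(count_images(STATIC_HTML))  # (1, 2)
```

A static parser sees the deferred images only as placeholders, which is why a browser that actually executes the page's scripts is needed.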
The following implementation demonstrates how to navigate to a target URL, repeatedly scroll down to reveal hidden content, capture a screenshot after each scroll, and count the image elements loaded so far. This approach is particularly useful for galleries or infinite-scroll pages, where retrieving the data depends on simulating user interaction.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options


class LazyLoadHandler:
    def __init__(self):
        # Configure headless browser options
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
        # Initialize the driver (ensure chromedriver is in PATH)
        self.driver = webdriver.Chrome(options=options)
        self.driver.set_window_size(1920, 1080)

    def process_page(self):
        target_url = "https://mm.taobao.com/self/album_photo.htm?spm=719.6642053.0.0.o5BDC0&user_id=687471686&album_id=183809402&album_flag=0"
        try:
            self.driver.get(target_url)
            # JavaScript snippet to scroll to the bottom
            scroll_script = "window.scrollTo(0, document.body.scrollHeight);"
            for iteration in range(50):
                # Execute scroll action
                self.driver.execute_script(scroll_script)
                # Allow time for dynamic content to render
                time.sleep(0.2)
                # Capture the current state of the viewport
                filename = f"screenshot_iteration_{iteration}.png"
                self.driver.save_screenshot(filename)
                # Locate image containers based on specific class structure
                elements = self.driver.find_elements(By.XPATH, '//div[@class="mm-photoW-cell-middle"]')
                count = len(elements)
                print(f"Iteration {iteration + 1}: Detected {count} loaded images.")
        except Exception as e:
            print(f"An error occurred during processing: {e}")
        finally:
            self.driver.quit()


if __name__ == '__main__':
    handler = LazyLoadHandler()
    handler.process_page()
Upon execution, the script initializes a headless browser session and navigates to the specified gallery. As the loop progresses, the page scrolls downward, triggering the lazy-loading mechanism. The console output should show the count of discovered image elements rising as more content is added to the DOM. Screenshots of the current viewport are saved sequentially, providing a visual record of the loading process at each step.
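The script's fixed 50 iterations and 0.2-second pause are arbitrary: the loop may keep scrolling long after everything has loaded, or pause too briefly on a slow connection. One possible refinement, sketched below, is to stop once the page height stops growing. The helper name `scroll_until_stable` and its `pause` and `max_rounds` parameters are hypothetical and not part of the original script; the only driver method it relies on is `execute_script`.

```python
import time


def scroll_until_stable(driver, pause=0.5, max_rounds=50):
    """Scroll to the bottom until document height stops growing.

    Returns the number of scroll actions performed. `driver` is any object
    with a Selenium-style execute_script(script) method.
    """
    height_script = "return document.body.scrollHeight"
    scroll_script = "window.scrollTo(0, document.body.scrollHeight);"

    last_height = driver.execute_script(height_script)
    for round_index in range(max_rounds):
        driver.execute_script(scroll_script)
        # Give lazy-loaded content a chance to be appended to the DOM.
        time.sleep(pause)
        new_height = driver.execute_script(height_script)
        if new_height == last_height:
            # No new content was appended after this scroll; assume we are done.
            return round_index + 1
        last_height = new_height
    return max_rounds
```

Inside `process_page`, a call such as `scroll_until_stable(self.driver, pause=0.5)` could stand in for the fixed loop, with the screenshot and element count taken afterwards. An unchanged height can also mean that content is simply loading slowly, so too short a pause may end the scrolling early.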