Fading Coder

Practical Dynamic Proxy Setup with Puppeteer for Improved Data Scraping Efficiency

Puppeteer is a Node.js library maintained by the Chrome DevTools team that provides programmatic control of headless or full Chrome/Chromium browsers. It exposes a high-level API for common browser automation tasks, including page navigation, screenshot capture, PDF generation, and network traffic monitoring. Routing Puppeteer through a dynamically configured proxy helps avoid IP-based anti-scraping restrictions and improves the reliability and efficiency of data scraping workflows.
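
What makes a proxy setup "dynamic" is usually that the proxy changes between runs or sessions. A minimal sketch of round-robin selection is shown here; the pool entries, `createProxyRotator`, and `toProxyServerArg` are hypothetical names for illustration, not part of Puppeteer. Because `--proxy-server` is a launch flag, switching proxies this way would mean launching a new browser for each one.

```javascript
// Hypothetical proxy pool; the hosts and ports are placeholders.
const proxyPool = [
  { host: 'proxy-a.example.com', port: 8080 },
  { host: 'proxy-b.example.com', port: 8080 },
  { host: 'proxy-c.example.com', port: 3128 },
];

// Returns a function that yields proxies from the pool in round-robin order.
function createProxyRotator(pool) {
  if (pool.length === 0) throw new Error('proxy pool is empty');
  let next = 0;
  return () => {
    const proxy = pool[next];
    next = (next + 1) % pool.length;
    return proxy;
  };
}

// Formats the value for Chromium's --proxy-server switch (no credentials).
function toProxyServerArg(proxy) {
  return `--proxy-server=http://${proxy.host}:${proxy.port}`;
}

const nextProxy = createProxyRotator(proxyPool);
console.log(toProxyServerArg(nextProxy()));
// prints "--proxy-server=http://proxy-a.example.com:8080"
```

Each call to `nextProxy()` returns the following entry and wraps around at the end of the pool, so the resulting argument can be passed in the `args` array of a fresh `puppeteer.launch()` call.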

Configure Proxy and Launch Browser

Prepare a working proxy server (HTTP or SOCKS) that can carry HTTPS traffic and whose IP addresses are reachable. The following snippet launches a Puppeteer browser instance that routes its traffic through the proxy:

const puppeteer = require('puppeteer');
const axios = require('axios').default;
const fs = require('fs').promises;
const path = require('path');

(async () => {
  // Proxy configuration parameters
  const proxyDomain = "proxy.example.com";
  const proxyPort = "8080";
  const proxyUsername = "your_proxy_user";
  const proxyPassword = "your_proxy_pass";

  // Chromium ignores credentials embedded in --proxy-server, so pass only
  // the scheme, host, and port here and authenticate on the page instead
  const proxyServerUrl = `http://${proxyDomain}:${proxyPort}`;

  // Launch browser with proxy enabled
  const browser = await puppeteer.launch({
    headless: "new",
    args: [
      `--proxy-server=${proxyServerUrl}`,
      '--no-sandbox',
      '--disable-setuid-sandbox'
    ]
  });

  const page = await browser.newPage();
  // Answer the proxy's authentication challenge with the credentials
  await page.authenticate({ username: proxyUsername, password: proxyPassword });
  // Set custom user agent to mimic real browser
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');

Navigate to Target Website

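Use page.goto() to load the target page. The networkidle2 option treats navigation as finished once there have been no more than two network connections for at least 500 ms, and the timeout caps the wait at 30 seconds: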

  const targetSite = "https://example.com/target-page";
  await page.goto(targetSite, { waitUntil: 'networkidle2', timeout: 30000 });
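
Requests routed through a proxy can fail or time out intermittently. One possible way to make navigation more tolerant is to retry page.goto() with a growing delay. The sketch below is illustrative only: `gotoWithRetry`, `backoffDelays`, and the default retry count and base delay are hypothetical names and values, not part of Puppeteer or of the script in this article.

```javascript
// Hypothetical helper: delays (in ms) to wait before each retry,
// doubling from baseMs on every step.
function backoffDelays(retries, baseMs) {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

// Calls page.goto() once, then retries after each delay.
// Rethrows the last error if every attempt fails.
async function gotoWithRetry(page, url, { retries = 2, baseMs = 1000 } = {}) {
  const delays = backoffDelays(retries, baseMs);
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Wait before the next attempt
        await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
      }
    }
  }
  throw lastError;
}
```

With these defaults a failing navigation is attempted three times in total, waiting 1 second and then 2 seconds between attempts.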

Wait for Target Elements

Wait for image elements to render before extracting their source URLs:

  await page.waitForSelector('img[src]', { timeout: 15000 });

Extract Image Resource URLs

Run code in the page context to collect the source URL of every image element that has a src attribute. Reading img.src returns the resolved absolute URL rather than the raw attribute value:

  const imageUrls = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('img[src]')).map(img => img.src.trim());
  });
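
The list returned by page.evaluate() may contain inline data: URIs or the same URL more than once, neither of which is worth requesting over HTTP. A minimal sketch of a clean-up step follows, assuming a hypothetical helper named `cleanImageUrls` that is not part of the original script:

```javascript
// Hypothetical helper: keep only absolute http(s) URLs and drop duplicates,
// preserving the order in which URLs first appear.
function cleanImageUrls(rawUrls) {
  const seen = new Set();
  const result = [];
  for (const raw of rawUrls) {
    let parsed;
    try {
      parsed = new URL(raw);
    } catch {
      continue; // not a parseable absolute URL
    }
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') continue;
    if (seen.has(parsed.href)) continue;
    seen.add(parsed.href);
    result.push(parsed.href);
  }
  return result;
}
```

In the script this could be applied as `cleanImageUrls(imageUrls)` before the download loop, so that only fetchable, distinct URLs are passed on.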

Download Scraped Images

Puppeteer does not include a native download method, so use Axios to fetch and save image files locally with proper error handling:

  const outputDir = path.join(__dirname, 'scraped-images');
  await fs.mkdir(outputDir, { recursive: true });

  const saveImage = async (imageUrl) => {
    try {
      const response = await axios({
        method: 'GET',
        url: imageUrl,
        responseType: 'stream',
        proxy: {
          host: proxyDomain,
          port: parseInt(proxyPort, 10),
          auth: {
            username: proxyUsername,
            password: proxyPassword
          }
        }
      });

      const fileName = path.basename(new URL(imageUrl).pathname) || `image-${Date.now()}.jpg`;
      const savePath = path.join(outputDir, fileName);

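      // `fs` above is the promise-based API, which has no createWriteStream,
      // so take it from the core module; awaiting pipeline() lets the
      // surrounding catch block handle stream errors as well
      const { createWriteStream } = require('fs');
      const { pipeline } = require('stream/promises');
      await pipeline(response.data, createWriteStream(savePath));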
    } catch (err) {
      console.log(`Skipped ${imageUrl}: ${err.message}`);
    }
  };

  for (const url of imageUrls) {
    await saveImage(url);
  }

  await browser.close();
})();
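
The script names each file after the last path segment of its URL, so two images such as /a/logo.png and /b/logo.png would overwrite each other. One possible way to avoid that is sketched below with a hypothetical `uniqueFileName` helper that tracks the names already used in a Set; the helper and the `image.jpg` fallback are assumptions for illustration.

```javascript
const path = require('path');

// Hypothetical helper: derive a local file name from an image URL and avoid
// collisions by appending a counter when the name has already been used.
function uniqueFileName(imageUrl, usedNames) {
  // URL paths always use forward slashes, so use the POSIX path functions
  const base = path.posix.basename(new URL(imageUrl).pathname) || 'image.jpg';
  const ext = path.posix.extname(base);
  const stem = ext ? base.slice(0, -ext.length) : base;

  let candidate = base;
  let counter = 1;
  while (usedNames.has(candidate)) {
    candidate = `${stem}-${counter}${ext}`;
    counter += 1;
  }
  usedNames.add(candidate);
  return candidate;
}

const used = new Set();
console.log(uniqueFileName('https://example.com/a/logo.png', used)); // prints "logo.png"
console.log(uniqueFileName('https://example.com/b/logo.png', used)); // prints "logo-1.png"
```

The Set would be created once before the download loop and passed to every call, so names stay unique across the whole run.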
