
Practical Dynamic Proxy Setup with Puppeteer for Improved Data Scraping Efficiency

Puppeteer is a Node.js library maintained by the Chrome DevTools team that provides programmatic control of headless or full Chrome/Chromium browsers. It exposes high-level APIs for common browser automation tasks, including page navigation, screenshot capture, PDF generation, and network traffic monitoring. Routing Puppeteer through dynamically configured proxies helps a scraper get past IP-based anti-scraping restrictions and improves the reliability and efficiency of data scraping workflows.

Configure Proxy and Launch Browser

Prepare a working proxy server (HTTP or SOCKS) that supports HTTPS traffic and whose IP addresses are reachable from your machine. The following code snippet initializes a Puppeteer browser instance with the proxy settings applied:

const puppeteer = require('puppeteer');
const axios = require('axios').default;
const fs = require('fs').promises;
const path = require('path');

(async () => {
  // Proxy configuration parameters
  const proxyDomain = "proxy.example.com";
  const proxyPort = "8080";
  const proxyUsername = "your_proxy_user";
  const proxyPassword = "your_proxy_pass";

  // Chromium does not accept credentials embedded in --proxy-server, so pass only the address here
  const proxyServer = `http://${proxyDomain}:${proxyPort}`;

  // Launch browser with proxy enabled
  const browser = await puppeteer.launch({
    headless: "new",
    args: [
      `--proxy-server=${proxyServer}`,
      '--no-sandbox',
      '--disable-setuid-sandbox'
    ]
  });

  const page = await browser.newPage();
  // Answer the proxy's authentication challenge with the configured credentials
  await page.authenticate({ username: proxyUsername, password: proxyPassword });
  // Set custom user agent to mimic real browser
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');
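
To make the proxy dynamic rather than fixed, one option is to pick a different proxy for each browser launch. The following is a minimal standalone sketch of a round-robin selector, not part of the script above. The pool entries and the names createProxyRotator and toLaunchArg are hypothetical, introduced only for illustration. Credentials are kept apart from the server address so they can be handed to page.authenticate() after launch.

```javascript
// Hypothetical pool: replace with proxies you actually control.
const proxyPool = [
  { host: 'proxy-a.example.com', port: 8080, username: 'user_a', password: 'pass_a' },
  { host: 'proxy-b.example.com', port: 8080, username: 'user_b', password: 'pass_b' },
];

// Returns a function that hands out proxies in round-robin order.
function createProxyRotator(pool) {
  if (!Array.isArray(pool) || pool.length === 0) {
    throw new Error('proxy pool is empty');
  }
  let next = 0;
  return () => {
    const proxy = pool[next];
    next = (next + 1) % pool.length; // wrap around at the end of the pool
    return proxy;
  };
}

// Builds the Chromium launch flag; credentials are deliberately left out.
function toLaunchArg(proxy) {
  return `--proxy-server=http://${proxy.host}:${proxy.port}`;
}

// Possible usage inside an async function:
//   const nextProxy = createProxyRotator(proxyPool);
//   const proxy = nextProxy();
//   const browser = await puppeteer.launch({ args: [toLaunchArg(proxy)] });
//   const page = await browser.newPage();
//   await page.authenticate({ username: proxy.username, password: proxy.password });
```

Because the proxy is fixed at launch time through a command-line flag, switching proxies in this sketch means closing the browser and launching a new one with the next entry.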

Navigate to Target Website

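Use the page.goto() method to load the target page. The networkidle2 wait condition resolves once there have been no more than two open network connections for at least 500 ms, which is usually enough for the main resources to finish loading: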

  const targetSite = "https://example.com/target-page";
  await page.goto(targetSite, { waitUntil: 'networkidle2', timeout: 30000 });
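
Navigation through a proxy can fail transiently (timeouts, dropped connections), so a retry wrapper may help. The following is a minimal standalone sketch, assuming exponential backoff is acceptable for the target site. The helper name withRetries and its options are hypothetical and are not part of Puppeteer.

```javascript
// Hypothetical helper: runs an async task, retrying with exponential backoff.
async function withRetries(task, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        // Wait 1x, 2x, 4x, ... the base delay before the next attempt
        const delay = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError; // every attempt failed
}

// Possible usage inside the scraping script:
//   await withRetries(
//     () => page.goto(targetSite, { waitUntil: 'networkidle2', timeout: 30000 }),
//     { attempts: 3, baseDelayMs: 1000 }
//   );
```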

Wait for Target Elements

Wait until at least one image element with a src attribute is present in the DOM before extracting source URLs:

  await page.waitForSelector('img[src]', { timeout: 15000 });

Extract Image Resource URLs

Execute browser context code to collect all valid image source attributes from the loaded page:

  const imageUrls = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('img[src]')).map(img => img.src.trim());
  });
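
The collected list can contain duplicates and inline data: or blob: sources that cannot be downloaded over HTTP. The following is a minimal standalone sketch of a clean-up step that could run on the list before downloading. The function name cleanImageUrls is hypothetical.

```javascript
// Hypothetical helper: keeps unique http(s) URLs, preserving their original order.
function cleanImageUrls(urls) {
  const seen = new Set();
  const result = [];
  for (const raw of urls) {
    let parsed;
    try {
      parsed = new URL(raw);
    } catch {
      continue; // skip strings that are not absolute URLs
    }
    // Drop data:, blob:, and any other non-HTTP schemes
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') continue;
    if (seen.has(parsed.href)) continue; // skip duplicates
    seen.add(parsed.href);
    result.push(parsed.href);
  }
  return result;
}

// Possible usage inside the scraping script:
//   const downloadable = cleanImageUrls(imageUrls);
```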

Download Scraped Images

Puppeteer does not include a native download method, so use Axios to fetch and save image files locally with proper error handling:

  const outputDir = path.join(__dirname, 'scraped-images');
  await fs.mkdir(outputDir, { recursive: true });

  const saveImage = async (imageUrl) => {
    try {
      const response = await axios({
        method: 'GET',
        url: imageUrl,
        responseType: 'stream',
        proxy: {
          host: proxyDomain,
          port: parseInt(proxyPort),
          auth: {
            username: proxyUsername,
            password: proxyPassword
          }
        }
      });

      const fileName = path.basename(new URL(imageUrl).pathname) || `image-${Date.now()}.jpg`;
      const savePath = path.join(outputDir, fileName);

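      // fs.promises has no createWriteStream, so take it from the core fs module
      const { createWriteStream } = require('fs');
      // Await the write so that stream errors reach the catch block below
      await new Promise((resolve, reject) => {
        const writer = createWriteStream(savePath);
        response.data.on('error', reject);
        writer.on('finish', resolve).on('error', reject);
        response.data.pipe(writer);
      });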
    } catch (err) {
      console.log(`Skipped ${imageUrl}: ${err.message}`);
    }
  };

  for (const url of imageUrls) {
    await saveImage(url);
  }

  await browser.close();
})();
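
Deriving the file name from the URL path alone means two images with the same base name in different directories overwrite each other, and percent-encoded names are saved as-is. The following is a minimal sketch of one way to avoid that, assuming an index prefix is acceptable in the output names. The function name fileNameFor and the image.jpg fallback are hypothetical.

```javascript
const path = require('path');

// Hypothetical helper: builds a collision-resistant, filesystem-safe file name.
function fileNameFor(imageUrl, index) {
  const { pathname } = new URL(imageUrl); // the query string is not part of pathname
  let base = path.posix.basename(pathname);
  try {
    base = decodeURIComponent(base); // turn "my%20photo.jpg" into "my photo.jpg"
  } catch {
    // malformed escape sequence: keep the raw name
  }
  // Replace anything outside a conservative character set
  base = base.replace(/[^A-Za-z0-9._-]/g, '_');
  if (base === '' || base === '.' || base === '..') {
    base = 'image.jpg'; // fallback when the URL path has no usable name
  }
  // A zero-padded index keeps names unique and sorted in scrape order
  return `${String(index).padStart(4, '0')}-${base}`;
}

// Possible usage inside the download loop:
//   const savePath = path.join(outputDir, fileNameFor(url, i));
```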

Tags: Puppeteer Dynamic Proxy


Copyright © fadingcoder.top
