Practical Dynamic Proxy Setup with Puppeteer for Improved Data Scraping Efficiency
Puppeteer is a Node.js library maintained by the Chrome DevTools team that provides programmatic control of headless or full Chrome/Chromium browsers. It exposes high-level APIs for common browser automation tasks, including page navigation, screenshot capture, PDF generation, and network traffic monitoring. Configuring proxies dynamically in Puppeteer helps bypass website anti-scraping restrictions and improves the reliability and efficiency of data scraping workflows.
Configure Proxy and Launch Browser
Prepare a valid proxy server (HTTP or SOCKS) that supports HTTPS traffic and has working, reachable IP addresses. The following snippet initializes a Puppeteer browser instance with the proxy configured:
const puppeteer = require('puppeteer');
const axios = require('axios').default;
const fs = require('fs'); // stream API (createWriteStream)
const fsp = fs.promises; // promise-based filesystem calls (mkdir)
const path = require('path');
(async () => {
// Proxy configuration parameters
const proxyDomain = "proxy.example.com";
const proxyPort = "8080";
const proxyUsername = "your_proxy_user";
const proxyPassword = "your_proxy_pass";
// Chromium ignores credentials embedded in the --proxy-server flag,
// so pass only host and port here and authenticate separately below
const proxyServerUrl = `http://${proxyDomain}:${proxyPort}`;
// Launch browser with proxy enabled
const browser = await puppeteer.launch({
headless: "new",
args: [
`--proxy-server=${proxyServerUrl}`,
'--no-sandbox',
'--disable-setuid-sandbox'
]
});
const page = await browser.newPage();
// Answer the proxy's 407 authentication challenge with the credentials
await page.authenticate({ username: proxyUsername, password: proxyPassword });
// Set custom user agent to mimic real browser
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');
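Before navigating to the target site, it is worth confirming that traffic actually exits through the proxy. A quick sanity check, assuming a public IP-echo endpoint such as httpbin.org/ip (any equivalent service works):
// Optional: verify the proxy is in effect by reading back the exit IP
await page.goto('https://httpbin.org/ip', { waitUntil: 'networkidle2', timeout: 30000 });
const exitIp = await page.evaluate(() => document.body.innerText);
console.log(`Proxy check, exit IP: ${exitIp}`);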
Navigate to Target Website
Use the page.goto() method to load the target page; the networkidle2 wait condition resolves once network activity has largely settled, a reasonable signal that the page's resources have finished loading:
const targetSite = "https://example.com/target-page";
const response = await page.goto(targetSite, { waitUntil: 'networkidle2', timeout: 30000 });
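A blocked or rate-limited proxy often surfaces as an error page with a non-2xx status rather than a thrown exception, so check the captured response before scraping:
// Abort early if the target served an error page (e.g. a 403 anti-bot wall)
if (!response || !response.ok()) {
throw new Error(`Navigation failed: HTTP ${response ? response.status() : 'no response'}`);
}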
Wait for Target Elements
Wait until at least one image element is present in the DOM before extracting source URLs:
await page.waitForSelector('img[src]', { timeout: 15000 });
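Note that page.waitForSelector() only guarantees the first match, and many sites lazy-load images below the fold as the user scrolls. A minimal auto-scroll sketch to force those images to populate their src (the 400 px step and 100 ms interval are assumptions to tune per site):
// Scroll to the bottom so lazy-loaded <img> elements receive a real src
await page.evaluate(async () => {
  await new Promise((resolve) => {
    let scrolled = 0;
    const step = 400;
    const timer = setInterval(() => {
      window.scrollBy(0, step);
      scrolled += step;
      if (scrolled >= document.body.scrollHeight) {
        clearInterval(timer);
        resolve();
      }
    }, 100);
  });
});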
Extract Image Resource URLs
Run code in the browser context to collect the unique, absolute image source URLs from the loaded page:
const imageUrls = await page.evaluate(() => {
// Keep only unique, absolute http(s) sources; skip data: URIs and blanks
const sources = Array.from(document.querySelectorAll('img[src]')).map(img => img.src.trim());
return Array.from(new Set(sources.filter(src => /^https?:\/\//.test(src))));
});
Download Scraped Images
Puppeteer does not include a native download method, so use Axios to fetch and save image files locally with proper error handling:
const outputDir = path.join(__dirname, 'scraped-images');
await fsp.mkdir(outputDir, { recursive: true });
const saveImage = async (imageUrl) => {
try {
const response = await axios({
method: 'GET',
url: imageUrl,
responseType: 'stream',
// Route the download through the same proxy the browser used
proxy: {
protocol: 'http',
host: proxyDomain,
port: parseInt(proxyPort, 10),
auth: {
username: proxyUsername,
password: proxyPassword
}
}
});
const fileName = path.basename(new URL(imageUrl).pathname) || `image-${Date.now()}.jpg`;
const savePath = path.join(outputDir, fileName);
return new Promise((resolve, reject) => {
const writer = fs.createWriteStream(savePath);
// Reject on errors from either the network stream or the file stream
response.data.on('error', reject);
writer.on('finish', resolve);
writer.on('error', reject);
response.data.pipe(writer);
});
} catch (err) {
console.log(`Skipped ${imageUrl}: ${err.message}`);
}
};
for (const url of imageUrls) {
await saveImage(url);
}
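Downloading images one at a time keeps pressure on the proxy low but is slow on image-heavy pages. As an alternative to the sequential loop above, downloads can be batched; the batch size of 5 is an assumption to tune against the proxy's rate limits:
// Alternative to the sequential loop: bounded concurrency in batches of 5.
// Promise.allSettled keeps one failed download from aborting a batch.
const BATCH_SIZE = 5;
for (let i = 0; i < imageUrls.length; i += BATCH_SIZE) {
  await Promise.allSettled(imageUrls.slice(i, i + BATCH_SIZE).map(saveImage));
}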
await browser.close();
})();
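The walkthrough above pins one proxy for the entire session. To make the setup genuinely dynamic, rotate through a pool of proxies across browser launches so that a ban on one exit IP does not stall the whole job. A minimal rotation sketch, with a hypothetical PROXY_POOL standing in for your provider's endpoints:
const puppeteer = require('puppeteer');
// Hypothetical pool; replace with real endpoints from your proxy provider
const PROXY_POOL = [
  { host: 'proxy1.example.com', port: 8080, username: 'user1', password: 'pass1' },
  { host: 'proxy2.example.com', port: 8080, username: 'user2', password: 'pass2' },
];
let cursor = 0;
const nextProxy = () => PROXY_POOL[cursor++ % PROXY_POOL.length];
// Launch a fresh browser through the next proxy in the pool
const launchWithProxy = async () => {
  const { host, port, username, password } = nextProxy();
  const browser = await puppeteer.launch({
    headless: "new",
    args: [`--proxy-server=http://${host}:${port}`, '--no-sandbox'],
  });
  const page = await browser.newPage();
  // Credentials go through page.authenticate(), not the launch flag
  await page.authenticate({ username, password });
  return { browser, page };
};
Each scraping task then calls launchWithProxy() instead of puppeteer.launch() directly and retries with the next proxy in the pool when a navigation fails.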