Fading Coder

Configuring Chrome Browser Options with Selenium and Python

Background

When Selenium drives a browser to render pages for scraping, it launches a clean Chrome instance by default. In everyday browsing, however, we often rely on extensions, proxies, or other customizations, so a scraper running on Chrome may need specific configuration to behave as intended.

Common configurations include:

  • Disabling image and video loading to speed up page loading.
  • Adding a proxy to access certain pages or bypass IP-based anti-scraping measures.
  • Using mobile user agents to access mobile sites, which often have weaker anti-scraping defenses.
  • Adding extensions to replicate normal browser functionality.
  • Setting encoding to prevent garbled text on Chinese sites.
  • Disabling JavaScript execution.
  • And more.

Environment

  • Python 3.6.1
  • OS: Windows 7
  • IDE: PyCharm
  • Chrome browser installed
  • ChromeDriver configured
  • Selenium 3.7.0

ChromeOptions

ChromeOptions is the class used to configure Chrome's startup properties (webdriver.ChromeOptions is an alias for the Options class shown below). Through this class, we can set the following parameters (as seen in the Selenium source code):

  • Set Chrome binary location (binary_location)
  • Add startup arguments (add_argument)
  • Add extensions (add_extension, add_encoded_extension)
  • Add experimental options (add_experimental_option)
  • Set debugger address (debugger_address)

Source code snippet:

# .\Lib\site-packages\selenium\webdriver\chrome\options.py
class Options(object):
    def __init__(self):
        self._binary_location = ''
        self._arguments = []
        self._extension_files = []
        self._extensions = []
        self._experimental_options = {}
        self._debugger_address = None

Usage example:

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('lang=zh_CN.UTF-8')
driver = webdriver.Chrome(chrome_options=options)
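
The parameter list above also mentions binary_location and debugger_address, which the usage example does not exercise. The following is a minimal sketch of both, assuming Chrome was started separately with --remote-debugging-port=9222. The helper names (format_debugger_address, attach_options, custom_binary_options) and the default host and port are illustrative and not part of Selenium.

```python
def format_debugger_address(host, port):
    """Return the "host:port" string that Options.debugger_address expects."""
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError("port out of range: %d" % port)
    return "%s:%d" % (host, port)


def attach_options(host="127.0.0.1", port=9222):
    """Options for attaching to a Chrome instance that is already running.

    Assumes that Chrome was launched with --remote-debugging-port=<port>.
    """
    # Imported here so format_debugger_address() works without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.debugger_address = format_debugger_address(host, port)
    return options


def custom_binary_options(chrome_path):
    """Options that point ChromeDriver at a specific Chrome executable."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.binary_location = chrome_path
    return options
```

Either result would be passed to webdriver.Chrome(chrome_options=...) in the same way as the usage example. On Selenium 3.8 and later the keyword is options= instead.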

Common Configurations

1. Set Encoding

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('lang=zh_CN.UTF-8')
driver = webdriver.Chrome(chrome_options=options)

2. Simulate Mobile Device

Mobile device user-agent list: http://www.fynas.com/ua

# Simulate Android QQ browser
options.add_argument('user-agent="MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"')

# Simulate iPhone 6
options.add_argument('user-agent="Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1"')
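
Changing only the user agent leaves the viewport at desktop size. ChromeDriver also accepts a mobileEmulation experimental option that sets screen metrics and the user agent together. The sketch below assumes iPhone 6-like metrics (375x667 at pixel ratio 2.0). The helper names and the metric values are illustrative, so adjust them for the device being simulated.

```python
# User agent copied from the iPhone 6 example above.
IPHONE6_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) "
    "AppleWebKit/601.1.46 (KHTML, like Gecko) "
    "Version/9.0 Mobile/13B143 Safari/601.1"
)


def mobile_emulation(width, height, pixel_ratio, user_agent):
    """Build the dictionary passed as the mobileEmulation option."""
    return {
        "deviceMetrics": {
            "width": width,
            "height": height,
            "pixelRatio": pixel_ratio,
        },
        "userAgent": user_agent,
    }


def mobile_options(emulation):
    """Return ChromeOptions with the given mobile emulation applied."""
    # Imported here so mobile_emulation() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("mobileEmulation", emulation)
    return options


# Assumed iPhone 6-like screen metrics.
IPHONE6 = mobile_emulation(375, 667, 2.0, IPHONE6_UA)
```

Calling mobile_options(IPHONE6) yields an options object that can be handed to webdriver.Chrome like any other.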

3. Disable Image Loading

Disabling images can improve page load speed.

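from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

chrome_options = webdriver.ChromeOptions()
# Content-setting value 2 blocks images
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

driver = webdriver.Chrome(chrome_options=chrome_options)
# configure is assumed to be a project settings module defined elsewhere;
# set_window_size takes (width, height), in that order
driver.set_window_size(configure.windowWidth, configure.windowHeight)
wait = WebDriverWait(driver, timeout=configure.timeoutMain)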
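
The same prefs dictionary can carry more than one content setting. As a sketch, the helper below builds it from two switches. The images key is the one used above, while the javascript key is an assumption that follows the same naming pattern and should be verified against the Chrome version in use. content_prefs and prefs_options are hypothetical names.

```python
BLOCK = 2  # content-setting value used above to block a resource type


def content_prefs(block_images=True, block_javascript=False):
    """Build a prefs dictionary for add_experimental_option("prefs", ...)."""
    prefs = {}
    if block_images:
        prefs["profile.managed_default_content_settings.images"] = BLOCK
    if block_javascript:
        # Assumed key, mirroring the images setting; verify before relying on it.
        prefs["profile.managed_default_content_settings.javascript"] = BLOCK
    return prefs


def prefs_options(prefs):
    """Return ChromeOptions carrying the given prefs, if any."""
    # Imported here so content_prefs() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if prefs:
        options.add_experimental_option("prefs", prefs)
    return options
```

Keeping the dictionary construction separate from the Selenium call makes it easy to check which settings will be sent before a browser is launched.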

4. Add Proxy

When adding a proxy, prefer static IPs for better stability. Dynamic proxies may have short lifetimes (1-3 minutes).

from selenium import webdriver

PROXY = "proxy_host:proxy_port"
options = webdriver.ChromeOptions()
desired_capabilities = options.to_capabilities()
desired_capabilities['proxy'] = {
    "httpProxy": PROXY,
    "ftpProxy": PROXY,
    "sslProxy": PROXY,
    "noProxy": None,
    "proxyType": "MANUAL",
    "class": "org.openqa.selenium.Proxy",
    "autodetect": False
}
driver = webdriver.Chrome(desired_capabilities=desired_capabilities)
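
Chrome also accepts a proxy directly as a startup argument, --proxy-server, which avoids editing the capabilities dictionary. A minimal sketch, assuming an unauthenticated proxy. proxy_argument and proxy_options are illustrative names, and the host and port are placeholders.

```python
def proxy_argument(host, port, scheme="http"):
    """Format Chrome's --proxy-server switch for a single proxy endpoint."""
    return "--proxy-server=%s://%s:%d" % (scheme, host, int(port))


def proxy_options(host, port, scheme="http"):
    """Return ChromeOptions that route traffic through the given proxy."""
    # Imported here so proxy_argument() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(proxy_argument(host, port, scheme))
    return options
```

With short-lived dynamic proxies, building a fresh options object per proxy and starting a new driver for each is usually simpler than trying to change the proxy of a running browser.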

5. Browser Settings

Selenium typically launches a clean browser without extensions. To modify settings such as Flash permissions, or to clear cookies, one approach is to navigate to chrome://settings/content and automate the configuration there.

6. Add Extensions

To load extensions, download the .crx file and use add_extension.

Example: Loading XPath Helper

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()

# Set extension path
extension_path = 'D:/extension/XPath-Helper_v2.0.2.crx'
chrome_options.add_extension(extension_path)

driver = webdriver.Chrome(chrome_options=chrome_options)

Note:

  • Minimize the number of extensions for better performance.
  • Attempting to load all Chrome configurations via user-data-dir may cause crashes and is not recommended.
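
The add_encoded_extension method listed earlier takes the extension as Base64 text rather than a file path, which can help when the .crx bytes come from somewhere other than the local disk. A minimal sketch under that assumption; encode_extension and options_with_encoded_extension are hypothetical helper names.

```python
import base64


def encode_extension(crx_bytes):
    """Return the Base64 text form that add_encoded_extension expects."""
    return base64.b64encode(crx_bytes).decode("ascii")


def options_with_encoded_extension(crx_path):
    """Read a .crx file and attach it to new ChromeOptions in encoded form."""
    # Imported here so encode_extension() works without Selenium installed.
    from selenium import webdriver

    with open(crx_path, "rb") as handle:
        encoded = encode_extension(handle.read())

    options = webdriver.ChromeOptions()
    options.add_encoded_extension(encoded)
    return options
```

For a file that already exists on disk, add_extension as shown above remains the shorter route.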

Additional Parameters

Chrome URL Commands

Enter these in the address bar:

  • about:version - Show version
  • about:memory - Memory usage
  • about:plugins - Installed plugins
  • about:histograms - Internal metrics histograms
  • about:dns - DNS status
  • about:cache - Cached pages
  • about:gpu - GPU hardware acceleration
  • about:flags - Experimental features
  • chrome://extensions/ - Extensions list

Useful Command-Line Arguments

These can be passed via add_argument:

  • --user-data-dir=[PATH] - Specify user data directory
  • --disk-cache-dir=[PATH] - Cache directory
  • --disk-cache-size=N - Cache size in bytes
  • --first-run - Reset to initial state
  • --incognito - Incognito mode
  • --disable-javascript - Disable JavaScript
  • --user-agent="..." - Custom user agent
  • --disable-plugins - Disable all plugins
  • --start-maximized - Start maximized
  • --no-sandbox - Disable sandbox (use with caution)
  • --single-process - Single process
  • --disable-popup-blocking - Disable popup blocker
  • --disable-images - Disable images
  • --lang=zh-CN - Set language
  • --proxy-pac-url=URL - Use PAC proxy
  • --enable-sync - Enable bookmark sync
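
Several of these switches are usually combined. The sketch below formats (name, value) pairs into switches and applies them in one pass. chrome_flag, options_from_flags, and EXAMPLE_FLAGS are illustrative names, and the chosen flags and cache size are arbitrary examples taken from the list above.

```python
def chrome_flag(name, value=None):
    """Format one switch as "--name" or "--name=value"."""
    name = name.lstrip("-")
    if value is None:
        return "--" + name
    return "--%s=%s" % (name, value)


def options_from_flags(flags):
    """Return ChromeOptions with every (name, value) pair applied."""
    # Imported here so chrome_flag() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for name, value in flags:
        options.add_argument(chrome_flag(name, value))
    return options


# Example selection; a value of None means the switch takes no value.
EXAMPLE_FLAGS = [
    ("incognito", None),
    ("start-maximized", None),
    ("lang", "zh-CN"),
    ("disk-cache-size", 52428800),  # 50 MiB expressed in bytes
]
```

Whether a given switch still has an effect depends on the Chrome version, so it is worth confirming the result in the browser (for example via about:version, which shows the command line in use).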

Source: Adapted from CSDN blog
