Implementing Concurrent Web Scraping with Multiprocessing and Multithreading
Concurrency in Web Scraping
Web scraping can be significantly accelerated by leveraging concurrency through multiprocessing or multithreading. Understanding the distinction between processes and threads is essential for choosing the right approach.
A process represents the smallest unit of resource allocation in an operating system, while a thread is the smallest schedulable unit within a process. When a Python script runs, it spawns a single process containing at least one thread that executes the code.
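This relationship is easy to observe directly: a running script can report the process it belongs to and the thread currently executing it. A minimal sketch using the standard library:

```python
import os
import threading

# Every running script is one process containing at least a main thread.
pid = os.getpid()                              # id of the current process
thread_name = threading.current_thread().name  # the thread running this code

print(f"process id: {pid}")
print(f"thread name: {thread_name}")
```

Spawning additional processes or threads adds more schedulable units on top of this baseline.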
Spawning Processes
Using the multiprocessing module, new processes can be created to run tasks in parallel:
import time
from multiprocessing import Process

def background_task():
    print('Task started')
    time.sleep(3)
    print('Task completed')

if __name__ == '__main__':
    # The guard is required on platforms that spawn child processes
    # (e.g. Windows), where the child re-imports this module.
    worker = Process(target=background_task)
    worker.start()
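Because start() returns immediately, the parent may reach the end of the script before the child finishes; calling join() blocks until the child completes. A short sketch extending the example above to several workers:

```python
import time
from multiprocessing import Process

def background_task():
    # Stand-in for real work; a short sleep keeps the example fast.
    time.sleep(0.1)

if __name__ == '__main__':
    workers = [Process(target=background_task) for _ in range(3)]
    for w in workers:
        w.start()            # launch each child; does not block
    for w in workers:
        w.join()             # block until each child has finished
    print(all(w.exitcode == 0 for w in workers))
```

An exitcode of 0 on each worker confirms its target function returned without raising.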
Alternatively, custom process classes can encapsulate logic:
import time
from multiprocessing import Process

class CustomWorker(Process):
    def __init__(self, payload):
        super().__init__()
        self.payload = payload

    def run(self):
        # run() is executed in the child process when start() is called
        print('Custom task started')
        print(f'Processing: {self.payload}')
        time.sleep(3)
        print('Custom task finished')

if __name__ == '__main__':
    job = CustomWorker("sample_data")
    job.start()
Key Characteristics of Processes
- On POSIX systems using the fork start method, child processes inherit a copy of the parent's memory space at creation time; on platforms that spawn (e.g. Windows), the child re-imports the module instead.
- Each child process executes only its designated target function; other code runs in the parent.
- Processes operate independently with no guaranteed execution order.
- Threads within the same process share memory and resources.
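The memory-isolation point can be demonstrated concretely: mutations to an ordinary object inside a child process never reach the parent, while explicit shared memory such as multiprocessing.Value does. A minimal sketch (increment and the variable names are illustrative):

```python
from multiprocessing import Process, Value

def increment(counter, plain_list):
    # The list mutation stays inside the child's copy of memory;
    # only the explicitly shared Value is visible to the parent.
    plain_list.append(1)
    with counter.get_lock():
        counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)   # shared between processes
    data = []                 # ordinary object: not shared
    procs = [Process(target=increment, args=(counter, data)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)      # every child's increment is visible
    print(len(data))          # the parent's list was never touched
```

This is why data exchange between processes goes through dedicated mechanisms (Value, Array, Queue, Pipe) rather than shared variables.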
The Global Interpreter Lock (GIL)
The GIL is a mechanism specific to CPython (the standard Python implementation), not the Python language itself. It ensures that only one thread executes Python bytecode at a time within a single process. Consequently:
- To utilize multiple CPU cores effectively, use multiprocessing.
- For I/O-bound tasks where threads spend time waiting (e.g., network requests), multithreading remains efficient despite the GIL.
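The I/O-bound case is worth seeing in numbers: while one thread waits, others proceed, so total wall time approaches the longest single wait rather than the sum. A sketch using time.sleep as a stand-in for network latency (sleep releases the GIL, just as waiting on a socket does):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def simulated_request(n):
    # Stand-in for a network request: the thread blocks without holding the GIL
    time.sleep(0.2)
    return n * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(simulated_request, range(5)))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")  # ~0.2 s overlapped, not 1.0 s sequential
```

Run sequentially, the five 0.2-second waits would take about a second; overlapped in threads, they complete in roughly the time of one.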
Example: Concurrent Image Scraper for League of Legends Skins
The following scraper fetches champion skin images using a thread pool for concurrent downloads:
import os
import requests
from concurrent.futures import ThreadPoolExecutor

class LoLSkinScraper:
    def __init__(self):
        self.hero_list_endpoint = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js?ts=2795830'
        self.hero_detail_template = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js?ts=2795830'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)'
        }
        self.downloader_pool = ThreadPoolExecutor(max_workers=10)

    def fetch_hero_ids(self):
        # Fetch the hero roster, then walk each hero's detail endpoint
        response = requests.get(self.hero_list_endpoint, headers=self.headers)
        heroes = response.json()['hero']
        for hero in heroes:
            detail_url = self.hero_detail_template.format(hero['heroId'])
            hero_data = requests.get(detail_url, headers=self.headers).json()
            self.extract_skin_urls(hero_data)

    def extract_skin_urls(self, hero_data):
        for skin in hero_data['skins']:
            skin_name = skin['name']
            image_url = skin['mainImg']
            if image_url:  # some entries carry an empty mainImg; skip them
                self.downloader_pool.submit(self.download_image, skin_name, image_url)

    def download_image(self, name, url):
        # Runs in a pool thread, so downloads overlap with the main loop
        image_content = requests.get(url, headers=self.headers).content
        output_dir = 'lol_skins'
        os.makedirs(output_dir, exist_ok=True)
        safe_filename = f"{output_dir}/{name.replace('/', '_')}.jpg"
        with open(safe_filename, 'wb') as file:
            file.write(image_content)
        print(f"Downloaded: {name}")

    def execute(self):
        self.fetch_hero_ids()
        # Wait for all queued downloads to finish before exiting
        self.downloader_pool.shutdown(wait=True)

if __name__ == '__main__':
    scraper = LoLSkinScraper()
    scraper.execute()
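One detail worth highlighting in this design: submit() returns a Future, and collecting those futures lets the caller observe each download's completion or failure instead of relying on fire-and-forget prints. A hedged sketch of that pattern with a stand-in task (fake_download is hypothetical; it replaces download_image so the example needs no network):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(name):
    # Stand-in for a real download: returns a status instead of writing a file
    return f"saved {name}"

pool = ThreadPoolExecutor(max_workers=10)
futures = [pool.submit(fake_download, n) for n in ('Ahri', 'Garen', 'Jinx')]

# as_completed yields each future as it finishes; calling result() here
# re-raises any exception the worker thread swallowed
results = [f.result() for f in as_completed(futures)]
pool.shutdown(wait=True)  # block until all worker threads are done
print(sorted(results))
```

Applied to the scraper, the same pattern would surface failed image downloads (timeouts, HTTP errors) that otherwise disappear silently inside the pool.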