Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

Implementing Concurrent Web Scraping with Multiprocessing and Multithreading

Tech May 16 1

Concurrency in Web Scraping

Web scraping can be significant accelerated by leveraging concurrency through multiprocessing or multithreading. Understanding the distinction between processes and threads is essential for choosing the right approach.

A process represents the smallest unit of resource allocation in an operating system, while a thread is the smallest schedulable unit within a process. When a Python script runs, it spawns a single process containing at least one thread that executes the code.

Spawning Processes

Using the multiprocessing module, new processes can be created to run tasks in paralel:

import time
from multiprocessing import Process

def background_task():
    print('Task started')
    time.sleep(3)
    print('Task completed')

if __name__ == '__main__':
    worker = Process(target=background_task)
    worker.start()

Alternatively, custom process classes can encapsulate logic:

import time
from multiprocessing import Process

class CustomWorker(Process):
    def __init__(self, payload):
        super().__init__()
        self.payload = payload

    def run(self):
        print('Custom task started')
        print(f'Processing: {self.payload}')
        time.sleep(3)
        print('Custom task finished')

if __name__ == '__main__':
    job = CustomWorker("sample_data")
    job.start()

Key Characteristics of Processes

  • Child processes are forked from the parent and inherit its memory space at creation time.
  • Each child process executes only its designated target function; other code runs in the parent.
  • Processes operate independently with no guaranteed execution order.
  • Threads within the same process share memory and resources.

The Global Interpreter Lock (GIL)

The GIL is a mechanism specific to CPython (the standard Python implementation), not the Python language itself. It ensures that only one thread executes Python bytecode at a time within a single process. Consequently:

  • To utilize multiple CPU cores effectively, use multiprocessing.
  • For I/O-bound tasks where threads spend time waiting (e.g., network requests), multithreading remains efficient despite the GIL.

Example: Concurrent Image Scraper for League of Legends Skins

The following scraper fetches hero skin images using a thread pool for concurrent downloads:

import os
import requests
from concurrent.futures import ThreadPoolExecutor

class LoLSkinScraper:
    def __init__(self):
        self.hero_list_endpoint = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js?ts=2795830'
        self.hero_detail_template = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js?ts=2795830'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)'
        }
        self.downloader_pool = ThreadPoolExecutor(max_workers=10)

    def fetch_hero_ids(self):
        response = requests.get(self.hero_list_endpoint, headers=self.headers)
        heroes = response.json()['hero']
        for hero in heroes:
            detail_url = self.hero_detail_template.format(hero['heroId'])
            hero_data = requests.get(detail_url, headers=self.headers).json()
            self.extract_skin_urls(hero_data)

    def extract_skin_urls(self, hero_data):
        for skin in hero_data['skins']:
            skin_name = skin['name']
            image_url = skin['mainImg']
            if image_url:
                self.downloader_pool.submit(self.download_image, skin_name, image_url)

    def download_image(self, name, url):
        image_content = requests.get(url, headers=self.headers).content
        output_dir = 'lol_skins'
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        safe_filename = f"{output_dir}/{name.replace('/', '_')}.jpg"
        with open(safe_filename, 'wb') as file:
            file.write(image_content)
        print(f"Downloaded: {name}")

    def execute(self):
        self.fetch_hero_ids()

if __name__ == '__main__':
    scraper = LoLSkinScraper()
    scraper.execute()

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Introduction Server-Side Template Injection (SSTI) is a vulnerability in web applications where user input is improper handled within the template engine and executed on the server. This exploit can r...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.