Implementing Concurrent Web Scraping with Multiprocessing and Multithreading
Concurrency in Web Scraping
Web scraping can be significantly accelerated by leveraging concurrency through multiprocessing or multithreading. Understanding the distinction between processes and threads is essential for choosing the right approach.
A process represents the smallest unit of resource allocation in an operating system, while a thread is the smallest schedulable unit within a process. When a Python script runs, it spawns a single process containing at least one thread that executes the code.
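This relationship is easy to observe directly: a running script can report the process it belongs to and the thread currently executing it. A minimal sketch using the standard library:

```python
import os
import threading

# Every running script is one process containing at least a main thread.
pid = os.getpid()                              # id of the current process
thread_name = threading.current_thread().name  # the thread running this code

print(f"process id: {pid}")
print(f"thread name: {thread_name}")
```

Spawning additional processes or threads adds more schedulable units on top of this baseline.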
Spawning Processes
Using the multiprocessing module, new processes can be created to run tasks in parallel:
import time
from multiprocessing import Process

def background_task():
    print('Task started')
    time.sleep(3)
    print('Task completed')

if __name__ == '__main__':
    # The guard is required on platforms that spawn child processes
    # (e.g. Windows), where the child re-imports this module.
    worker = Process(target=background_task)
    worker.start()
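Because start() returns immediately, the parent may reach the end of the script before the child finishes; calling join() blocks until the child completes. A short sketch extending the example above to several workers:

```python
import time
from multiprocessing import Process

def background_task():
    # Stand-in for real work; a short sleep keeps the example fast.
    time.sleep(0.1)

if __name__ == '__main__':
    workers = [Process(target=background_task) for _ in range(3)]
    for w in workers:
        w.start()            # launch each child; does not block
    for w in workers:
        w.join()             # block until each child has finished
    print(all(w.exitcode == 0 for w in workers))
```

An exitcode of 0 on each worker confirms its target function returned without raising.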
Alternatively, custom process classes can encapsulate logic:
import time
from multiprocessing import Process

class CustomWorker(Process):
    def __init__(self, payload):
        super().__init__()
        self.payload = payload

    def run(self):
        # run() is executed in the child process when start() is called
        print('Custom task started')
        print(f'Processing: {self.payload}')
        time.sleep(3)
        print('Custom task finished')

if __name__ == '__main__':
    job = CustomWorker("sample_data")
    job.start()
Key Characteristics of Processes
- On POSIX systems using the fork start method, child processes inherit a copy of the parent's memory space at creation time; on platforms that spawn (e.g. Windows), the child re-imports the module instead.
- Each child process executes only its designated target function; other code runs in the parent.
- Processes operate independently with no guaranteed execution order.
- Threads within the same process share memory and resources.
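The memory-isolation point can be demonstrated concretely: mutations to an ordinary object inside a child process never reach the parent, while explicit shared memory such as multiprocessing.Value does. A minimal sketch (increment and the variable names are illustrative):

```python
from multiprocessing import Process, Value

def increment(counter, plain_list):
    # The list mutation stays inside the child's copy of memory;
    # only the explicitly shared Value is visible to the parent.
    plain_list.append(1)
    with counter.get_lock():
        counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)   # shared between processes
    data = []                 # ordinary object: not shared
    procs = [Process(target=increment, args=(counter, data)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)      # every child's increment is visible
    print(len(data))          # the parent's list was never touched
```

This is why data exchange between processes goes through dedicated mechanisms (Value, Array, Queue, Pipe) rather than shared variables.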
The Global Interpreter Lock (GIL)
The GIL is a mechanism specific to CPython (the standard Python implementation), not the Python language itself. It ensures that only one thread executes Python bytecode at a time within a single process. Consequently:
- To utilize multiple CPU cores effectively, use multiprocessing.
- For I/O-bound tasks where threads spend time waiting (e.g., network requests), multithreading remains efficient despite the GIL.
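The I/O-bound case is worth seeing in numbers: while one thread waits, others proceed, so total wall time approaches the longest single wait rather than the sum. A sketch using time.sleep as a stand-in for network latency (sleep releases the GIL, just as waiting on a socket does):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def simulated_request(n):
    # Stand-in for a network request: the thread blocks without holding the GIL
    time.sleep(0.2)
    return n * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(simulated_request, range(5)))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")  # ~0.2 s overlapped, not 1.0 s sequential
```

Run sequentially, the five 0.2-second waits would take about a second; overlapped in threads, they complete in roughly the time of one.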
Example: Concurrent Image Scraper for League of Legends Skins
The following scraper fetches champion skin images using a thread pool for concurrent downloads:
import os
import requests
from concurrent.futures import ThreadPoolExecutor

class LoLSkinScraper:
    def __init__(self):
        self.hero_list_endpoint = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js?ts=2795830'
        self.hero_detail_template = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js?ts=2795830'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)'
        }
        self.downloader_pool = ThreadPoolExecutor(max_workers=10)

    def fetch_hero_ids(self):
        # Fetch the hero roster, then walk each hero's detail endpoint
        response = requests.get(self.hero_list_endpoint, headers=self.headers)
        heroes = response.json()['hero']
        for hero in heroes:
            detail_url = self.hero_detail_template.format(hero['heroId'])
            hero_data = requests.get(detail_url, headers=self.headers).json()
            self.extract_skin_urls(hero_data)

    def extract_skin_urls(self, hero_data):
        for skin in hero_data['skins']:
            skin_name = skin['name']
            image_url = skin['mainImg']
            if image_url:  # some entries carry an empty mainImg; skip them
                self.downloader_pool.submit(self.download_image, skin_name, image_url)

    def download_image(self, name, url):
        # Runs in a pool thread, so downloads overlap with the main loop
        image_content = requests.get(url, headers=self.headers).content
        output_dir = 'lol_skins'
        os.makedirs(output_dir, exist_ok=True)
        safe_filename = f"{output_dir}/{name.replace('/', '_')}.jpg"
        with open(safe_filename, 'wb') as file:
            file.write(image_content)
        print(f"Downloaded: {name}")

    def execute(self):
        self.fetch_hero_ids()
        # Wait for all queued downloads to finish before exiting
        self.downloader_pool.shutdown(wait=True)

if __name__ == '__main__':
    scraper = LoLSkinScraper()
    scraper.execute()
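One detail worth highlighting in this design: submit() returns a Future, and collecting those futures lets the caller observe each download's completion or failure instead of relying on fire-and-forget prints. A hedged sketch of that pattern with a stand-in task (fake_download is hypothetical; it replaces download_image so the example needs no network):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(name):
    # Stand-in for a real download: returns a status instead of writing a file
    return f"saved {name}"

pool = ThreadPoolExecutor(max_workers=10)
futures = [pool.submit(fake_download, n) for n in ('Ahri', 'Garen', 'Jinx')]

# as_completed yields each future as it finishes; calling result() here
# re-raises any exception the worker thread swallowed
results = [f.result() for f in as_completed(futures)]
pool.shutdown(wait=True)  # block until all worker threads are done
print(sorted(results))
```

Applied to the scraper, the same pattern would surface failed image downloads (timeouts, HTTP errors) that otherwise disappear silently inside the pool.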