Building Web Crawlers with ThreadPoolExecutor and Scrapy: Hands-On Basics
ThreadPoolExecutor Implementation via as_completed and map
Here’s a modified ThreadPoolExecutor approach using map(), with parameterized crawling and a basic error-handling structure:
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_game_page(url_id):
    base = "https://demo.gameportal.com/flash/"
    try:
        # timeout keeps a slow page from blocking a worker indefinitely
        return requests.get(f"{base}{url_id}", timeout=5).status_code
    except requests.RequestException:
        return None  # connection error or timeout

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=8) as executor:
        page_ids = [f"page_{n}.html" for n in range(1, 6)]
        # map() yields results lazily, in the same order as page_ids
        status_codes = executor.map(fetch_game_page, page_ids)
        for idx, code in enumerate(status_codes, 1):
            print(f"Page {idx} status: {code}")
Scrapy Crawler Setup and Pipeline Flow
Scrapy’s core architecture comprises five components: the Engine, Scheduler, Downloader, Spiders, and Item Pipelines. The first three work out of the box: the Engine pulls requests from the Scheduler, dispatches them to the Downloader, and routes the responses to your spiders. Only the last two, Spiders and Item Pipelines, are developer-defined.
Step-by-Step Scrapy Workflow
- Initialize a new project directory: scrapy startproject game_collector
- Navigate into the project: cd game_collector
- Generate a dedicated spider template: scrapy genspider portal_crawler demo.gameportal.com
- Adjust start_urls in the generated spider file if needed
- Implement data extraction in the parse() method
- Define Item Pipelines for data processing/storage
- Enable pipelines in settings.py
- Execute the crawler: scrapy crawl portal_crawler
Modified Spider Code with XPath and Yield
import scrapy

class PortalCrawlerSpider(scrapy.Spider):
    name = "portal_crawler"
    allowed_domains = ["demo.gameportal.com"]
    start_urls = ["https://demo.gameportal.com/flash/"]

    def parse(self, response):
        # each <li> in the game grid holds one game's markup
        game_blocks = response.xpath("//ul[contains(@class, 'game-grid')]/li")
        for block in game_blocks:
            game_data = {
                "title": block.xpath("./div/a/h3/text()").extract_first(),
                "category": block.xpath("./div/span[1]/a/text()").extract_first(),
                "rating": block.xpath("./div/div[@class='score']/text()").extract_first(default="N/A"),
            }
            # yielding a dict hands the item to the enabled pipelines
            yield game_data
Scrapy Data Extraction Helpers
- response.xpath() / response.css(): return SelectorList objects
- extract(): converts the entire SelectorList to a Python list of strings
- extract_first(): returns the first matching string, or None / a custom default if the list is empty
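To see the difference interactively, the hypothetical calls below (run in scrapy shell against the listing page above; the example return values are illustrative, not real data) show what each helper yields:
# inside: scrapy shell https://demo.gameportal.com/flash/
titles = response.xpath("//ul[contains(@class, 'game-grid')]/li/div/a/h3/text()")
type(titles)            # SelectorList
titles.extract()        # every match as a string, e.g. ['Game A', 'Game B']
titles.extract_first()  # first match only, e.g. 'Game A'
# a selector that matches nothing falls back to the default
response.xpath("//em[@class='nope']/text()").extract_first(default="N/A")  # 'N/A'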
Pipeline Configuration
Pipeline classes process each yielded item in sequence. They must be explicitly enabled in settings.py with numeric priorities (conventionally 0–1000); lower values run earlier.
# pipelines.py
class CleanDataPipeline:
    def process_item(self, item, spider):
        # Trim whitespace from all string values
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item

class CsvExportPipeline:
    def __init__(self):
        self.file = None

    def open_spider(self, spider):
        self.file = open("games.csv", "w", encoding="utf-8")
        self.file.write("title,category,rating\n")

    def close_spider(self, spider):
        if self.file:
            self.file.close()

    def process_item(self, item, spider):
        # note: naive join; a title containing a comma would break the row
        line = f"{item['title']},{item['category']},{item['rating']}\n"
        self.file.write(line)
        return item
# settings.py
ITEM_PIPELINES = {
    "game_collector.pipelines.CleanDataPipeline": 100,
    "game_collector.pipelines.CsvExportPipeline": 200,
}
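As a side note, for simple CSV output Scrapy's built-in feed exports can replace CsvExportPipeline entirely: running scrapy crawl portal_crawler -o games.csv writes all yielded items to games.csv with proper quoting, no custom pipeline required. A hand-written pipeline like the one above is mainly worth it when you need custom formatting or per-item side effects.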