Building Web Crawlers with ThreadPoolExecutor and Scrapy: Hands-On Basics
ThreadPoolExecutor Implementation via as_completed and map
Here’s a modified ThreadPoolExecutor approach using map(), with parameterized crawling and a basic error-handling structure:
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_game_page(url_id):
    base = "https://demo.gameportal.com/flash/"
    try:
        # timeout keeps a slow page from blocking a worker indefinitely
        return requests.get(f"{base}{url_id}", timeout=5).status_code
    except requests.RequestException:
        return None  # connection error or timeout

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=8) as executor:
        page_ids = [f"page_{n}.html" for n in range(1, 6)]
        # map() yields results lazily, in the same order as page_ids
        status_codes = executor.map(fetch_game_page, page_ids)
        for idx, code in enumerate(status_codes, 1):
            print(f"Page {idx} status: {code}")
Scrapy Crawler Setup and Pipeline Flow
Scrapy’s core architecture comprises five components: the Engine, Scheduler, Downloader, Spiders, and Item Pipelines. The first three work out of the box: the Engine pulls requests from the Scheduler, dispatches them to the Downloader, and routes the responses to your spiders. Only the last two, Spiders and Item Pipelines, are developer-defined.
Step-by-Step Scrapy Workflow
- Initialize a new project directory: scrapy startproject game_collector
- Navigate into the project: cd game_collector
- Generate a dedicated spider template: scrapy genspider portal_crawler demo.gameportal.com
- Adjust start_urls in the generated spider file if needed
- Implement data extraction in the parse() method
- Define Item Pipelines for data processing/storage
- Enable pipelines in settings.py
- Execute the crawler: scrapy crawl portal_crawler
Modified Spider Code with XPath and Yield
import scrapy

class PortalCrawlerSpider(scrapy.Spider):
    name = "portal_crawler"
    allowed_domains = ["demo.gameportal.com"]
    start_urls = ["https://demo.gameportal.com/flash/"]

    def parse(self, response):
        # each <li> in the game grid holds one game's markup
        game_blocks = response.xpath("//ul[contains(@class, 'game-grid')]/li")
        for block in game_blocks:
            game_data = {
                "title": block.xpath("./div/a/h3/text()").extract_first(),
                "category": block.xpath("./div/span[1]/a/text()").extract_first(),
                "rating": block.xpath("./div/div[@class='score']/text()").extract_first(default="N/A"),
            }
            # yielding a dict hands the item to the enabled pipelines
            yield game_data
Scrapy Data Extraction Helpers
- response.xpath() / response.css(): return SelectorList objects
- extract(): converts the entire SelectorList to a Python list of strings
- extract_first(): returns the first matching string, or None / a custom default if the list is empty
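To see the difference interactively, the hypothetical calls below (run in scrapy shell against the listing page above; the example return values are illustrative, not real data) show what each helper yields:
# inside: scrapy shell https://demo.gameportal.com/flash/
titles = response.xpath("//ul[contains(@class, 'game-grid')]/li/div/a/h3/text()")
type(titles)            # SelectorList
titles.extract()        # every match as a string, e.g. ['Game A', 'Game B']
titles.extract_first()  # first match only, e.g. 'Game A'
# a selector that matches nothing falls back to the default
response.xpath("//em[@class='nope']/text()").extract_first(default="N/A")  # 'N/A'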
Pipeline Configuration
Pipeline classes process each yielded item in sequence. They must be explicitly enabled in settings.py with numeric priorities (conventionally 0–1000); lower values run earlier.
# pipelines.py
class CleanDataPipeline:
    def process_item(self, item, spider):
        # Trim whitespace from all string values
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item

class CsvExportPipeline:
    def __init__(self):
        self.file = None

    def open_spider(self, spider):
        self.file = open("games.csv", "w", encoding="utf-8")
        self.file.write("title,category,rating\n")

    def close_spider(self, spider):
        if self.file:
            self.file.close()

    def process_item(self, item, spider):
        # note: naive join; a title containing a comma would break the row
        line = f"{item['title']},{item['category']},{item['rating']}\n"
        self.file.write(line)
        return item
# settings.py
ITEM_PIPELINES = {
    "game_collector.pipelines.CleanDataPipeline": 100,
    "game_collector.pipelines.CsvExportPipeline": 200,
}
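As a side note, for simple CSV output Scrapy's built-in feed exports can replace CsvExportPipeline entirely: running scrapy crawl portal_crawler -o games.csv writes all yielded items to games.csv with proper quoting, no custom pipeline required. A hand-written pipeline like the one above is mainly worth it when you need custom formatting or per-item side effects.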