Fading Coder

One Final Commit for the Last Sprint


Building Web Crawlers with ThreadPoolExecutor and Scrapy: Hands-On Basics


ThreadPoolExecutor Implementation via as_completed and map

Here’s a ThreadPoolExecutor map-based approach with parameterized crawling and a per-request timeout:

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_game_page(url_id):
    base = "https://demo.gameportal.com/flash/"
    return requests.get(f"{base}{url_id}", timeout=5).status_code

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=8) as executor:
        page_ids = [f"page_{n}.html" for n in range(1, 6)]
        status_codes = executor.map(fetch_game_page, page_ids)
        for idx, code in enumerate(status_codes, 1):
            print(f"Page {idx} status: {code}")
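The section heading also mentions as_completed. Unlike map(), which returns results in submission order and raises on the first failed task, submit() plus as_completed() yields each future as it finishes and lets you handle errors per task. A minimal sketch, reusing the same hypothetical demo URLs:

```python
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_game_page(url_id):
    base = "https://demo.gameportal.com/flash/"
    return requests.get(f"{base}{url_id}", timeout=5).status_code

if __name__ == "__main__":
    page_ids = [f"page_{n}.html" for n in range(1, 6)]
    with ThreadPoolExecutor(max_workers=8) as executor:
        # Map each Future back to its page id so results can be labeled
        futures = {executor.submit(fetch_game_page, pid): pid for pid in page_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                print(f"{pid} status: {future.result()}")
            except requests.RequestException as exc:
                # One failed request no longer aborts the whole batch
                print(f"{pid} failed: {exc}")
```

Results arrive in completion order rather than submission order, which is usually what you want for I/O-bound crawling.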

Scrapy Crawler Setup and Pipeline Flow

Scrapy’s core architecture includes five components: Engine, Downloader, Scheduler, Spiders, Item Pipelines. The first three operate automatically, requiring no direct coding. The remaining Spiders and Pipelines are developer-defined.

Step-by-Step Scrapy Workflow

  1. Initialize a new project directory: scrapy startproject game_collector
  2. Navigate into the project: cd game_collector
  3. Generate a dedicated spider template: scrapy genspider portal_crawler demo.gameportal.com
  4. Adjust start_urls in the generated spider file if needed
  5. Implement data extraction in the parse() method
  6. Define Item Pipelines for data processing/storage
  7. Enable pipelines in settings.py
  8. Execute the crawler: scrapy crawl portal_crawler

Modified Spider Code with XPath and Yield

import scrapy

class PortalCrawlerSpider(scrapy.Spider):
    name = "portal_crawler"
    allowed_domains = ["demo.gameportal.com"]
    start_urls = ["https://demo.gameportal.com/flash/"]

    def parse(self, response):
        game_blocks = response.xpath("//ul[contains(@class, 'game-grid')]/li")
        for block in game_blocks:
            game_data = {
                "title": block.xpath("./div/a/h3/text()").extract_first(),
                "category": block.xpath("./div/span[1]/a/text()").extract_first(),
                "rating": block.xpath("./div/div[@class='score']/text()").extract_first(default="N/A")
            }
            yield game_data

Scrapy Data Extraction Helpers

  • response.xpath() / response.css(): Return SelectorList objects
  • extract(): Convert the entire SelectorList to a Python list of strings
  • extract_first(): Return the first matching string, or None (or a custom default) if the list is empty
  • In recent Scrapy versions, getall() and get() are the preferred aliases for extract() and extract_first()

Pipeline Configuration

Pipeline classes process items sequentially. They must be explicitly enabled in settings.py with numeric priorities (0–1000); pipelines with lower values run first.

# pipelines.py
class CleanDataPipeline:
    def process_item(self, item, spider):
        # Trim whitespace from all string values
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item

class CsvExportPipeline:
    def __init__(self):
        self.file = None
    
    def open_spider(self, spider):
        self.file = open("games.csv", "w", encoding="utf-8")
        self.file.write("title,category,rating\n")
    
    def close_spider(self, spider):
        if self.file:
            self.file.close()
    
    def process_item(self, item, spider):
        # Naive join: values that themselves contain commas are not escaped
        line = f"{item['title']},{item['category']},{item['rating']}\n"
        self.file.write(line)
        return item

# settings.py
ITEM_PIPELINES = {
    "game_collector.pipelines.CleanDataPipeline": 100,
    "game_collector.pipelines.CsvExportPipeline": 200
}
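The f-string write in CsvExportPipeline does not quote fields that contain commas. A hedged alternative that delegates quoting to the standard csv module (the class name CsvWriterPipeline is made up for this sketch) might look like:

```python
import csv

class CsvWriterPipeline:
    """Sketch of an export pipeline that lets csv.writer handle escaping."""

    def open_spider(self, spider):
        # newline="" prevents the csv module from emitting blank rows on Windows
        self.file = open("games.csv", "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["title", "category", "rating"])

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Fields containing commas or quotes are escaped automatically
        self.writer.writerow([item["title"], item["category"], item["rating"]])
        return item
```

It would be registered in ITEM_PIPELINES the same way as the pipelines above.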

Tags: Python

