Scrapy Framework Fundamentals and Advanced Usage

Project Initialization and Execution

To create a new Scrapy project:

scrapy startproject project_name
cd project_name
scrapy genspider spider_name domain.com

Run the spider with:

scrapy crawl spider_name
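
Alternatively, the spider can be launched from a Python script with Scrapy's CrawlerProcess; a minimal sketch, assuming it is run from the project root so settings.py is picked up:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py and start the spider by name
process = CrawlerProcess(get_project_settings())
process.crawl('spider_name')
process.start()  # blocks until the crawl finishes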

If dependency errors occur, install compatible versions:

pip install Twisted==22.10.0 urllib3==1.26.18 parsel==1.7.0

Disable robots.txt compliance in settings.py:

ROBOTSTXT_OBEY = False
LOG_LEVEL = 'WARNING'  # Reduce log verbosity

Core Components and Execution Flow

Scrapy’s architecture involves coordinated interaction between components:

  1. The spider provides start_urls, converted into Request objects.
  2. Requests pass through spider middleware.
  3. The engine receives requests and forwards them to the scheduler.
  4. The scheduler queues requests and returns them to the engine.
  5. The engine sends requests to downloader middleware (for headers, proxies, retries).
  6. The downloader executes asynchronous HTTP(S) requests and returns Response objects.
  7. Responses traverse downloader middleware, then spider middleware.
  8. The spider’s parse() method processes responses (e.g., using XPath).
  9. Extracted items are sent to pipelines for storage (CSV, MongoDB, etc.).

Both downloader and spider middleware can modify requests and responses; which one to use depends on the processing stage you need to hook into, as sketched below.
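
A minimal sketch of the two hook points (class names are placeholders): a downloader middleware sees every request before the download and every response after it, while a spider middleware wraps the spider's callbacks.

class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called before the downloader fetches the URL; return None to continue,
        # or return a Response/Request to short-circuit the download.
        return None

    def process_response(self, request, response, spider):
        # Called after the download; must return a Response or a new Request.
        return response


class ExampleSpiderMiddleware:
    def process_spider_input(self, response, spider):
        # Called before the response reaches the spider's callback.
        return None

    def process_spider_output(self, response, result, spider):
        # Called on the items/requests yielded by the spider's callback.
        yield from result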

Multi-Type Data Handling Example

Spider (fm/spiders/qingting.py):

import scrapy
from scrapy import cmdline
from scrapy.http import HtmlResponse
import re

class QingTingSpider(scrapy.Spider):
    name = 'qingting'
    start_urls = ['https://m.qingting.fm/rank/']

    def parse(self, response: HtmlResponse, **kwargs):
        for item in response.xpath('//div[@class="rank-list"]/a'):
            data = {
                'type': 'text',
                'rank': item.xpath('./div[@class="badge"]/text()').get(),
                'img_url': item.xpath('./img/@src').get(),
                'title': item.xpath('.//div[@class="title"]/text()').get(),
                'desc': item.xpath('.//div[@class="desc"]/text()').get(),
            }
            yield data

            img_url = data['img_url']
            if img_url:
                yield scrapy.Request(
                    img_url,
                    callback=self.parse_image,
                    cb_kwargs={'name': data['title']}
                )

    def parse_image(self, response: HtmlResponse, name):
        clean_name = re.sub(r'[\\/:*?"<>|]', '_', name)
        yield {
            'type': 'image',
            'image_name': f"{clean_name}.png",
            'image_content': response.body
        }

if __name__ == '__main__':
    cmdline.execute('scrapy crawl qingting'.split())

Pipeline (fm/pipelines.py):

import os
import pymongo

class FmPipeline:
    def __init__(self):
        self.client = pymongo.MongoClient()
        self.collection = self.client['py_spider']['scrapy_fm_info']

    def process_item(self, item, spider):
        if item.get('type') == 'text':
            self.collection.insert_one(dict(item))
        elif item.get('type') == 'image':
            path = os.path.join(os.getcwd(), 'download_images')
            os.makedirs(path, exist_ok=True)
            filename = os.path.join(path, item['image_name'])
            with open(filename, 'wb') as f:
                f.write(item['image_content'])
        return item

    def close_spider(self, spider):
        # Scrapy calls this when the spider finishes; release the MongoDB connection
        self.client.close()

Enable the pipeline in settings.py:

ITEM_PIPELINES = {'fm.pipelines.FmPipeline': 300}

Middleware Examples

Random User-Agent Middleware:

import random

class RandomUAMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
        # ... more agents
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)

Register in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUAMiddleware': 400,
}

Proxy Integration via Custom Extension:

Use an external service (e.g., KuaiDaili) to rotate proxies. A background thread refreshes IPs every 15 seconds. The middleware injects proxy credentials into request.meta['proxy'].
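
A rough sketch of the idea, with the refresh thread folded into the middleware for brevity (the provider URL, credentials, and response format are placeholders for whatever your proxy service supplies):

import threading
import time

import requests

class ProxyMiddleware:
    proxy = None  # shared proxy URL, refreshed in the background

    def __init__(self):
        # Refresh the proxy IP every 15 seconds on a daemon thread
        thread = threading.Thread(target=self._refresh_loop, daemon=True)
        thread.start()

    def _refresh_loop(self):
        while True:
            try:
                # Placeholder endpoint; replace with your provider's extraction API
                ip = requests.get('https://proxy-provider.example.com/get_ip').text.strip()
                ProxyMiddleware.proxy = f'http://username:password@{ip}'
            except requests.RequestException:
                pass
            time.sleep(15)

    def process_request(self, request, spider):
        if ProxyMiddleware.proxy:
            request.meta['proxy'] = ProxyMiddleware.proxy

Register it in DOWNLOADER_MIDDLEWARES just like the user-agent middleware above.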

Selenium Middleware for Dynamic Content:

from selenium import webdriver
from scrapy import signals
from scrapy.http import HtmlResponse

class SeleniumMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        # Connect spider_closed so the browser shuts down when the spider finishes
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Render the page in a real browser, then hand Scrapy the final HTML
        self.driver.get(request.url)
        body = self.driver.page_source
        return HtmlResponse(url=request.url, body=body, encoding='utf-8', request=request)

    def spider_closed(self, spider):
        self.driver.quit()

Data Decryption with JavaScript

For sites that encrypt responses:

  1. Place decryption logic in spiders/utils/decrypt.js.
  2. Use execjs in a downloader middleware to decrypt response.text before parsing.

import execjs
from scrapy.http import HtmlResponse

class DecryptMiddleware:
    def __init__(self):
        # Compile the external JavaScript decryption routine once at startup
        with open('spiders/utils/decrypt.js', encoding='utf-8') as f:
            self.ctx = execjs.compile(f.read())

    def process_response(self, request, response, spider):
        # Decrypt the raw response text with the JS function,
        # then wrap the plaintext in a new HtmlResponse for the spider
        decrypted = self.ctx.call('decryptFunction', response.text)
        return HtmlResponse(url=response.url, body=decrypted, encoding='utf-8', request=request)

Deduplication Strategies

URL-Level Deduplication (Downloader Middleware):

import hashlib
import redis
from scrapy.exceptions import IgnoreRequest

class URLDedupMiddleware:
    def __init__(self):
        self.redis = redis.Redis()

    def process_request(self, request, spider):
        key = hashlib.md5(request.url.encode()).hexdigest()
        if self.redis.exists(f"seen:{key}"):
            raise IgnoreRequest()
        self.redis.setex(f"seen:{key}", 3600, 1)

Item-Level Deduplication (Pipeline):

Hash serialized items and skip duplicates using Redis.
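
A minimal sketch of such a pipeline, assuming a local Redis instance and JSON-serializable items (the class name and Redis key are illustrative):

import hashlib
import json

import redis
from scrapy.exceptions import DropItem

class ItemDedupPipeline:
    def __init__(self):
        self.redis = redis.Redis()

    def process_item(self, item, spider):
        # Hash a canonical serialization of the item; duplicates share a digest
        fingerprint = hashlib.md5(
            json.dumps(dict(item), sort_keys=True, default=str).encode()
        ).hexdigest()
        if self.redis.sismember('seen:items', fingerprint):
            raise DropItem(f'duplicate item: {fingerprint}')
        self.redis.sadd('seen:items', fingerprint)
        return item

Give it a lower ITEM_PIPELINES number than the storage pipeline so duplicates are dropped before they are written.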

Distributed Crawling with Scrapy-Redis

Install scrapy-redis and configure settings.py:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True
REDIS_URL = 'redis://localhost:6379/2'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

GET-Based Distributed Spider:

Inherit from RedisSpider and define redis_key:

from scrapy_redis.spiders import RedisSpider

class Top250Spider(RedisSpider):
    name = 'top250'
    redis_key = 'top250:start_urls'

    def parse(self, response):
        # extract data
        ...

Seed URLs via Redis:

import redis
r = redis.Redis(db=2)
for i in range(5):
    r.lpush('top250:start_urls', f'https://movie.douban.com/top250?start={i*25}')

POST-Based Distributed Spider:

Override make_request_from_data() to handle form or JSON payloads stored in Redis:

import json

from scrapy.http import JsonRequest
from scrapy_redis.spiders import RedisSpider

class JobSpider(RedisSpider):
    name = 'job_info'
    redis_key = 'job_info:start_payload'

    def make_request_from_data(self, data):
        # data is the raw bytes popped from Redis; it carries the JSON payload
        payload = json.loads(data)['json_data']
        return JsonRequest('https://api.example.com', data=payload, callback=self.parse)

Seed payloads similarly by pushing JSON strings into Redis lists.
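
For example, a minimal seeding script (the payload fields are illustrative; use whatever the target API expects):

import json

import redis

r = redis.Redis(db=2)
# Each list entry is a JSON string; the spider reads its 'json_data' field
payload = {'json_data': {'pageIndex': 1, 'pageSize': 20}}
r.lpush('job_info:start_payload', json.dumps(payload))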
