Scrapy Framework Fundamentals and Advanced Usage
Project Initialization and Execution
To create a new Scrapy project:
scrapy startproject project_name
cd project_name
scrapy genspider spider_name domain.com
Run the spider with:
scrapy crawl spider_name
If dependency errors occur, install compatible versions:
pip install Twisted==22.10.0 urllib3==1.26.18 parsel==1.7.0
Disable robots.txt compliance in settings.py:
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'WARNING' # Reduce log verbosity
Core Components and Execution Flow
Scrapy’s architecture involves coordinated interaction between components:
- The spider provides start_urls, which are converted into Request objects.
- Requests pass through spider middleware.
- The engine receives requests and forwards them to the scheduler.
- The scheduler queues requests and returns them to the engine.
- The engine sends requests through downloader middleware (for headers, proxies, retries).
- The downloader executes asynchronous HTTP(S) requests and returns Response objects.
- Responses traverse downloader middleware, then spider middleware.
- The spider's parse() method processes responses (e.g., using XPath).
- Extracted items are sent to pipelines for storage (CSV, MongoDB, etc.).
Both downloader and spider middleware can modify requests and responses; which one to use depends on the processing stage you need to hook into, as the sketch below illustrates.
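For reference, a minimal sketch of the hook methods Scrapy calls at each stage (the method names are part of Scrapy's middleware API; the logging bodies are illustrative only):
import logging

class ExampleDownloaderMiddleware:
    # Called on the way out to the downloader (headers, proxies, retries)
    def process_request(self, request, spider):
        logging.debug("outgoing request: %s", request.url)
        return None  # returning None continues normal processing

    # Called on the way back from the downloader
    def process_response(self, request, response, spider):
        logging.debug("got response: %s", response.status)
        return response

class ExampleSpiderMiddleware:
    # Called just before the response reaches the spider's callback
    def process_spider_input(self, response, spider):
        return None

    # Called on the items/requests yielded by the spider's callback
    def process_spider_output(self, response, result, spider):
        yield from result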
Multi-Type Data Handling Example
Spider (fm/spiders/qingting.py):
import scrapy
from scrapy import cmdline
from scrapy.http import HtmlResponse
import re

class QingTingSpider(scrapy.Spider):
    name = 'qingting'
    start_urls = ['https://m.qingting.fm/rank/']

    def parse(self, response: HtmlResponse, **kwargs):
        for item in response.xpath('//div[@class="rank-list"]/a'):
            data = {
                'type': 'text',
                'rank': item.xpath('./div[@class="badge"]/text()').get(),
                'img_url': item.xpath('./img/@src').get(),
                'title': item.xpath('.//div[@class="title"]/text()').get(),
                'desc': item.xpath('.//div[@class="desc"]/text()').get(),
            }
            yield data

            img_url = data['img_url']
            if img_url:
                yield scrapy.Request(
                    img_url,
                    callback=self.parse_image,
                    cb_kwargs={'name': data['title']}
                )

    def parse_image(self, response: HtmlResponse, name):
        clean_name = re.sub(r'[\\/:*?"<>|]', '_', name)
        yield {
            'type': 'image',
            'image_name': f"{clean_name}.png",
            'image_content': response.body
        }

if __name__ == '__main__':
    cmdline.execute('scrapy crawl qingting'.split())
Pipeline (fm/pipelines.py):
import os
import pymongo

class FmPipeline:
    def __init__(self):
        self.client = pymongo.MongoClient()
        self.collection = self.client['py_spider']['scrapy_fm_info']

    def process_item(self, item, spider):
        if item.get('type') == 'text':
            self.collection.insert_one(dict(item))
        elif item.get('type') == 'image':
            path = os.path.join(os.getcwd(), 'download_images')
            os.makedirs(path, exist_ok=True)
            filename = os.path.join(path, item['image_name'])
            with open(filename, 'wb') as f:
                f.write(item['image_content'])
        return item

    def __del__(self):
        self.client.close()
Enable the pipeline in settings.py:
ITEM_PIPELINES = {'fm.pipelines.FmPipeline': 300}
Middleware Examples
Random User-Agent Middleware:
import random

class RandomUAMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
        # ... more agents
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
Register in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUAMiddleware': 400,
}
Proxy Integration via Custom Extension:
Use an external service (e.g., KuaiDaili) to rotate proxies. A background thread refreshes IPs every 15 seconds. The middleware injects proxy credentials into request.meta['proxy'].
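A rough sketch of that pattern is shown below, with the refresh thread folded into the middleware for brevity; the fetch URL, response format, and proxy credentials are placeholders for whatever the provider actually exposes:
import threading
import time
import requests

class ProxyPoolMiddleware:
    def __init__(self):
        self.proxy = None
        # Background thread refreshes the proxy roughly every 15 seconds
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        while True:
            try:
                # Hypothetical provider endpoint returning one "ip:port" string
                resp = requests.get('https://your-proxy-provider.example/fetch')
                self.proxy = resp.text.strip()
            except requests.RequestException:
                pass
            time.sleep(15)

    def process_request(self, request, spider):
        if self.proxy:
            # Credentials are placeholders; use the form your provider documents
            request.meta['proxy'] = f'http://user:password@{self.proxy}'
Register it in DOWNLOADER_MIDDLEWARES like any other downloader middleware.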
Selenium Middleware for Dynamic Content:
from scrapy import signals
from selenium import webdriver
from scrapy.http import HtmlResponse

class SeleniumMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Connect spider_closed via signals so the browser quits when the spider finishes
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        self.driver.get(request.url)
        body = self.driver.page_source
        return HtmlResponse(url=request.url, body=body, encoding='utf-8', request=request)

    def spider_closed(self, spider):
        self.driver.quit()
Data Decryption with JavaScript
For sites that encrypt responses:
- Place decryption logic in spiders/utils/decrypt.js.
- Use execjs in a downloader middleware to decrypt response.text before parsing.
import execjs
from scrapy.http import HtmlResponse

class DecryptMiddleware:
    def __init__(self):
        with open('spiders/utils/decrypt.js') as f:
            self.ctx = execjs.compile(f.read())

    def process_response(self, request, response, spider):
        decrypted = self.ctx.call('decryptFunction', response.text)
        return HtmlResponse(url=response.url, body=decrypted, encoding='utf-8')
Deduplication Strategies
URL-Level Deduplication (Downloader Middleware):
import hashlib
import redis
from scrapy.exceptions import IgnoreRequest

class URLDedupMiddleware:
    def __init__(self):
        self.redis = redis.Redis()

    def process_request(self, request, spider):
        key = hashlib.md5(request.url.encode()).hexdigest()
        if self.redis.exists(f"seen:{key}"):
            raise IgnoreRequest()
        # Mark the URL as seen for an hour; returning None lets the request proceed
        self.redis.setex(f"seen:{key}", 3600, 1)
Item-Level Deduplication (Pipeline):
Hash serialized items and skip duplicates using Redis.
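A minimal sketch of such a pipeline, assuming an MD5 fingerprint over the JSON-serialized item and an arbitrary Redis key prefix:
import hashlib
import json
import redis
from scrapy.exceptions import DropItem

class ItemDedupPipeline:
    def __init__(self):
        self.redis = redis.Redis()

    def process_item(self, item, spider):
        # Fingerprint the item by hashing its serialized, key-sorted form
        fingerprint = hashlib.md5(
            json.dumps(dict(item), sort_keys=True, default=str).encode()
        ).hexdigest()
        # set(..., nx=True) returns None if the key already exists, i.e. a duplicate
        if not self.redis.set(f"item_seen:{fingerprint}", 1, nx=True):
            raise DropItem("duplicate item")
        return item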
Distributed Crawling with Scrapy-Redis
Install scrapy-redis and configure settings.py:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True
REDIS_URL = 'redis://localhost:6379/2'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
GET-Based Distributed Spider:
Inherit from RedisSpider and define redis_key:
from scrapy_redis.spiders import RedisSpider

class Top250Spider(RedisSpider):
    name = 'top250'
    redis_key = 'top250:start_urls'

    def parse(self, response):
        ...  # extract data
Seed URLs via Redis:
import redis

r = redis.Redis(db=2)
for i in range(5):
    r.lpush('top250:start_urls', f'https://movie.douban.com/top250?start={i*25}')
POST-Based Distributed Spider:
Override make_request_from_data() to handle form or JSON payloads stored in Redis:
import json

from scrapy.http import JsonRequest
from scrapy_redis.spiders import RedisSpider

class JobSpider(RedisSpider):
    name = 'job_info'
    redis_key = 'job_info:start_payload'

    def make_request_from_data(self, data):
        payload = json.loads(data)['json_data']
        return JsonRequest('https://api.example.com', data=payload, callback=self.parse)
Seed payloads similarly by pushing JSON strings into Redis lists.
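For example, a seeding script matching the spider above; only the 'json_data' key is taken from the spider code, and the inner fields are hypothetical stand-ins for whatever the target API expects:
import json
import redis

r = redis.Redis(db=2)
# 'json_data' is the key make_request_from_data() reads; inner fields are placeholders
payload = {'json_data': {'keyword': 'python', 'pageNum': 1}}
r.lpush('job_info:start_payload', json.dumps(payload))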