Combining Spider and CrawlSpider with Middleware and Simulated Login in Scrapy
Mixing Spider and CrawlSpider Behaviors
It's possible to combine the extraction logic of CrawlSpider with the manual request handling of a regular Spider. For instance, you might use CrawlSpider rules to follow links and collect intermediate data, then make additional requests manually to scrape detailed pages.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BlogSpider(CrawlSpider):
    name = 'blog'  # illustrative name; the wrapper class is added for context

    rules = (
        # Follow author profile pages
        Rule(LinkExtractor(allow=r'blog\.csdn\.net/\w+$'), follow=True),
        # Pagination for the author's articles
        Rule(LinkExtractor(allow=r'channelid=\d+&page=\d+$'), follow=True),
        # Article list pages – extract data here
        Rule(LinkExtractor(allow=r'/\w+/article/list/\d+$'), follow=True, callback='collect_articles'),
    )

    def collect_articles(self, response):
        item = {}
        # ... extract data from the list page into item ...
        # Then manually request each article's detail page
        # (detail_url would be extracted from the list page above)
        yield scrapy.Request(
            url=detail_url,
            callback=self.parse_detail,
            meta={'item': item},
        )

    def parse_detail(self, response):
        # retrieve the partially filled item passed along via meta
        item = response.meta['item']
        # ... extract detail-page fields into item ...
        yield item
Custom Downloader Middleware
Downloader middlewares sit between the engine and the downloader. They can process both outgoing requests and incoming responses. Their typical uses include adding headers, proxies, handling retries, or decompressing responses.
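The two-way flow can be pictured with a small stand-alone sketch. This is a simplified model, not Scrapy's actual engine code; the class name, the dict-based request/response objects, and the download function are invented for illustration:

```python
# Simplified model of the downloader-middleware chain: each middleware
# sees the request on the way out and the response on the way back.
class HeadersMiddleware:
    def process_request(self, request):
        # runs before the download; mutate the outgoing request
        request["headers"]["X-Demo"] = "1"

    def process_response(self, request, response):
        # runs after the download; may annotate or replace the response
        response["note"] = "post-processed"
        return response

def download(request, middlewares):
    for mw in middlewares:                  # outgoing pass, in order
        mw.process_request(request)
    response = {"url": request["url"], "status": 200}  # stand-in for the real download
    for mw in reversed(middlewares):        # return pass, in reverse order
        response = mw.process_response(request, response)
    return response

request = {"url": "http://example.com", "headers": {}}
response = download(request, [HeadersMiddleware()])
```

In real Scrapy, the same shape applies: requests traverse the enabled middlewares in ascending priority order, and responses come back through them in reverse.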
Adding Random User-Agent via Middleware
- Define a list of User-Agent strings in settings.py:

    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    ]

- In middlewares.py, create a middleware class that picks a random User-Agent:

    import random

    class RandomUserAgent(object):
        def process_request(self, request, spider):
            agents = spider.settings.get('USER_AGENTS')
            request.headers['User-Agent'] = random.choice(agents)

- Enable the middleware in settings.py with a priority (lower numbers run closer to the engine, so they process requests earlier):

    DOWNLOADER_MIDDLEWARES = {
        'your_project.middlewares.RandomUserAgent': 10,
    }

- To verify, you can log the User-Agent on each response:

    print(response.request.headers['User-Agent'])
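The middleware's core is just random.choice over the configured list, so it can be sanity-checked without running a crawl. FakeSpider and FakeRequest below are throwaway stand-ins for Scrapy's objects, invented for this check:

```python
import random

class RandomUserAgent(object):
    def process_request(self, request, spider):
        # pick a fresh User-Agent for every outgoing request
        agents = spider.settings.get('USER_AGENTS')
        request.headers['User-Agent'] = random.choice(agents)

# Throwaway stand-ins so the middleware can be exercised without Scrapy.
class FakeSpider:
    settings = {'USER_AGENTS': ["UA-1", "UA-2", "UA-3"]}

class FakeRequest:
    def __init__(self):
        self.headers = {}

req = FakeRequest()
RandomUserAgent().process_request(req, FakeSpider())
```

Running this repeatedly should show the header cycling through the configured agents.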
Simulated Login in Scrapy
Scrapy provides two primary ways to handle login: using pre‑obtained cookies or sending POST requests with credentials.
Method 1: Using a Cookie String
This approach works well when cookies are long‑lived and you can obtain them beforehand (e.g., via Selenium or manual extraction).
Example: Logging into Renren
Create the spider:
scrapy startproject login
cd login
scrapy genspider renren renren.com
Override start_requests to pass cookies as a dictionary:
import scrapy
import re

class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://www.renren.com/960734501/profile']

    def start_requests(self):
        raw_cookies = "anonymid=j8k2lo0cxzvxt2; _r01_=1; springskin=set; depovince=BJ; ..."
        # split('=', 1) splits only on the first '=', so values containing '=' survive
        cookies = dict(pair.split('=', 1) for pair in raw_cookies.split('; '))
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        # search for the account owner's name to confirm the login worked
        print(re.findall(r'张彪', response.body.decode()))
Cookies need to be passed as a dict to the cookies parameter, not in the headers. Subsequent requests within the same session generally keep the cookies.
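The cookie-string-to-dict conversion can be checked in isolation. Splitting each pair with split('=', 1) keeps values that themselves contain an '=' intact; the cookie values below are made up:

```python
raw_cookies = "anonymid=j8k2lo0cxzvxt2; _r01_=1; springskin=set; token=abc=="
# split('=', 1) splits only on the first '=', so 'token=abc==' survives intact
cookies = dict(pair.split('=', 1) for pair in raw_cookies.split('; '))
```

The resulting dict maps each cookie name to its raw string value, ready to pass to scrapy.Request(url, cookies=cookies).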
Enable cookie debugging in settings.py if needed:
COOKIES_DEBUG = True
Method 2: Sending POST Requests with Form Data
When a website requires a form submission, use scrapy.FormRequest to send the data.
Example: Logging into GitHub
- Inspect the login page (https://github.com/login) to identify the form fields. Common fields include: commit, utf8, authenticity_token, login, and password.

- Create the spider:

    scrapy genspider github github.com

- Extract the required tokens from the login page and submit the form:

    import scrapy
    import re

    class GithubSpider(scrapy.Spider):
        name = 'github'
        allowed_domains = ['github.com']
        start_urls = ['https://github.com/login']

        def parse(self, response):
            auth_token = response.xpath("//input[@name='authenticity_token']/@value").get()
            commit_val = response.xpath("//input[@name='commit']/@value").get()
            utf8_val = response.xpath("//input[@name='utf8']/@value").get()
            post_url = 'https://github.com/session'
            form_data = {
                'commit': commit_val,
                'utf8': utf8_val,
                'authenticity_token': auth_token,
                'login': 'your_email@example.com',
                'password': 'your_password',
            }
            yield scrapy.FormRequest(post_url, formdata=form_data, callback=self.after_login)

        def after_login(self, response):
            print(re.findall(r'(your_username)', response.body.decode(), re.I))
Automatic Form Submission with from_response
For simplicity, Scrapy can automatically locate and fill the login form using scrapy.FormRequest.from_response. This works well when the page contains exactly one login form.
Example:
scrapy genspider github2 github.com
import scrapy
import re

class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # from_response locates the form and fills in its hidden fields;
        # we only supply the credentials
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'login': 'your_email@example.com', 'password': 'your_password'},
            callback=self.after_login
        )

    def after_login(self, response):
        print(re.findall('your_username', response.body.decode(), re.I))
This method automatically extracts hidden fields like authenticity_token and utf8 and merges your formdata over them. If the page contains multiple forms, you can pinpoint the right one with arguments such as formnumber or formcss.
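The hidden-field harvesting that from_response performs can be sketched with the standard-library html.parser; the HTML snippet and field values below are made up for illustration, and this is only a rough analogue of what Scrapy does internally:

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from hidden <input> elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr = dict(attrs)
            # only hidden inputs are carried over automatically
            if attr.get("type") == "hidden" and "name" in attr:
                self.fields[attr["name"]] = attr.get("value", "")

html = (
    '<form action="/session" method="post">'
    '<input type="hidden" name="utf8" value="check">'
    '<input type="hidden" name="authenticity_token" value="abc123">'
    '<input type="text" name="login">'
    '</form>'
)
parser = HiddenFieldParser()
parser.feed(html)
```

After feeding the page, parser.fields holds the hidden fields (here utf8 and authenticity_token) while visible inputs such as login are left for the user-supplied formdata.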