Combining Spider and CrawlSpider with Middleware and Simulated Login in Scrapy
Mixing Spider and CrawlSpider Behaviors
It's possible to combine the extraction logic of CrawlSpider with the manual request handling of a regular Spider. For instance, you might use CrawlSpider rules to follow links and collect intermediate data, then make additional requests manually to scrape detailed pages.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BlogSpider(CrawlSpider):
    name = 'blog'  # illustrative name; the wrapper class is added for context

    rules = (
        # Follow author profile pages
        Rule(LinkExtractor(allow=r'blog\.csdn\.net/\w+$'), follow=True),
        # Pagination for the author's articles
        Rule(LinkExtractor(allow=r'channelid=\d+&page=\d+$'), follow=True),
        # Article list pages – extract data here
        Rule(LinkExtractor(allow=r'/\w+/article/list/\d+$'), follow=True, callback='collect_articles'),
    )

    def collect_articles(self, response):
        item = {}
        # ... extract data from the list page into item ...
        # Then manually request each article's detail page
        # (detail_url would be extracted from the list page above)
        yield scrapy.Request(
            url=detail_url,
            callback=self.parse_detail,
            meta={'item': item},
        )

    def parse_detail(self, response):
        # retrieve the partially filled item passed along via meta
        item = response.meta['item']
        # ... extract detail-page fields into item ...
        yield item
Custom Downloader Middleware
Downloader middlewares sit between the engine and the downloader. They can process both outgoing requests and incoming responses. Their typical uses include adding headers, proxies, handling retries, or decompressing responses.
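The two-way flow can be pictured with a small stand-alone sketch. This is a simplified model, not Scrapy's actual engine code; the class name, the dict-based request/response objects, and the download function are invented for illustration:

```python
# Simplified model of the downloader-middleware chain: each middleware
# sees the request on the way out and the response on the way back.
class HeadersMiddleware:
    def process_request(self, request):
        # runs before the download; mutate the outgoing request
        request["headers"]["X-Demo"] = "1"

    def process_response(self, request, response):
        # runs after the download; may annotate or replace the response
        response["note"] = "post-processed"
        return response

def download(request, middlewares):
    for mw in middlewares:                  # outgoing pass, in order
        mw.process_request(request)
    response = {"url": request["url"], "status": 200}  # stand-in for the real download
    for mw in reversed(middlewares):        # return pass, in reverse order
        response = mw.process_response(request, response)
    return response

request = {"url": "http://example.com", "headers": {}}
response = download(request, [HeadersMiddleware()])
```

In real Scrapy, the same shape applies: requests traverse the enabled middlewares in ascending priority order, and responses come back through them in reverse.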
Adding Random User-Agent via Middleware
- Define a list of User-Agent strings in settings.py:

    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    ]

- In middlewares.py, create a middleware class that picks a random User-Agent:

    import random

    class RandomUserAgent(object):
        def process_request(self, request, spider):
            agents = spider.settings.get('USER_AGENTS')
            request.headers['User-Agent'] = random.choice(agents)

- Enable the middleware in settings.py with a priority (lower numbers run closer to the engine, so they process requests earlier):

    DOWNLOADER_MIDDLEWARES = {
        'your_project.middlewares.RandomUserAgent': 10,
    }

- To verify, you can log the User-Agent on each response:

    print(response.request.headers['User-Agent'])
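The middleware's core is just random.choice over the configured list, so it can be sanity-checked without running a crawl. FakeSpider and FakeRequest below are throwaway stand-ins for Scrapy's objects, invented for this check:

```python
import random

class RandomUserAgent(object):
    def process_request(self, request, spider):
        # pick a fresh User-Agent for every outgoing request
        agents = spider.settings.get('USER_AGENTS')
        request.headers['User-Agent'] = random.choice(agents)

# Throwaway stand-ins so the middleware can be exercised without Scrapy.
class FakeSpider:
    settings = {'USER_AGENTS': ["UA-1", "UA-2", "UA-3"]}

class FakeRequest:
    def __init__(self):
        self.headers = {}

req = FakeRequest()
RandomUserAgent().process_request(req, FakeSpider())
```

Running this repeatedly should show the header cycling through the configured agents.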
Simulated Login in Scrapy
Scrapy provides two primary ways to handle login: using pre‑obtained cookies or sending POST requests with credentials.
Method 1: Using a Cookie String
This approach works well when cookies are long‑lived and you can obtain them beforehand (e.g., via Selenium or manual extraction).
Example: Logging into Renren
Create the spider:
scrapy startproject login
cd login
scrapy genspider renren renren.com
Override start_requests to pass cookies as a dictionary:
import scrapy
import re

class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://www.renren.com/960734501/profile']

    def start_requests(self):
        raw_cookies = "anonymid=j8k2lo0cxzvxt2; _r01_=1; springskin=set; depovince=BJ; ..."
        # split('=', 1) splits only on the first '=', so values containing '=' survive
        cookies = dict(pair.split('=', 1) for pair in raw_cookies.split('; '))
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        # search for the account owner's name to confirm the login worked
        print(re.findall(r'张彪', response.body.decode()))
Cookies need to be passed as a dict to the cookies parameter, not in the headers. Subsequent requests within the same session generally keep the cookies.
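The cookie-string-to-dict conversion can be checked in isolation. Splitting each pair with split('=', 1) keeps values that themselves contain an '=' intact; the cookie values below are made up:

```python
raw_cookies = "anonymid=j8k2lo0cxzvxt2; _r01_=1; springskin=set; token=abc=="
# split('=', 1) splits only on the first '=', so 'token=abc==' survives intact
cookies = dict(pair.split('=', 1) for pair in raw_cookies.split('; '))
```

The resulting dict maps each cookie name to its raw string value, ready to pass to scrapy.Request(url, cookies=cookies).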
Enable cookie debugging in settings.py if needed:
COOKIES_DEBUG = True
Method 2: Sending POST Requests with Form Data
When a website requires a form submission, use scrapy.FormRequest to send the data.
Example: Logging into GitHub
- Inspect the login page (https://github.com/login) to identify the form fields. Common fields include: commit, utf8, authenticity_token, login, and password.

- Create the spider:

    scrapy genspider github github.com

- Extract the required tokens from the login page and submit the form:

    import scrapy
    import re

    class GithubSpider(scrapy.Spider):
        name = 'github'
        allowed_domains = ['github.com']
        start_urls = ['https://github.com/login']

        def parse(self, response):
            auth_token = response.xpath("//input[@name='authenticity_token']/@value").get()
            commit_val = response.xpath("//input[@name='commit']/@value").get()
            utf8_val = response.xpath("//input[@name='utf8']/@value").get()
            post_url = 'https://github.com/session'
            form_data = {
                'commit': commit_val,
                'utf8': utf8_val,
                'authenticity_token': auth_token,
                'login': 'your_email@example.com',
                'password': 'your_password',
            }
            yield scrapy.FormRequest(post_url, formdata=form_data, callback=self.after_login)

        def after_login(self, response):
            print(re.findall(r'(your_username)', response.body.decode(), re.I))
Automatic Form Submission with from_response
For simplicity, Scrapy can automatically locate and fill the login form using scrapy.FormRequest.from_response. This works well when the page contains exactly one login form.
Example:
scrapy genspider github2 github.com
import scrapy
import re

class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # from_response locates the form and fills in its hidden fields;
        # we only supply the credentials
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'login': 'your_email@example.com', 'password': 'your_password'},
            callback=self.after_login
        )

    def after_login(self, response):
        print(re.findall('your_username', response.body.decode(), re.I))
This method automatically extracts hidden fields like authenticity_token and utf8 and merges your formdata over them. If the page contains multiple forms, you can pinpoint the right one with arguments such as formnumber or formcss.
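The hidden-field harvesting that from_response performs can be sketched with the standard-library html.parser; the HTML snippet and field values below are made up for illustration, and this is only a rough analogue of what Scrapy does internally:

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from hidden <input> elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr = dict(attrs)
            # only hidden inputs are carried over automatically
            if attr.get("type") == "hidden" and "name" in attr:
                self.fields[attr["name"]] = attr.get("value", "")

html = (
    '<form action="/session" method="post">'
    '<input type="hidden" name="utf8" value="check">'
    '<input type="hidden" name="authenticity_token" value="abc123">'
    '<input type="text" name="login">'
    '</form>'
)
parser = HiddenFieldParser()
parser.feed(html)
```

After feeding the page, parser.fields holds the hidden fields (here utf8 and authenticity_token) while visible inputs such as login are left for the user-supplied formdata.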