Combining Spider and CrawlSpider with Middleware and Simulated Login in Scrapy

Mixing Spider and CrawlSpider Behaviors

It's possible to combine the extraction logic of CrawlSpider with the manual request handling of a regular Spider. For instance, you might use CrawlSpider rules to follow links and collect intermediate data, then make additional requests manually to scrape detailed pages.

rules = (
    # Follow author profile pages
    Rule(LinkExtractor(allow=r'blog\.csdn\.net/\w+$'), follow=True),
    # Pagination for the author's articles
    Rule(LinkExtractor(allow=r'channelid=\d+&page=\d+$'), follow=True),
    # Article list pages – extract data here
    Rule(LinkExtractor(allow=r'/\w+/article/list/\d+$'), follow=True, callback='collect_articles'),
)

def collect_articles(self, response):
    item = {}
    # ... extract data from the list page into item ...
    # Then manually request each article's detail page
    # (the XPath below is a placeholder; adjust it to the target page)
    detail_url = response.urljoin(response.xpath("//a[@class='title']/@href").get())
    yield scrapy.Request(
        url=detail_url,
        callback=self.parse_detail,
        meta={'item': item}
    )

def parse_detail(self, response):
    # pick the partially filled item back up and complete it here
    item = response.meta['item']
    yield item
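
On Scrapy 1.7 and later, cb_kwargs is the documented alternative to meta for handing data to a callback. A minimal variant of the same hand-off:

yield scrapy.Request(
    url=detail_url,
    callback=self.parse_detail,
    cb_kwargs={'item': item}
)

def parse_detail(self, response, item):
    # the item arrives as a keyword argument instead of via response.meta
    yield item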

Custom Downloader Middleware

Downloader middlewares sit between the engine and the downloader, so they can process every outgoing request and every incoming response. Typical uses include setting headers, routing requests through proxies, handling retries, and decompressing responses.
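
In skeleton form, the two hooks look like this (the class name and log messages are illustrative):

class LoggingDownloaderMiddleware:
    def process_request(self, request, spider):
        # runs for every request on its way to the downloader;
        # returning None lets processing continue normally
        spider.logger.debug('Outgoing: %s', request.url)

    def process_response(self, request, response, spider):
        # runs for every response on its way back to the engine;
        # must return a Response (or a new Request)
        spider.logger.debug('Incoming: %d %s', response.status, response.url)
        return response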

Adding Random User-Agent via Middleware

  1. Define a list of User-Agent strings in settings.py:

    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
    ]
    
  2. In middlewares.py, create a middleware class that picks a random User-Agent:

    import random
    
    class RandomUserAgent:
        def process_request(self, request, spider):
            # pick a fresh User-Agent for every outgoing request
            agents = spider.settings.get('USER_AGENTS')
            request.headers['User-Agent'] = random.choice(agents)
    
  3. Enable the middleware in settings.py (the number is the middleware's priority; lower values run earlier for requests):

    DOWNLOADER_MIDDLEWARES = {
        'your_project.middlewares.RandomUserAgent': 10,
    }
    
  4. To verify, log the User-Agent each response was fetched with (for example, inside your spider's parse callback):

    print(response.request.headers['User-Agent'])
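
Beyond headers, the same process_request hook covers the proxy use mentioned earlier. A minimal sketch, assuming a hypothetical PROXIES list defined in settings.py:

import random

class RandomProxy:
    def process_request(self, request, spider):
        # route the request through a randomly chosen proxy
        # (PROXIES is an assumed setting, e.g. ['http://1.2.3.4:8080', ...])
        proxies = spider.settings.get('PROXIES')
        if proxies:
            request.meta['proxy'] = random.choice(proxies)

Register it in DOWNLOADER_MIDDLEWARES like the User-Agent example, with a priority below 750 so it runs before Scrapy's built-in HttpProxyMiddleware.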
    

Simulated Login in Scrapy

Scrapy provides two primary ways to handle login: using pre‑obtained cookies or sending POST requests with credentials.

Method 1: Using a Cookie String

This approach works well when cookies are long‑lived and you can obtain them beforehand (e.g., via Selenium or manual extraction).
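
For example, cookies captured in a Selenium session can be converted into the dict format Scrapy expects (a sketch; it assumes the selenium package and a matching browser driver are installed):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.renren.com')
# ... complete the login in the opened browser ...
cookies = {c['name']: c['value'] for c in driver.get_cookies()}
driver.quit()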

Example: Logging into Renren

Create the spider:

scrapy startproject login
cd login
scrapy genspider renren renren.com

Override start_requests to pass cookies as a dictionary:

import scrapy
import re

class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://www.renren.com/960734501/profile']

    def start_requests(self):
        raw_cookies = "anonymid=j8k2lo0cxzvxt2; _r01_=1; springskin=set; depovince=BJ; ..."
        # split each pair on the first '=' only, since cookie values may themselves contain '='
        cookies = dict(pair.split('=', 1) for pair in raw_cookies.split('; '))
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        # search for the account's display name (张彪) to confirm the login worked
        print(re.findall(r'张彪', response.text))

Cookies must be passed as a dict via the cookies parameter, not stuffed into the headers. Once set, Scrapy's cookie middleware keeps them for subsequent requests in the same session.
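
If you need several independent login sessions in one spider, Scrapy's cookies middleware supports the cookiejar meta key (a schematic fragment; the URLs and callbacks are placeholders):

def start_requests(self):
    # each distinct 'cookiejar' value gets its own independent cookie session
    for i, url in enumerate(self.start_urls):
        yield scrapy.Request(url, meta={'cookiejar': i}, callback=self.parse)

def parse(self, response):
    # the jar is not sticky by itself: pass it along on follow-up requests
    yield scrapy.Request('http://www.renren.com/other_page',
                         meta={'cookiejar': response.meta['cookiejar']},
                         callback=self.parse_detail)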

Enable cookie debugging in settings.py if needed:

COOKIES_DEBUG = True

Method 2: Sending POST Requests with Form Data

When a website requires a form submission, use scrapy.FormRequest to send the data.

Example: Logging into GitHub

  1. Inspect the login page (https://github.com/login) to identify the form fields. Common fields include:

    • commit
    • utf8
    • authenticity_token
    • login
    • password
  2. Create the spider:

    scrapy genspider github github.com
    
  3. Extract the required tokens from the login page and submit the form:

    import scrapy
    import re
    
    class GithubSpider(scrapy.Spider):
        name = 'github'
        allowed_domains = ['github.com']
        start_urls = ['https://github.com/login']
    
        def parse(self, response):
            auth_token = response.xpath("//input[@name='authenticity_token']/@value").get()
            commit_val = response.xpath("//input[@name='commit']/@value").get()
            utf8_val = response.xpath("//input[@name='utf8']/@value").get()
    
            post_url = 'https://github.com/session'
            form_data = {
                'commit': commit_val,
                'utf8': utf8_val,
                'authenticity_token': auth_token,
                'login': 'your_email@example.com',
                'password': 'your_password'
            }
            yield scrapy.FormRequest(post_url, formdata=form_data, callback=self.after_login)
    
        def after_login(self, response):
            # if the login succeeded, the page should contain your username
            print(re.findall(r'your_username', response.text, re.I))
    

Automatic Form Submission with from_response

For simplicity, Scrapy can automatically locate and fill the login form using scrapy.FormRequest.from_response. This works well when the page contains exactly one login form.

Example:

scrapy genspider github2 github.com

import scrapy
import re

class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'login': 'your_email@example.com', 'password': 'your_password'},
            callback=self.after_login
        )

    def after_login(self, response):
        print(re.findall(r'your_username', response.text, re.I))

This method automatically picks up hidden fields such as authenticity_token and utf8. If the page contains more than one form, point from_response at the right one with the formname, formid, formnumber, formxpath, or formcss argument.
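
For example (the CSS selector is illustrative; match it to the real page):

yield scrapy.FormRequest.from_response(
    response,
    formcss='form[action="/session"]',  # or formnumber=0, formname=..., formxpath=...
    formdata={'login': 'your_email@example.com', 'password': 'your_password'},
    callback=self.after_login
)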
