
Web Scraping Fundamentals: Understanding Crawlers and HTTP Protocol

Tech May 17 15

Data Acquisition in the Digital Age

In today's data-driven landscape, information originates from multiple channels:

Enterprise-generated user data: Baidu Index, Alibaba Index, Tencent Browsing Index, Weibo Index
Purchased datasets: Data marketplaces and exchanges
Government/institutional open data: National statistics bureaus, World Bank, UN databases
Consulting firm reports: McKinsey, Accenture, iResearch
Web scraping: When required data isn't commercially available or purchasing isn't viable, organizations hire scraping specialists to extract information directly from websites

Crawler Classifications

Web crawlers fall into two primary categories based on application scenarios: general-purpose crawlers and focused crawlers.

General-Purpose Crawlers

General web crawlers form the backbone of search engine indexing systems (Google, Baidu, Bing). Their core mission involves downloading web pages across the internet to create mirrored content repositories for search services.

Operational Workflow

The performance of general crawlers directly impacts search engine quality by determining content richness and freshness. The process follows four distinct phases:

Phase 1: Page Discovery and Retrieval

Initialize a seed URL set and queue them for processing
Dequeue URLs, resolve DNS to IP addresses, download corresponding pages, store in a repository, and mark as crawled
Extract new URLs from downloaded content and enqueue them for subsequent crawling cycles
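
The three steps above can be sketched as a breadth-first loop over a URL queue. This is a minimal illustration, not how any particular search engine implements it: the `fetch` callable and the in-memory `FAKE_WEB` site are hypothetical stand-ins for DNS resolution and real HTTP downloads, so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Dequeue a URL, download it, store it, then enqueue the links it contains."""
    frontier = deque(seed_urls)   # URLs waiting to be crawled
    crawled = set()               # URLs already processed
    repository = {}               # url -> downloaded HTML

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if url in crawled:
            continue
        crawled.add(url)
        html = fetch(url)         # hypothetical downloader; returns None on failure
        if html is None:
            continue
        repository[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in crawled:
                frontier.append(absolute)
    return repository


# Hypothetical three-page site used instead of real downloads.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/">home</a>',
    "https://example.com/b": '<a href="/a">A</a> <a href="/missing">gone</a>',
}

pages = crawl(["https://example.com/"], FAKE_WEB.get)
print(sorted(pages))
```

The `crawled` set is what keeps the loop from revisiting pages that link back to each other.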

Search engines discover new websites through:

Direct URL submission via webmaster tools
External backlinks from existing indexed sites
Partnerships with DNS providers for rapid domain detection

Crawlers must respect specific directives including nofollow attributes and the Robots Exclusion Protocol. This standard allows websites to define crawlable paths via robots.txt files (e.g., https://www.taobao.com/robots.txt).
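
As a rough sketch of how a crawler might honor the Robots Exclusion Protocol, Python's standard `urllib.robotparser` can evaluate rules before a URL is fetched. The rules and the `ExampleBot` user agent below are made up for illustration; a real crawler would load the site's actual robots.txt instead of an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the site's file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="ExampleBot"):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return robots.can_fetch(user_agent, url)


print(is_allowed("https://example.com/index.html"))
print(is_allowed("https://example.com/private/data.html"))
```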

Phase 2: Raw Data Storage

Downloaded HTML content enters a raw page database, preserving the exact markup rendered in browsers. Crawlers perform duplicate content detection, often abandoning sites with excessive plagiarized material.
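
One simple way to picture duplicate detection is to fingerprint each page and look the fingerprint up before storing. The sketch below only catches exact matches after whitespace and case normalization; it is an assumption for illustration, and production systems typically rely on near-duplicate techniques instead. `RawPageStore` and `content_fingerprint` are hypothetical names.

```python
import hashlib
import re


def content_fingerprint(html):
    """SHA-256 of the markup after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", html).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class RawPageStore:
    """In-memory stand-in for a raw page database."""

    def __init__(self):
        self.pages = {}          # url -> html
        self.fingerprints = {}   # fingerprint -> first URL seen with that content

    def add(self, url, html):
        """Store the page; return the URL of an earlier duplicate, or None."""
        fingerprint = content_fingerprint(html)
        duplicate_of = self.fingerprints.get(fingerprint)
        if duplicate_of is None:
            self.fingerprints[fingerprint] = url
            self.pages[url] = html
        return duplicate_of


store = RawPageStore()
print(store.add("https://example.com/a", "<p>Hello</p>"))
print(store.add("https://example.com/copy", "<p>hello</p>  "))
```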

Phase 3: Content Preprocessing

Before indexing, systems execute:

Text extraction and tokenization
Noise removal (navigation, ads, copyright notices)
Link relationship analysis
Special file handling (PDF, DOC, XLS, PPT, TXT)
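
The first two preprocessing steps can be illustrated with the standard library's `html.parser`. This is a simplified sketch: treating `nav`, `footer`, `aside`, `script`, and `style` as noise is an assumption, and the regex tokenizer only suits space-delimited languages (Chinese text, for instance, would need a word segmenter).

```python
import re
from html.parser import HTMLParser

# Assumed set of tags whose content counts as noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text while skipping anything nested inside noise tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_tokens(html):
    """Return lowercase word tokens from the non-noise text of a page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return re.findall(r"\w+", text.lower())


sample = (
    "<html><nav>Home | About</nav>"
    "<p>Focused crawlers filter content.</p>"
    "<footer>Copyright 2024</footer></html>"
)
print(extract_tokens(sample))
```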

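Note: Search engines handle non-text content such as images, video, and Flash poorly. Older crawlers could not execute JavaScript at all; modern ones such as Googlebot can render it, but script-generated content may still be indexed late or incompletely.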

Phase 4: Search Service and Ranking

Processed information becomes searchable through keyword queries. Results rank by PageRank algorithms or sponsored placement.
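
To give a feel for link-based ranking, here is a small power-iteration sketch of the classic PageRank idea. It is a textbook simplification, not a description of any engine's production ranking; the four-page `GRAPH`, the damping factor of 0.85, and the iteration count are illustrative assumptions, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share each round.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(GRAPH)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(page, round(score, 3))
```

Page C collects links from three pages and ends up ranked highest, while D, which nothing links to, stays at the baseline.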

General Crawler Limitations

Much of the returned content (by a commonly quoted rough estimate, about 90%) is irrelevant to a given user's needs
Cannot personalize results for different domains or user backgrounds
Ineffective at handling multimedia and dynamic content
Keyword-based searches lack semantic understanding

Focused Crawlers

Focused crawlers address these limitations by targeting specific topics. Unlike general crawlers, they filter content during extraction, capturing only relevant information. Our subsequent discussions center on building focused crawlers.

HTTP and HTTPS Protocols

Protocol Fundamentals

HTTP (HyperText Transfer Protocol) governs HTML page distribution. HTTPS adds SSL/TLS encryption layers for secure transmission.

HTTP default port: 80
HTTPS default port: 443

URL Structure

Uniform Resource Locator format: scheme://host[:port]/path[?query][#fragment]

scheme: Protocol (http, https, ftp)
host: Domain or IP address
port: Optional port specification
path: Resource location
query: Parameter string
fragment: Anchor navigation
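
The components listed above map directly onto what `urllib.parse.urlsplit` returns. The URL below is a made-up example chosen so that every component is present.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL containing every component of the format.
url = "https://example.com:8443/docs/page.html?lang=en&page=2#install"
parts = urlsplit(url)

print(parts.scheme)           # protocol
print(parts.hostname)         # domain or IP address
print(parts.port)             # optional port, None when omitted
print(parts.path)             # resource location
print(parse_qs(parts.query))  # parameter string as a dict of lists
print(parts.fragment)         # anchor
```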

HTTP Request Composition

HTTP transactions consist of client requests and server responses. Requests comprise four sections: request line, headers, blank line, and message body.
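
To make the four sections concrete, the sketch below assembles a raw HTTP/1.1 request by hand. The `build_request` helper is hypothetical and exists only to show the layout; in practice a library such as Requests produces these bytes for you.

```python
def build_request(method, path, host, headers=None, body=b""):
    """Assemble request line, headers, blank line, and body into raw bytes."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]   # request line + Host
    all_headers = dict(headers or {})
    if body:
        all_headers["Content-Length"] = str(len(body))
    for name, value in all_headers.items():
        lines.append(f"{name}: {value}")
    head = "\r\n".join(lines) + "\r\n\r\n"                    # blank line ends the headers
    return head.encode("ascii") + body                        # message body follows


raw = build_request(
    "POST",
    "/submit",
    "example.com",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    body=b"q=python",
)
print(raw.decode("ascii"))
```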

Browser Request Flow

User enters URL; browser initiates GET/POST request
Server returns HTML response
Browser parses HTML, identifies dependent resources (images, CSS, JS)
Browser issues subsequent requests for dependencies
Complete page renders after all resources load

HTTP Methods

HTTP/1.0 defined GET, POST, HEAD. HTTP/1.1 added OPTIONS, PUT, DELETE, TRACE, CONNECT.

Method	Description
GET	Retrieve specified resource
HEAD	Fetch headers only (no body)
POST	Submit data for processing
PUT	Replace target resource entirely
DELETE	Remove specified resource
CONNECT	Establish a tunnel to the destination server through a proxy
OPTIONS	Describe communication options
TRACE	Perform message loop-back test

GET vs POST

GET: Data retrieval; parameters visible in the URL; practical URL length limits imposed by browsers and servers apply
POST: Data submission; parameters in the request body; no URL length constraint, though servers may cap body size; used for forms and file uploads
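
The difference in where parameters travel can be seen without sending anything. The sketch below uses the standard library's `urllib.request.Request` purely to build the two request objects; the httpbin.org URLs are placeholders and no network call is made.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "python web scraping", "limit": 10}
encoded = urlencode(params)   # same encoding is used in both cases

# GET: parameters are appended to the URL as a query string.
get_request = Request(f"https://httpbin.org/get?{encoded}")

# POST: parameters are carried in the request body instead.
post_request = Request("https://httpbin.org/post", data=encoded.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```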

Essential Request Headers

1. Host
Specifies target hostname and port from the URL.

2. Connection
Manages persistent connections. HTTP/1.1 defaults to keep-alive, allowing multiple requests over a single TCP connection. Servers respond with Connection: keep-alive to maintain the channel or Connection: close to terminate it.

3. Upgrade-Insecure-Requests
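Signals that the client prefers an encrypted, authenticated response and can handle the upgrade-insecure-requests CSP directive, which lets a site upgrade HTTP resources to HTTPS and avoid mixed-content warnings.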

4. User-Agent
Identifies the client application, operating system, and version. Critical for avoiding bot detection.

5. Accept
Defines acceptable MIME types for the response. Quality factors (q) indicate preference weights (0.0 to 1.0). Example: Accept: text/html,application/xhtml+xml;q=0.9,*/*;q=0.8

6. Referer
Indicates the originating page URL. Servers use this for analytics and anti-hotlinking protection.

7. Accept-Encoding
Specifies supported compression algorithms (gzip, deflate, br). Enables bandwidth reduction through encoded transfers.

8. Accept-Language
Lists preferred natural languages (en, zh-CN, etc.) for content negotiation.

9. Accept-Charset
Declares supported character encodings (UTF-8, ISO-8859-1, GB2312). Defaults to any if omitted.

10. Cookie
Transmits stored cookies for session management and user identification.

11. Content-Type
In POST requests, defines the media type of the body. Example: Content-Type: application/json; charset=utf-8
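
The quality factors described under Accept can be made concrete with a small parser that orders media types by preference. `parse_accept` is a hypothetical helper written for this illustration and ignores edge cases such as malformed q values.

```python
def parse_accept(header_value):
    """Return (media_type, quality) pairs, most preferred first."""
    preferences = []
    for item in header_value.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        quality = 1.0   # q defaults to 1.0 when not specified
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                quality = float(value)
        preferences.append((media_type, quality))
    # sorted() is stable, so equal weights keep their original order.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


accept = "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"
for media_type, quality in parse_accept(accept):
    print(media_type, quality)
```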

HTTP Response Structure

Responses contain: status line, headers, blank line, and response body.

Sample Response

HTTP/1.1 200 OK
Server: nginx/1.18.0
Date: Mon, 15 Jan 2024 08:30:15 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 15384
Connection: keep-alive
Cache-Control: max-age=3600
Content-Encoding: gzip
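
A rough sketch of how a client might split such a response into its parts follows. The raw text reuses a subset of the sample headers with a placeholder body (so Content-Length and Content-Encoding are left out), and `parse_response` is a hypothetical helper, not part of any library.

```python
# Subset of the sample response with a placeholder body.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Server: nginx/1.18.0\r\n"
    "Date: Mon, 15 Jan 2024 08:30:15 GMT\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Cache-Control: max-age=3600\r\n"
    "\r\n"
    "<html></html>"
)


def parse_response(raw):
    """Split a response into version, status code, reason, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")          # blank line separates head and body
    status_line, *header_lines = head.split("\r\n")
    version, status_code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")           # split on the first colon only
        headers[name.strip().lower()] = value.strip()
    return version, int(status_code), reason, headers, body


version, status_code, reason, headers, body = parse_response(RAW_RESPONSE)
print(version, status_code, reason)
print(headers["content-type"])
```

Header names are lowercased on the way in because HTTP header names are case-insensitive.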

Key Response Headers

1. Cache-Control
Directs client caching behavior. no-cache forces revalidation; max-age=3600 permits 1-hour caching.

2. Connection
Mirrors the request's connection preference.

3. Content-Encoding
Indicates response compression method (gzip, deflate).

4. Content-Type
Specifies MIME type and character encoding. Mismatches cause display corruption.

5. Date
Server timestamp in GMT, ensuring timezone consistency.

6. Expires
Legacy caching directive; less reliable than Cache-Control due to clock synchronization issues.

7. Server
Reveals server software version (often masked for security).

8. Transfer-Encoding: chunked
Signals dynamic content delivered in segmented parts, with a zero-length block marking completion.
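
The chunked format mentioned in item 8 can be illustrated with a small decoder. Each chunk is a hexadecimal size line, the data, and a trailing CRLF; a size of zero ends the body. `decode_chunked` is a simplified sketch that assumes the whole body is already in memory and ignores trailers.

```python
def decode_chunked(data):
    """Reassemble a chunked body from raw bytes."""
    body = bytearray()
    position = 0
    while True:
        line_end = data.index(b"\r\n", position)
        # Drop any chunk extension after ';' before reading the hex size.
        size_field = data[position:line_end].split(b";", 1)[0]
        size = int(size_field, 16)
        if size == 0:
            break                          # zero-length chunk marks completion
        start = line_end + 2
        body += data[start:start + size]
        position = start + size + 2        # skip the CRLF that follows the chunk data
    return bytes(body)


encoded_body = b"5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n"
print(decode_chunked(encoded_body))
```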

Status Codes

1xx: Informational (request received, continuing)
2xx: Success (200 OK, 201 Created)
3xx: Redirection (301 Moved Permanently, 302 Found, 304 Not Modified)
4xx: Client errors (400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found)
5xx: Server errors (500 Internal Server Error, 503 Service Unavailable)
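
In a crawler, the status class usually decides what happens next. The sketch below groups codes by their first digit using the standard `http.HTTPStatus` enum for reason phrases; the set of codes treated as retryable is an assumed policy for illustration, not a rule from the HTTP specification.

```python
from http import HTTPStatus

CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

# Assumed retry policy: temporary overload or upstream failures.
RETRYABLE = {429, 500, 502, 503, 504}


def describe(status_code):
    """Return a readable label such as '404 Not Found (client error)'."""
    category = CATEGORIES.get(status_code // 100, "unknown")
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        phrase = "Unregistered"   # code not defined in the standard enum
    return f"{status_code} {phrase} ({category})"


def should_retry(status_code):
    """True when a later attempt has a reasonable chance of succeeding."""
    return status_code in RETRYABLE


for code in (200, 304, 404, 503):
    print(describe(code), "retry" if should_retry(code) else "no retry")
```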

Session Management

HTTP's stateless nature requires mechanisms to maintain client context:

Cookies: Client-side data storage for session identifiers and preferences
Sessions: Server-side storage linked to client cookies
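
How the two mechanisms fit together can be sketched with the standard `http.cookies` module: the server issues a session identifier in a Set-Cookie header, the client echoes it back in a Cookie header, and the server uses it to look up its stored state. The `SESSIONS` dictionary, the `sessionid` cookie name, and both helper functions are hypothetical.

```python
from http.cookies import SimpleCookie

# Hypothetical server-side session store: session id -> per-user state.
SESSIONS = {"abc123": {"user": "alice", "cart_items": 2}}


def cookie_header_from(set_cookie_value):
    """Client side: turn a Set-Cookie value into the Cookie header to send back."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


def load_session(cookie_header):
    """Server side: find the stored session for the incoming Cookie header."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    morsel = jar.get("sessionid")
    return SESSIONS.get(morsel.value) if morsel else None


request_cookie = cookie_header_from("sessionid=abc123; Path=/; HttpOnly")
print(request_cookie)
print(load_session(request_cookie))
```

Only the identifier crosses the network; the state it points to stays on the server.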

Python Requests Library

The Requests library simplifies HTTP operations in Python and is built on urllib3. Recent releases require Python 3 (Python 2.7 support ended with the 2.27 series) and also run on PyPy.

Installation: pip install requests

Basic GET Request

import requests

resp = requests.get('https://httpbin.org/get')
print(resp.status_code)        # 200
print(resp.headers['content-type'])  # application/json
print(resp.text)               # JSON response

GET with Parameters and Headers

import requests

search_params = {'q': 'python web scraping', 'limit': 10}
custom_headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

response = requests.get(
    'https://api.github.com/search/repositories',
    params=search_params,
    headers=custom_headers
)

print(f"Final URL: {response.url}")
print(response.json()['total_count'])

Image Download Example

import requests

image_url = 'https://httpbin.org/image/png'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

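image_response = requests.get(image_url, headers=headers, timeout=10)
image_response.raise_for_status()  # Stop here rather than saving an error page as a PNG
image_data = image_response.content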

with open('sample_image.png', 'wb') as file_handle:
    file_handle.write(image_data)

Forum Pagination Scraper

Scraping the first 10 pages of a discussion forum:

import time
import requests

class ForumCrawler:
    def __init__(self, forum_topic):
        self.topic = forum_topic
        # Placeholder endpoint; {} is filled in with the page number by str.format()
        self.base_url = "https://httpbin.org/html?page={}"
        self.request_headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
        }
    
    def generate_page_urls(self):
        """Create list of URLs for pages 1-10"""
        return [self.base_url.format(page_num) for page_num in range(1, 11)]
    
    def fetch_content(self, target_url):
        """Retrieve page content with error handling"""
        try:
            response = requests.get(
                target_url, 
                headers=self.request_headers,
                timeout=15
            )
            response.raise_for_status()
            return response.text
        except requests.RequestException as error:
            print(f"Request failed: {error}")
            return None
    
    def save_document(self, html_content, page_number):
        """Store HTML content to local file"""
        if html_content:
            filename = f"{self.topic}_page_{page_number:02d}.html"
            with open(filename, 'w', encoding='utf-8') as output_file:
                output_file.write(html_content)
            print(f"Saved: {filename}")
    
    def execute(self):
        """Main execution flow"""
        page_urls = self.generate_page_urls()
        
        for index, url in enumerate(page_urls, start=1):
            print(f"Processing page {index}...")
            content = self.fetch_content(url)
            self.save_document(content, index)
            time.sleep(1)  # Respect rate limits

# Usage
if __name__ == "__main__":
    topic = input("Enter forum topic: ")
    scraper = ForumCrawler(topic)
    scraper.execute()

Tags: web-scraping

Copyright © fadingcoder.top
