Fading Coder

Web Scraping Fundamentals: Understanding Crawlers and HTTP Protocol

Tech May 17 4

Data Acquisition in the Digital Age

In today's data-driven landscape, information originates from multiple channels:

  • Enterprise-generated user data: Baidu Index, Alibaba Index, Tencent Browsing Index, Weibo Index
  • Purchased datasets: Data marketplaces and exchanges
  • Government/institutional open data: National statistics bureaus, World Bank, UN databases
  • Consulting firm reports: McKinsey, Accenture, iResearch
  • Web scraping: When required data isn't commercially available or purchasing isn't viable, organizations hire scraping specialists to extract information directly from websites

Crawler Classifications

Web crawlers fall into two primary categories based on application scenarios: general-purpose crawlers and focused crawlers.

General-Purpose Crawlers

General web crawlers form the backbone of search engine indexing systems (Google, Baidu, Bing). Their core mission involves downloading web pages across the internet to create mirrored content repositories for search services.

Operational Workflow

The performance of general crawlers directly impacts search engine quality by determining content richness and freshness. The process follows four distinct phases:

Phase 1: Page Discovery and Retrieval
  1. Initialize a seed URL set and queue them for processing
  2. Dequeue URLs, resolve DNS to IP addresses, download corresponding pages, store in a repository, and mark as crawled
  3. Extract new URLs from downloaded content and enqueue them for subsequent crawling cycles
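
The three steps above can be sketched as a breadth-first loop over a URL queue. This is a minimal illustration, not how any particular search engine is built: the `crawl` function, the `LinkExtractor` helper, and the in-memory `FAKE_WEB` dictionary are all hypothetical names, and the `fetch` callable stands in for a real HTTP client (which would also handle DNS resolution).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text or None."""
    frontier = deque(seed_urls)   # step 1: queue the seed URLs
    seen = set(seed_urls)         # URLs already queued or crawled
    repository = {}               # url -> downloaded HTML

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()  # step 2: dequeue, download, store
        html = fetch(url)
        if html is None:
            continue
        repository[url] = html

        extractor = LinkExtractor()  # step 3: extract and enqueue new URLs
        extractor.feed(html)
        for href in extractor.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return repository


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": "<p>no links</p>",
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.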

Search engines discover new websites through:

  • Direct URL submission via webmaster tools
  • External backlinks from existing indexed sites
  • Partnerships with DNS providers for rapid domain detection

Crawlers must respect specific directives including nofollow attributes and the Robots Exclusion Protocol. This standard allows websites to define crawlable paths via robots.txt files (e.g., https://www.taobao.com/robots.txt).
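
Python's standard library ships a parser for the Robots Exclusion Protocol. The sketch below feeds it a made-up robots.txt (the rules, the `ExampleBot` user agent, and the example.test URLs are assumptions for illustration) and checks whether a URL may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the site's /robots.txt path instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="ExampleBot"):
    """Return True if the parsed rules permit this user agent to fetch url."""
    return robots.can_fetch(user_agent, url)


print(is_allowed("https://example.test/index.html"))
print(is_allowed("https://example.test/private/data.html"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads and parses the file in one step.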

Phase 2: Raw Data Storage

Downloaded HTML is written to a raw page database exactly as the server returned it, i.e., the same markup a browser would receive. Crawlers also check for duplicate content at this stage and often stop crawling sites that consist largely of copied material.
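
One simple way to spot exact duplicates is to hash each page after light normalization. The sketch below is an assumption about how such a check could look, not a description of any search engine's method; `RawPageStore` and `content_fingerprint` are hypothetical names, and real systems typically use near-duplicate techniques rather than exact hashes.

```python
import hashlib
import re


def content_fingerprint(html):
    """Hash the page after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", html).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class RawPageStore:
    """Keep raw HTML per URL and reject pages whose content was already seen."""

    def __init__(self):
        self.pages = {}         # url -> raw HTML
        self.fingerprints = {}  # fingerprint -> first URL with that content

    def add(self, url, html):
        """Store the page; return False if identical content is already stored."""
        fingerprint = content_fingerprint(html)
        if fingerprint in self.fingerprints:
            return False
        self.fingerprints[fingerprint] = url
        self.pages[url] = html
        return True


store = RawPageStore()
print(store.add("http://example.test/a", "<p>Hello</p>"))
print(store.add("http://example.test/copy", "<p>Hello</p>  \n"))
```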

Phase 3: Content Preprocessing

Before indexing, systems execute:

  • Text extraction and tokenization
  • Noise removal (navigation, ads, copyright notices)
  • Link relationship analysis
  • Special file handling (PDF, DOC, XLS, PPT, TXT)
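
Text extraction, noise removal, and tokenization can be illustrated with the standard library alone. This is a rough sketch under simplifying assumptions: the set of tags treated as noise is arbitrary, `TextExtractor` and `extract_tokens` are hypothetical names, and the regex tokenizer only suits languages that separate words with spaces.

```python
import re
from html.parser import HTMLParser

# Assumption: content inside these tags is treated as noise.
SKIPPED_TAGS = {"script", "style", "nav", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring content inside noisy tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_tokens(html):
    """Return lowercase word tokens from the non-noise text of a page."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return re.findall(r"\w+", " ".join(extractor.chunks).lower())


sample = (
    '<nav><a href="/">Home</a></nav>'
    "<p>Web crawlers index pages.</p>"
    "<script>var x = 1;</script>"
    "<footer>Copyright 2024</footer>"
)
print(extract_tokens(sample))
```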

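Note: Search engines have historically struggled with non-text content such as images, video, and Flash. JavaScript support is uneven: some modern crawlers render scripts, but many execute none, so content that only appears after script execution may be missed.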

Phase 4: Search Service and Ranking

Processed pages become searchable through keyword queries. Results are ordered by ranking algorithms such as PageRank, with sponsored placements shown alongside them.
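
The idea behind PageRank can be shown with a short power-iteration sketch over a toy link graph. The damping factor of 0.85 and the fixed iteration count are conventional choices assumed here, the three-page graph is invented, and production ranking combines many more signals than this.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank; `links` maps each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    page_count = len(pages)
    rank = {page: 1.0 / page_count for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / page_count for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly across all pages.
                share = damping * rank[page] / page_count
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items()):
    print(page, round(score, 3))
```

Page c ends up ranked above b because it receives links from both a and b, while b is linked only from a.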

General Crawler Limitations

  1. 90% of returned content often proves irrelevant to specific user needs
  2. Cannot personalize results for different domains or user backgrounds
  3. Ineffective at handling multimedia and dynamic content
  4. Keyword-based searches lack semantic understanding

Focused Crawlers

Focused crawlers address these limitations by targeting specific topics. Unlike general crawlers, they filter content during extraction, capturing only relevant information. Our subsequent discussions center on building focused crawlers.

HTTP and HTTPS Protocols

Protocol Fundamentals

HTTP (HyperText Transfer Protocol) is the application-layer protocol used to transfer HTML pages and other web resources between clients and servers. HTTPS runs HTTP over a TLS (formerly SSL) encrypted connection.

  • HTTP default port: 80
  • HTTPS default port: 443

URL Structure

Uniform Resource Locator format: scheme://host[:port]/path[?query][#fragment]

  • scheme: Protocol (http, https, ftp)
  • host: Domain or IP address
  • port: Optional port specification
  • path: Resource location
  • query: Parameter string
  • fragment: Anchor navigation
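
The components listed above map directly onto the fields returned by `urllib.parse.urlsplit`. The URL below is made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL that exercises every component.
url = "https://example.com:8443/docs/page.html?lang=en&page=2#section-3"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.port)      # 8443
print(parts.path)      # /docs/page.html
print(parts.query)     # lang=en&page=2
print(parts.fragment)  # section-3

# parse_qs turns the query string into a dict of lists.
print(parse_qs(parts.query))  # {'lang': ['en'], 'page': ['2']}
```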

HTTP Request Composition

HTTP transactions consist of client requests and server responses. Requests comprise four sections: request line, headers, blank line, and message body.
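
To make the four sections concrete, the sketch below assembles a raw HTTP/1.1 request by hand. The `build_request` helper, the `/login` path, and the form fields are hypothetical; in practice an HTTP library builds this message for you.

```python
def build_request(method, target, headers, body=""):
    """Assemble request line, headers, blank line, and body into one message."""
    request_line = f"{method} {target} HTTP/1.1"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # Lines end with CRLF; the empty string produces the blank separator line.
    return "\r\n".join([request_line, *header_lines, "", body])


body = "username=alice&remember=1"
raw_request = build_request(
    "POST",
    "/login",
    {
        "Host": "example.com",
        "Content-Type": "application/x-www-form-urlencoded",
        "Content-Length": str(len(body.encode("utf-8"))),
    },
    body,
)
print(raw_request)
```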

Browser Request Flow

  1. User enters URL; browser sends a GET request (form submissions typically use POST)
  2. Server returns HTML response
  3. Browser parses HTML, identifies dependent resources (images, CSS, JS)
  4. Browser issues subsequent requests for dependencies
  5. Complete page renders after all resources load

HTTP Methods

HTTP/1.0 defined GET, POST, HEAD. HTTP/1.1 added OPTIONS, PUT, DELETE, TRACE, CONNECT.

Method    Description
GET       Retrieve specified resource
HEAD      Fetch headers only (no body)
POST      Submit data for processing
PUT       Replace target resource entirely
DELETE    Remove specified resource
CONNECT   Establish tunnel to proxy
OPTIONS   Describe communication options
TRACE     Perform message loop-back test

GET vs POST

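  • GET: Data retrieval; parameters travel in the URL query string, so they show up in logs and browser history; browsers and servers cap URL length in practice
  • POST: Data submission; parameters travel in the request body; no protocol-level length limit (servers may still cap body size); used for forms and file uploads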
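
The difference in where parameters travel can be seen without sending anything over the network. This sketch uses the standard library's `urllib.request.Request` purely to build the two request objects; the example.com URL and the parameter names are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "web scraping", "page": 2}
encoded = urlencode(params)  # 'q=web+scraping&page=2'

# GET: the parameters become part of the URL.
get_request = Request("https://example.com/search?" + encoded)

# POST: the same parameters travel in the request body as bytes.
post_request = Request("https://example.com/search", data=encoded.encode("ascii"))

print(get_request.get_method(), get_request.full_url, get_request.data)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Nothing is transmitted here; passing either object to `urllib.request.urlopen` would actually send it.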

Essential Request Headers

1. Host
Specifies target hostname and port from the URL.

2. Connection
Manages persistent connections. HTTP/1.1 defaults to keep-alive, allowing multiple requests over a single TCP connection. Servers respond with Connection: keep-alive to maintain the channel or Connection: close to terminate it.

3. Upgrade-Insecure-Requests
Sent by the browser to tell the server that it prefers an encrypted response and can handle being redirected from HTTP to HTTPS. A value of 1 signals this preference.

4. User-Agent
Identifies the client application, operating system, and version. Many sites inspect this header to detect bots, so scrapers usually send a realistic browser value.

5. Accept
Defines acceptable MIME types for the response. Quality factors (q) indicate preference weights (0.0 to 1.0). Example: Accept: text/html,application/xhtml+xml;q=0.9,*/*;q=0.8

6. Referer
Indicates the originating page URL. Servers use this for analytics and anti-hotlinking protection.

7. Accept-Encoding
Specifies supported compression algorithms (gzip, deflate, br). Enables bandwidth reduction through encoded transfers.

8. Accept-Language
Lists preferred natural languages (en, zh-CN, etc.) for content negotiation.

9. Accept-Charset
Declares supported character encodings (UTF-8, ISO-8859-1, GB2312). If omitted, any encoding is acceptable; modern browsers no longer send this header.

10. Cookie
Transmits stored cookies for session management and user identification.

11. Content-Type
In POST requests, defines the media type of the body. Example: Content-Type: application/json; charset=utf-8
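
The quality factors used by the Accept header can be made concrete with a small parser that orders media types by preference. `parse_accept` is a hypothetical helper written for this illustration; it skips validation that a production parser would need.

```python
def parse_accept(header_value):
    """Return (media_type, q) pairs sorted by descending preference."""
    preferences = []
    for item in header_value.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        quality = 1.0  # q defaults to 1.0 when omitted
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                quality = float(value)
        preferences.append((media_type, quality))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


accept = "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"
for media_type, quality in parse_accept(accept):
    print(media_type, quality)
```

Note that each q value applies only to the media type it is attached to, so `text/html` above carries the default weight of 1.0.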

HTTP Response Structure

Responses contain: status line, headers, blank line, and response body.

Sample Response

HTTP/1.1 200 OK
Server: nginx/1.18.0
Date: Mon, 15 Jan 2024 08:30:15 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 15384
Connection: keep-alive
Cache-Control: max-age=3600
Content-Encoding: gzip
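
A response like the one above can be split into its parts with a few string operations. The sketch below works on a shortened, uncompressed variant of the sample so that it stays self-contained; `parse_response` is a hypothetical helper, and real clients handle many edge cases it ignores.

```python
# Shortened variant of the sample response with a tiny uncompressed body.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Server: nginx/1.18.0\r\n"
    "Date: Mon, 15 Jan 2024 08:30:15 GMT\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<html></html>"
)


def parse_response(raw):
    """Split a raw response into version, status, reason, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")  # blank line ends the headers
    status_line, *header_lines = head.split("\r\n")
    version, status_code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")  # split on the first colon only
        headers[name.strip().lower()] = value.strip()
    return version, int(status_code), reason, headers, body


version, status, reason, headers, body = parse_response(RAW_RESPONSE)
print(version, status, reason)
print(headers["content-type"])
print(body)
```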

Key Response Headers

1. Cache-Control
Directs client caching behavior. no-cache forces revalidation; max-age=3600 permits 1-hour caching.

2. Connection
Mirrors the request's connection preference.

3. Content-Encoding
Indicates response compression method (gzip, deflate).

4. Content-Type
Specifies MIME type and character encoding. Mismatches cause display corruption.

5. Date
Server timestamp in GMT, ensuring timezone consistency.

6. Expires
Legacy caching directive; less reliable than Cache-Control due to clock synchronization issues.

7. Server
Reveals server software version (often masked for security).

8. Transfer-Encoding: chunked
Signals dynamic content delivered in segmented parts, with a zero-length block marking completion.
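
Content-Encoding and Transfer-Encoding stack on top of each other: a client first reassembles the chunks, then decompresses the result. The sketch below shows both steps with hypothetical helpers (`encode_chunked` exists only to produce demo input); libraries such as Requests do this automatically.

```python
import gzip


def decode_chunked(data):
    """Reassemble a chunked body: hex size line, data, CRLF, repeated."""
    body = b""
    position = 0
    while True:
        line_end = data.index(b"\r\n", position)
        # The size is hexadecimal; anything after ';' is a chunk extension.
        size = int(data[position:line_end].split(b";")[0], 16)
        if size == 0:  # zero-length chunk marks the end
            break
        start = line_end + 2
        body += data[start:start + size]
        position = start + size + 2  # skip the CRLF after the chunk data
    return body


def encode_chunked(payload, chunk_size=8):
    """Demo helper: split payload into fixed-size chunks."""
    out = b""
    for offset in range(0, len(payload), chunk_size):
        chunk = payload[offset:offset + chunk_size]
        out += f"{len(chunk):x}".encode("ascii") + b"\r\n" + chunk + b"\r\n"
    return out + b"0\r\n\r\n"


original = b"<html>hello</html>"
wire_body = encode_chunked(gzip.compress(original))      # what the server sends
restored = gzip.decompress(decode_chunked(wire_body))    # what the client recovers
print(restored)
```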

Status Codes

  • 1xx: Informational (request received, continuing)
  • 2xx: Success (200 OK, 201 Created)
  • 3xx: Redirection (301 Moved Permanently, 302 Found, 304 Not Modified)
  • 4xx: Client errors (400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found)
  • 5xx: Server errors (500 Internal Server Error, 503 Service Unavailable)
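
The first digit of a status code determines its class, which is often all a scraper needs to decide what to do next. The sketch below pairs that rule with the reason phrases from the standard library's `http.HTTPStatus`; `describe_status` is a hypothetical helper.

```python
from http import HTTPStatus

CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return the code, its standard reason phrase, and its class."""
    category = CATEGORIES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not in the standard registry
        phrase = "Unregistered status"
    return f"{code} {phrase} ({category})"


for code in (200, 302, 404, 503):
    print(describe_status(code))
```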

Session Management

HTTP's stateless nature requires mechanisms to maintain client context:

  • Cookies: Client-side data storage for session identifiers and preferences
  • Sessions: Server-side storage linked to client cookies
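
The cookie round trip can be sketched with the standard library: the server sets a cookie in a Set-Cookie header, and the client echoes the name and value back in a Cookie header on later requests. The cookie name `sessionid` and its value are made up for this example.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value received from a server.
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie_header)

session_id = jar["sessionid"].value  # the server-side session key
cookie_path = jar["sessionid"]["path"]

# Only name=value pairs go back to the server; attributes stay client-side.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

print(session_id)
print(cookie_path)
print("Cookie:", cookie_header)
```

With the Requests library, a `requests.Session` object performs this bookkeeping automatically across calls.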

Python Requests Library

The Requests library simplifies HTTP operations in Python and is built on urllib3. Current releases support Python 3 and PyPy; Python 2.7 is supported only by older releases (2.27.x and earlier).

Installation: pip install requests

Basic GET Request

import requests

resp = requests.get('https://httpbin.org/get')
print(resp.status_code)        # 200
print(resp.headers['content-type'])  # application/json
print(resp.text)               # JSON response

GET with Parameters and Headers

import requests

search_params = {'q': 'python web scraping', 'per_page': 10}  # per_page limits results per page
custom_headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

response = requests.get(
    'https://api.github.com/search/repositories',
    params=search_params,
    headers=custom_headers
)

print(f"Final URL: {response.url}")
print(response.json()['total_count'])

Image Download Example

import requests

image_url = 'https://httpbin.org/image/png'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

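response = requests.get(image_url, headers=headers, timeout=15)
response.raise_for_status()  # fail early instead of saving an error page as a PNG
image_data = response.content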

with open('sample_image.png', 'wb') as file_handle:
    file_handle.write(image_data)

Forum Pagination Scraper

Scraping the first 10 pages of a discussion forum:

import time
import requests

class ForumCrawler:
    def __init__(self, forum_topic):
        self.topic = forum_topic
        self.base_url = "https://httpbin.org/html?page={}"  # page number filled in via str.format()
        self.request_headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
        }
    
    def generate_page_urls(self):
        """Create list of URLs for pages 1-10"""
        return [self.base_url.format(page_num) for page_num in range(1, 11)]
    
    def fetch_content(self, target_url):
        """Retrieve page content with error handling"""
        try:
            response = requests.get(
                target_url, 
                headers=self.request_headers,
                timeout=15
            )
            response.raise_for_status()
            return response.text
        except requests.RequestException as error:
            print(f"Request failed: {error}")
            return None
    
    def save_document(self, html_content, page_number):
        """Store HTML content to local file"""
        if html_content:
            filename = f"{self.topic}_page_{page_number:02d}.html"
            with open(filename, 'w', encoding='utf-8') as output_file:
                output_file.write(html_content)
            print(f"Saved: {filename}")
    
    def execute(self):
        """Main execution flow"""
        page_urls = self.generate_page_urls()
        
        for index, url in enumerate(page_urls, start=1):
            print(f"Processing page {index}...")
            content = self.fetch_content(url)
            self.save_document(content, index)
            time.sleep(1)  # Respect rate limits

# Usage
if __name__ == "__main__":
    topic = input("Enter forum topic: ")
    scraper = ForumCrawler(topic)
    scraper.execute()

Tags: web-scraping
