Understanding Web Scraping: Core Concepts and HTTP Fundamentals
Introduction to Web Crawlers
A web crawler is a program designed to collect data from the internet. At its core, a crawler simulates a browser to access websites and extract the required information.
Crawlers fall into two main categories: General-Purpose Crawlers and Focused Crawlers.
General-Purpose Crawlers form the backbone of search engines. For instance, Baidu's search bot continuously scans vast amounts of web content and indexes it. When users search for keywords, the engine analyzes the query, retrieves relevant indexed pages, ranks them according to specific algorithms, and displays the results.
These crawlers are expected to follow the robots protocol (robots.txt), a convention that tells search engines which pages may be accessed and which should be avoided. While not legally binding, this protocol is the "gentleman's agreement" of the internet.
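Python's standard library includes a parser for this protocol. A minimal sketch of checking a URL against a site's robots.txt (example.com is a placeholder for a real site):

from urllib import robotparser

# Point the parser at the site's robots.txt file
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# can_fetch() reports whether a given user agent may crawl a URL
print(rp.can_fetch("*", "https://example.com/private/page"))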
Focused Crawlers target specific information needs. They filter pages during crawling, keeping only content relevant to the intended purpose. This approach serves users seeking domain-specific information.
How Web Requests Work
When you access a webpage, the following sequence occurs:
- The browser sends an HTTP request to the server based on the URL
- The server parses and processes the incoming request
- The server prepares the response data
- The server returns the response in HTTP format
- The browser receives, parses, and renders the content
Here's a simple socket-based server demonstrating this flow:
import socket

def process_client(client_conn):
    # Read the raw HTTP request and print it for inspection
    incoming = client_conn.recv(1024)
    print(incoming.decode())
    # Build a minimal HTTP response: status line, headers, blank line, body
    headers = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n\r\n"
    body = "<h1 style='color:blue'>Welcome!</h1>"
    client_conn.send(headers.encode() + body.encode())

def start_server():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts without "address already in use" errors
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('127.0.0.1', 9000))
    server.listen(3)
    while True:
        conn, addr = server.accept()
        process_client(conn)
        conn.close()

if __name__ == '__main__':
    start_server()
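To exercise the full request/response cycle, a raw-socket client can send the same kind of request a browser would. A sketch, assuming the server above is running on 127.0.0.1:9000:

import socket

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', 9000))

# A minimal HTTP request: request line, Host header, blank line
client.send("GET / HTTP/1.1\r\nHost: 127.0.0.1:9000\r\n\r\n".encode())

# Print the raw response: status line, headers, blank line, body
print(client.recv(4096).decode())
client.close()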
URL Structure
A URL (Uniform Resource Locator) identifies resources like HTML files, images, videos, or other web assets.
The format follows this pattern: protocol://hostname[:port]/path[?query]
Using https://example.com/api/users?role=admin&id=5 as an example:
- protocol: The network transfer scheme (http, https, ftp)
- hostname: The domain name or IP address of the server
- port: A number between 0 and 65535. Omitting it uses the default port—80 for HTTP and 443 for HTTPS
- path: The route to the specific resource, typically representing a directory or file path
- query: Parameters appearing after ? and before #, as key-value pairs separated by &
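Python's urllib.parse module splits a URL into exactly these components. Applied to the example above:

from urllib.parse import urlparse, parse_qs

url = "https://example.com/api/users?role=admin&id=5"
parts = urlparse(url)

print(parts.scheme)           # https
print(parts.hostname)         # example.com
print(parts.port)             # None -> the default port (443 for https)
print(parts.path)             # /api/users
print(parse_qs(parts.query))  # {'role': ['admin'], 'id': ['5']}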
HTTP Protocol
HTTP (Hypertext Transfer Protocol) governs data exchange between clients and servers. HTTPS is the secure variant that encrypts data in transit.
HTTP Requests
A request begins with a request line:
GET / HTTP/1.1
Method Request-target Version
The method defines the action to perform. The most common methods are:
- GET: Retrieves data from the server without modifying any resources
- POST: Submits data to the server (form submissions, file uploads) that affects server-side resources
Request headers provide additional context:
- Referer: Shows the originating URL, useful for anti-scraping verification
- User-Agent: Identifies the client application. Servers use this to detect crawlers: a default value such as "python-requests" immediately exposes automated access. Setting this header to mimic a real browser is common practice
- Cookie: Maintains user session state since HTTP is stateless
The request body contains submitted data for POST requests.
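These pieces come together when sending requests programmatically. A sketch using the Requests library mentioned later in this section (the URLs, header values, and form fields are placeholders):

import requests

# Mimic a real browser instead of exposing the default python-requests UA
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/",
}

# GET: retrieve data; headers ride along with the request
resp = requests.get("https://example.com/api/users", headers=headers)

# POST: submit form data in the request body
resp = requests.post("https://example.com/login",
                     headers=headers,
                     data={"username": "alice", "password": "secret"})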
HTTP Responses
HTTP/1.1 200 OK
Version Code Message
Status codes indicate the result:
| Code | Meaning |
|---|---|
| 200 | Success; data returned normally |
| 301 | Permanent redirect |
| 302 | Temporary redirect (e.g., redirecting to login) |
| 400 | Bad request; the server cannot parse or fulfill it |
| 403 | Access forbidden |
| 500 | Internal server error |
Response headers include:
- Content-Length: Size of the returned data in bytes
- Content-Type: MIME type of the response (text/html, application/json, etc.)
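A quick way to inspect status codes and response headers is with Requests (example.com is a placeholder; allow_redirects=False stops 3xx responses from being followed automatically so you can see them):

import requests

# Keep 301/302 responses visible instead of following them
resp = requests.get("https://example.com/old-page", allow_redirects=False)

print(resp.status_code)                    # e.g. 200, 301, 302 ...
print(resp.headers.get("Content-Type"))    # MIME type of the body
print(resp.headers.get("Content-Length"))  # size in bytes, if the server sets it
print(resp.headers.get("Location"))        # redirect target for 3xx responses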
Crawler Development Workflow
Building a crawler typically involves:
- Define the target - Determine which URLs to crawl and what data to extract
- Send requests - Use libraries like Requests or urllib to fetch pages
- Parse responses - Extract the needed data using BeautifulSoup, lxml, or similar tools
- Store results - Save data to databases, Excel files, or document formats
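Putting the four steps together, a minimal sketch using Requests and BeautifulSoup (the URL is a placeholder and the extraction logic is illustrative; it collects every link on the page):

import csv
import requests
from bs4 import BeautifulSoup

# 1. Define the target
url = "https://example.com/articles"

# 2. Send the request with a browser-like User-Agent
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
resp = requests.get(url, headers=headers)
resp.raise_for_status()

# 3. Parse the response: collect each link's text and href
soup = BeautifulSoup(resp.text, "html.parser")
rows = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

# 4. Store the results in a CSV file
with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "href"])
    writer.writerows(rows)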
Browser Developer Tools for Debugging
Developer tools in browsers provide essential capabilities for understanding web traffic:
- Network tab: Captures all network requests
  - ALL filter: Shows complete network traffic
  - XHR filter: Displays asynchronous requests (AJAX calls)
  - JS filter: Shows JavaScript file requests
Click any network entry to examine request/response details in the right panel. The Sources tab allows setting breakpoints and debugging JavaScript, which helps analyze how certain parameters or tokens are generated during crawling.