Fading Coder

One Final Commit for the Last Sprint


Understanding Web Scraping: Core Concepts and HTTP Fundamentals

Notes · May 14

Introduction to Web Crawlers

A web crawler is a program designed to collect data from the internet. At its core, a crawler simulates a browser to access websites and extract the required information.

Crawlers fall into two main categories: General-Purpose Crawlers and Focused Crawlers.

General-Purpose Crawlers form the backbone of search engines. For instance, Baidu's search bot continuously scans vast amounts of web content and indexes it. When users search for keywords, the engine analyzes the query, retrieves relevant indexed pages, ranks them according to specific algorithms, and displays the results.

These crawlers must follow the robots protocol—a convention that tells search engines which pages to access and which to avoid. While not legally binding, this protocol serves as the "gentleman's agreement" of the internet.
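The robots protocol can be checked programmatically. As a minimal sketch using the standard library's `urllib.robotparser`, here is a hypothetical `robots.txt` parsed entirely offline (the rules and URLs are illustrative, not from a real site):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed offline for illustration
rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# Rules are matched in order: /admin/ is blocked, everything else allowed
print(parser.can_fetch("*", "https://example.com/index.html"))   # True
print(parser.can_fetch("*", "https://example.com/admin/users"))  # False
```

A well-behaved crawler calls `can_fetch` before requesting each URL; in practice you would load the live file with `set_url(...)` and `read()` instead of parsing a canned list.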

Focused Crawlers target specific information needs. They filter pages during crawling, keeping only content relevant to the intended purpose. This approach serves users seeking domain-specific information.

How Web Requests Work

When you access a webpage, the following sequence occurs:

  1. The browser sends an HTTP request to the server based on the URL
  2. The server parses and processes the incoming request
  3. The server prepares the response data
  4. The server returns the response in HTTP format
  5. The browser receives, parses, and renders the content

Here's a simple socket-based server demonstrating this flow:

import socket


def process_client(client_conn):
    # Read the raw HTTP request (a single recv is enough for this demo)
    incoming = client_conn.recv(1024)
    print(incoming.decode())
    # Build a minimal HTTP response: status line, headers, blank line, body
    headers = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n\r\n"
    body = "<h1 style='color:blue'>Welcome!</h1>"
    client_conn.send(headers.encode() + body.encode())


def start_server():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts without "Address already in use" errors
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('127.0.0.1', 9000))
    server.listen(3)

    while True:
        # Handle one connection at a time, then close it
        conn, addr = server.accept()
        process_client(conn)
        conn.close()


if __name__ == '__main__':
    start_server()
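To see the other side of this exchange, here is a rough sketch—no network needed—of the raw text a client would send in step 1 and how the status line of the reply from step 4 can be split apart. The header values and the sample response are illustrative:

```python
# Step 1: the raw text a browser (or crawler) sends to the server above
request = (
    "GET / HTTP/1.1\r\n"
    "Host: 127.0.0.1:9000\r\n"
    "User-Agent: Mozilla/5.0\r\n"
    "\r\n"  # blank line marks the end of the headers
)

# Step 5: parsing the status line of a sample server reply
response = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n\r\n<h1>Welcome!</h1>"
version, code, message = response.split("\r\n")[0].split(" ", 2)
print(version, code, message)  # HTTP/1.1 200 OK
```

The blank line after the headers is what lets the server (or browser) know where the headers stop and the body begins.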

URL Structure

A URL (Uniform Resource Locator) identifies resources like HTML files, images, videos, or other web assets.

The format follows this pattern: protocol://hostname[:port]/path[?query][#fragment]

Using https://example.com/api/users?role=admin&id=5 as an example:

  • protocol: The network transfer scheme (http, https, ftp)
  • hostname: The domain name or IP address of the server
  • port: A number between 0 and 65535. Omitting it uses the default port—80 for HTTP and 443 for HTTPS
  • path: The route to the specific resource, typically representing a directory or file path
  • query: Parameters starting after ? and before #, containing key-value pairs separated by &
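The standard library's `urllib.parse` can split the example URL into exactly these components; a quick sketch:

```python
from urllib.parse import urlparse, parse_qs

# Splitting the example URL into the components listed above
parts = urlparse("https://example.com/api/users?role=admin&id=5")

print(parts.scheme)           # https
print(parts.hostname)         # example.com
print(parts.port)             # None -> the default port (443) applies
print(parts.path)             # /api/users
print(parse_qs(parts.query))  # {'role': ['admin'], 'id': ['5']}
```

Note that `port` is `None` when omitted from the URL; the default for the scheme is used at connection time.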

HTTP Protocol

HTTP (Hypertext Transfer Protocol) governs data exchange between clients and servers. HTTPS is the secure variant that encrypts transmitted data.

HTTP Requests

A request consists of:

GET         /                        HTTP/1.1
Method     Request-target            Version

The method defines the action to perform. The most common methods are:

  • GET: Retrieves data from the server without modifying any resources
  • POST: Submits data to the server (form submissions, file uploads) that affects server-side resources

Request headers provide additional context:

  • Referer: Shows the originating URL, useful for anti-scraping verification
  • User-Agent: Identifies the client application. Servers use this to detect crawlers—if the User-Agent is "Python", it easily reveals automated access. Modifying this header to mimic a real browser is a common practice
  • Cookie: Maintains user session state since HTTP is stateless

The request body contains submitted data for POST requests.
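Setting these headers is straightforward with the standard library's `urllib.request`. A minimal sketch, building (but not sending) a request with a browser-like User-Agent and a Referer; the URL and header values are illustrative:

```python
import urllib.request

# A browser-like User-Agent string (illustrative; real strings are longer)
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Without this, urllib identifies itself as "Python-urllib/3.x",
# which servers can easily spot as automated access
req = urllib.request.Request(
    "https://example.com/api/users",
    headers={"User-Agent": ua, "Referer": "https://example.com/"},
)

print(req.has_header("User-agent"))  # True (urllib capitalizes header names this way)
```

Passing the request to `urllib.request.urlopen(req)` would then send it with these headers; the Requests library accepts a similar `headers=` dict.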

HTTP Responses

HTTP/1.1   200     OK
Version   Code    Message

Status codes indicate the result:

Code  Meaning
200   Success; data returned normally
301   Permanent redirect
302   Temporary redirect (e.g., redirecting to login)
400   Bad request (malformed syntax or parameters)
403   Access forbidden
500   Internal server error
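The standard library ships these codes and their reason phrases in `http.HTTPStatus`, which is handy when interpreting responses in a crawler; a quick sketch:

```python
from http import HTTPStatus

# Look up the standard reason phrase for each code in the table above
for code in (200, 301, 302, 400, 403, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)
```

This prints the canonical phrases (200 OK, 301 Moved Permanently, 302 Found, 400 Bad Request, 403 Forbidden, 500 Internal Server Error).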

Response headers include:

  • Content-Length: Size of the returned data in bytes
  • Content-Type: MIME type of the response (text/html, application/json, etc.)

Crawler Development Workflow

Building a crawler typically involves:

  1. Define the target - Determine which URLs to crawl and what data to extract
  2. Send requests - Use libraries like Requests or urllib to fetch pages
  3. Parse responses - Extract the needed data using BeautifulSoup, lxml, or similar tools
  4. Store results - Save data to databases, Excel files, or document formats
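The four steps above can be sketched end to end with only the standard library. Here the fetched page is a canned string standing in for a real request, and `LinkExtractor` is a hypothetical helper for the parsing step:

```python
from html.parser import HTMLParser

# Step 2 is simulated with a canned page so the sketch runs offline;
# a real crawler would fetch this with Requests or urllib
page = """
<html><body>
  <a href="/post/1">First post</a>
  <a href="/post/2">Second post</a>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Step 3: collect every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")


extractor = LinkExtractor()
extractor.feed(page)

# Step 4 would write these to a database or file; here we just print them
print(extractor.links)  # ['/post/1', '/post/2']
```

BeautifulSoup or lxml would replace the hand-rolled parser class in a real project; the overall fetch–parse–store shape stays the same.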

Browser Developer Tools for Debugging

Developer tools in browsers provide essential capabilities for understanding web traffic:

  • Network tab: Captures all network requests
  • ALL filter: Shows complete network traffic
  • XHR filter: Displays asynchronous requests (AJAX calls)
  • JS filter: Shows JavaScript file requests

Click any network entry to examine request/response details in the right panel. The Sources tab allows setting breakpoints and debugging JavaScript, which helps analyze how certain parameters or tokens are generated during crawling.
