Understanding Web Scraping: Core Concepts and HTTP Fundamentals
Introduction to Web Crawlers
A web crawler is a program designed to collect data from the internet. At its core, a crawler simulates a browser to access websites and extract the required information.
Crawlers fall into two main categories: General-Purpose Crawlers and Focused Crawlers.
General-Purpose Crawlers form the backbone of search engines. For instance, Baidu's search bot continuously scans vast amounts of web content and indexes it. When users search for keywords, the engine analyzes the query, retrieves relevant indexed pages, ranks them according to specific algorithms, and displays the results.
These crawlers are expected to follow the robots protocol (robots.txt), a convention that tells search engines which pages may be accessed and which should be avoided. While not legally binding, this protocol is the "gentleman's agreement" of the internet.
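Python's standard library includes a parser for this protocol. A minimal sketch of checking a URL against a site's robots.txt (example.com is a placeholder for a real site):

from urllib import robotparser

# Point the parser at the site's robots.txt file
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# can_fetch() reports whether a given user agent may crawl a URL
print(rp.can_fetch("*", "https://example.com/private/page"))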
Focused Crawlers target specific information needs. They filter pages during crawling, keeping only content relevant to the intended purpose. This approach serves users seeking domain-specific information.
How Web Requests Work
When you access a webpage, the following sequence occurs:
- The browser sends an HTTP request to the server based on the URL
- The server parses and processes the incoming request
- The server prepares the response data
- The server returns the response in HTTP format
- The browser receives, parses, and renders the content
Here's a simple socket-based server demonstrating this flow:
import socket

def process_client(client_conn):
    # Read the raw HTTP request and print it for inspection
    incoming = client_conn.recv(1024)
    print(incoming.decode())
    # Build a minimal HTTP response: status line, headers, blank line, body
    headers = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n\r\n"
    body = "<h1 style='color:blue'>Welcome!</h1>"
    client_conn.send(headers.encode() + body.encode())

def start_server():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts without "address already in use" errors
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('127.0.0.1', 9000))
    server.listen(3)
    while True:
        conn, addr = server.accept()
        process_client(conn)
        conn.close()

if __name__ == '__main__':
    start_server()
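To exercise the full request/response cycle, a raw-socket client can send the same kind of request a browser would. A sketch, assuming the server above is running on 127.0.0.1:9000:

import socket

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', 9000))

# A minimal HTTP request: request line, Host header, blank line
client.send("GET / HTTP/1.1\r\nHost: 127.0.0.1:9000\r\n\r\n".encode())

# Print the raw response: status line, headers, blank line, body
print(client.recv(4096).decode())
client.close()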
URL Structure
A URL (Uniform Resource Locator) identifies resources like HTML files, images, videos, or other web assets.
The format follows this pattern: protocol://hostname[:port]/path[?query]
Using https://example.com/api/users?role=admin&id=5 as an example:
- protocol: The network transfer scheme (http, https, ftp)
- hostname: The domain name or IP address of the server
- port: A number between 0 and 65535. Omitting it uses the default port—80 for HTTP and 443 for HTTPS
- path: The route to the specific resource, typically representing a directory or file path
- query: Parameters appearing after ? and before #, as key-value pairs separated by &
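Python's urllib.parse module splits a URL into exactly these components. Applied to the example above:

from urllib.parse import urlparse, parse_qs

url = "https://example.com/api/users?role=admin&id=5"
parts = urlparse(url)

print(parts.scheme)           # https
print(parts.hostname)         # example.com
print(parts.port)             # None -> the default port (443 for https)
print(parts.path)             # /api/users
print(parse_qs(parts.query))  # {'role': ['admin'], 'id': ['5']}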
HTTP Protocol
HTTP (Hypertext Transfer Protocol) governs data exchange between clients and servers. HTTPS is the secure variant that encrypts data in transit.
HTTP Requests
A request begins with a request line:
GET / HTTP/1.1
Method Request-target Version
The method defines the action to perform. The most common methods are:
- GET: Retrieves data from the server without modifying any resources
- POST: Submits data to the server (form submissions, file uploads) that affects server-side resources
Request headers provide additional context:
- Referer: Shows the originating URL, useful for anti-scraping verification
- User-Agent: Identifies the client application. Servers use this to detect crawlers: a default value such as "python-requests" immediately exposes automated access. Setting this header to mimic a real browser is common practice
- Cookie: Maintains user session state since HTTP is stateless
The request body contains submitted data for POST requests.
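These pieces come together when sending requests programmatically. A sketch using the Requests library mentioned later in this section (the URLs, header values, and form fields are placeholders):

import requests

# Mimic a real browser instead of exposing the default python-requests UA
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/",
}

# GET: retrieve data; headers ride along with the request
resp = requests.get("https://example.com/api/users", headers=headers)

# POST: submit form data in the request body
resp = requests.post("https://example.com/login",
                     headers=headers,
                     data={"username": "alice", "password": "secret"})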
HTTP Responses
HTTP/1.1 200 OK
Version Code Message
Status codes indicate the result:
| Code | Meaning |
|---|---|
| 200 | Success; data returned normally |
| 301 | Permanent redirect |
| 302 | Temporary redirect (e.g., redirecting to login) |
| 400 | Bad request; the server cannot parse or fulfill it |
| 403 | Access forbidden |
| 500 | Internal server error |
Response headers include:
- Content-Length: Size of the returned data in bytes
- Content-Type: MIME type of the response (text/html, application/json, etc.)
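A quick way to inspect status codes and response headers is with Requests (example.com is a placeholder; allow_redirects=False stops 3xx responses from being followed automatically so you can see them):

import requests

# Keep 301/302 responses visible instead of following them
resp = requests.get("https://example.com/old-page", allow_redirects=False)

print(resp.status_code)                    # e.g. 200, 301, 302 ...
print(resp.headers.get("Content-Type"))    # MIME type of the body
print(resp.headers.get("Content-Length"))  # size in bytes, if the server sets it
print(resp.headers.get("Location"))        # redirect target for 3xx responses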
Crawler Development Workflow
Building a crawler typically involves:
- Define the target - Determine which URLs to crawl and what data to extract
- Send requests - Use libraries like Requests or urllib to fetch pages
- Parse responses - Extract the needed data using BeautifulSoup, lxml, or similar tools
- Store results - Save data to databases, Excel files, or document formats
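Putting the four steps together, a minimal sketch using Requests and BeautifulSoup (the URL is a placeholder and the extraction logic is illustrative; it collects every link on the page):

import csv
import requests
from bs4 import BeautifulSoup

# 1. Define the target
url = "https://example.com/articles"

# 2. Send the request with a browser-like User-Agent
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
resp = requests.get(url, headers=headers)
resp.raise_for_status()

# 3. Parse the response: collect each link's text and href
soup = BeautifulSoup(resp.text, "html.parser")
rows = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

# 4. Store the results in a CSV file
with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "href"])
    writer.writerows(rows)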
Browser Developer Tools for Debugging
Developer tools in browsers provide essential capabilities for understanding web traffic:
- Network tab: Captures all network requests
  - ALL filter: Shows complete network traffic
  - XHR filter: Displays asynchronous requests (AJAX calls)
  - JS filter: Shows JavaScript file requests
Click any network entry to examine request/response details in the right panel. The Sources tab allows setting breakpoints and debugging JavaScript, which helps analyze how certain parameters or tokens are generated during crawling.