Python Web Scraping Basics: The requests Library
1. Installing requests
The requests library provides features including URL retrieval, HTTP persistent connections and connection pooling, browser-style SSL verification, authentication, cookie sessions, chunked file uploads, streaming downloads, HTTP(S) proxy support, and connection timeout handling.
Since requests is not part of Python's standard library, it needs to be installed separately.
Open a terminal or command prompt and run:
pip install requests
2. Request Methods
2.1 GET Method
Format:
requests.get(url, params=None, **kwargs)
Parameters:
url: The URL to request.
params: A dictionary or byte sequence appended to the URL as query parameters.
**kwargs: Additional parameters controlling the request (headers, cookies, timeout, proxies, etc.).
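To see how params becomes a query string, you can prepare a request without sending it. The endpoint and field names below are only examples:

```python
import requests

# Example query fields appended to the URL as ?q=python&page=2
payload = {"q": "python", "page": "2"}

# prepare() builds the final request object without touching the network
prepared = requests.Request("GET", "https://httpbin.org/get", params=payload).prepare()
print(prepared.url)  # https://httpbin.org/get?q=python&page=2
```

In everyday use you would simply call requests.get(url, params=payload) and let the library send the prepared request for you.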
When you call requests.get() or requests.post(), it returns a Response object. Here are the properties and methods available on the Response object:
status_code: HTTP response status code
headers: Response headers
request.headers: Request headers
url: The requested URL
encoding: Character encoding inferred from HTTP headers
apparent_encoding: Character encoding analyzed from response content
content: Binary response content (automatically decodes gzip and deflate encoded responses)
text: Text response content, decoded using encoding
json(): Returns the response body parsed as JSON
raise_for_status(): Raises requests.HTTPError if the status code indicates an error (4xx or 5xx)
Example:
import requests
target_url = "https://www.baidu.com/"
response = requests.get(target_url, params=None)
print("Status code:", response.status_code)
print("Response headers:", response.headers)
print("Request headers:", response.request.headers)
print("Request URL:", response.url)
print("Encoding from headers:", response.encoding)
print("Apparent encoding:", response.apparent_encoding)
print("Binary content:", response.content)
print("Text content:", response.text)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx; returns None on success
Output:
The text content retrieved may appear garbled. To fix this encoding issue, add the following line:
response.encoding = response.apparent_encoding
Output after fix:
2.2 POST Method
Format:
requests.post(url, data=None, json=None, **kwargs)
Parameters:
url: The URL to request.
data: Dictionary, byte sequence, or file object sent as the request body.
json: A JSON-serializable object sent as the request body with Content-Type application/json.
**kwargs: Additional parameters controlling the request (headers, cookies, timeout, proxies, etc.).
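The difference between data and json is easiest to see in the prepared request body. The endpoint and field below are placeholders; no request is actually sent:

```python
import requests

# data= produces a form-encoded body
form = requests.Request("POST", "https://httpbin.org/post",
                        data={"user": "alice"}).prepare()
print(form.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form.body)                     # user=alice

# json= serializes the object and sets a JSON content type
as_json = requests.Request("POST", "https://httpbin.org/post",
                           json={"user": "alice"}).prepare()
print(as_json.headers["Content-Type"])  # application/json
print(as_json.body)
```

In practice you would call requests.post(url, data=...) or requests.post(url, json=...) directly; requests prepares and sends the same bodies shown here.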
2.3 Custom Headers
In the requests library, you can pass a dictionary containing User-Agent information directly to the headers parameter in get() or post() to customize request headers.
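A minimal sketch of a custom User-Agent header; the UA string is just an illustrative browser-style value. Preparing the request shows exactly which headers would be sent:

```python
import requests

# Browser-style User-Agent (illustrative value)
custom_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Inspect the outgoing headers without hitting the network
prepared = requests.Request("GET", "https://www.baidu.com/",
                            headers=custom_headers).prepare()
print(prepared.headers["User-Agent"])

# A real request would be: requests.get(url, headers=custom_headers)
```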
2.4 Setting Cookies
Cookies record a user's personal information when visiting web pages and are stored on the client side. When the user revisits the same site, cookies can maintain the login state.
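A dictionary passed to the cookies parameter is turned into a Cookie request header. The cookie name and value below are hypothetical:

```python
import requests

# Hypothetical session cookie copied from a logged-in browser
jar = {"sessionid": "abc123"}

# prepare() shows the Cookie header that would be sent
prepared = requests.Request("GET", "https://www.example.com/",
                            cookies=jar).prepare()
print(prepared.headers["Cookie"])  # sessionid=abc123

# A real request would be: requests.get(url, cookies=jar)
```

For cookies that should persist across several requests automatically, requests also provides requests.Session().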
2.5 Setting Timeout
When scraping web pages, sometimes the server may not respond promptly, causing the program to wait indefinitely. The requests library solves this by allowing you to set the timeout parameter in get() or post(). If the server has not responded within the specified number of seconds, the program stops waiting and raises a requests.Timeout exception.
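A small sketch of handling the timeout: the helper name below is our own, and the exception type is requests.Timeout as described above.

```python
import requests

def fetch_with_timeout(url, seconds):
    """Return the Response, or None if the server does not answer in time."""
    try:
        return requests.get(url, timeout=seconds)
    except requests.Timeout:
        # The server failed to respond within `seconds`
        return None
```

requests.Timeout covers both connection and read timeouts; to control them separately, pass a (connect, read) tuple such as timeout=(3, 10).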
HTTP requests using the requests library may raise exceptions such as requests.HTTPError, requests.URLRequired, and requests.Timeout. The Response object's raise_for_status() method raises requests.HTTPError when the HTTP status code indicates an error (4xx or 5xx); wrap the call in a try/except block to handle such failures.
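To illustrate raise_for_status() without making a network request, the sketch below builds a Response object by hand; in real code the object would come from get() or post():

```python
import requests

# Hand-built Response for illustration only (normally returned by requests.get)
response = requests.models.Response()
response.status_code = 404

try:
    response.raise_for_status()  # 4xx status -> raises requests.HTTPError
except requests.HTTPError as err:
    print("Request failed:", err)
```

With a 2xx status code, raise_for_status() simply returns None and execution continues.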
2.6 Downloading Binary Files
Files such as images, audio, and video are fundamentally composed of binary data with specific storage formats and corresponding decoding methods. To scrape these files, you need to retrieve their binary data.
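The pattern is to read the response's binary content and write it to disk in binary mode. The helper names and the URL below are illustrative:

```python
import requests

def save_binary(content, path):
    """Write raw bytes (image, audio, or video data) to disk."""
    with open(path, "wb") as f:  # "wb": binary write mode
        f.write(content)
    return path

def download_binary(url, path):
    """Fetch a file and store its binary payload."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()              # stop on 4xx/5xx errors
    return save_binary(response.content, path)  # content holds the raw bytes

# Usage (hypothetical URL):
# download_binary("https://www.example.com/logo.png", "logo.png")
```

For large files, requests.get(url, stream=True) combined with Response.iter_content() avoids loading the whole file into memory at once.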