Python Web Scraping Basics: The requests Library

Tech Apr 22 20

1. Installing requests

The requests library provides features including URL retrieval, HTTP persistent connections and connection pooling, browser-style SSL verification, authentication, cookie sessions, chunked file uploads, streaming downloads, HTTP(S) proxy support, and connection timeout handling.

Since requests is not part of Python's standard library, it needs to be installed separately.

Open a terminal or command prompt and run:

pip install requests
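
To confirm the installation worked, a quick check such as the following can be run in a Python interpreter. It only imports the package and prints its version string; the exact version shown depends on what pip installed.

```python
import requests

# If the import succeeds, the package is installed in this environment.
print("requests version:", requests.__version__)
```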

2. Request Methods

2.1 GET Method

Format:

requests.get(url, params=None, **kwargs)

Parameters:

url: The URL to request.
params: A dictionary or byte sequence appended to the URL as query parameters.
**kwargs: Additional parameters controlling the request (headers, cookies, timeout, proxies, etc.).
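
To see how params ends up in the URL, here is a minimal sketch that builds a request without sending it. The example.com address and the query values are placeholders chosen for illustration, not part of the original article.

```python
import requests

# Hypothetical query values, used only to show the encoding.
query = {"q": "python", "page": 2}

# Request(...).prepare() builds the request locally; nothing is sent.
prepared = requests.Request(
    "GET", "https://example.com/search", params=query
).prepare()

# The dictionary is URL-encoded and appended after "?".
print(prepared.url)
```

Calling requests.get("https://example.com/search", params=query) would request the same URL.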

When you call requests.get() or requests.post(), it returns a Response object. Here are the properties and methods available on the Response object:

status_code: HTTP response status code
headers: Response headers
request.headers: Request headers
url: The requested URL
encoding: Character encoding inferred from HTTP headers
apparent_encoding: Character encoding analyzed from response content
content: Binary response content (automatically decodes gzip and deflate encoded responses)
text: Text response content
json(): Returns JSON parsed data
raise_for_status(): Raises requests.HTTPError if the status code indicates an error (4xx or 5xx); does nothing otherwise
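
The behavior of raise_for_status() across status codes can be sketched as follows. The Response objects here are built by hand purely for illustration (an assumption; in real code they come from requests.get() or requests.post()), so no request is sent. The helper name status_outcome and the URL are hypothetical.

```python
import requests


def status_outcome(status_code):
    """Return "ok" if raise_for_status() passes, else the error message."""
    # Hand-built Response for illustration only; no network traffic.
    resp = requests.Response()
    resp.status_code = status_code
    resp.url = "https://example.com/page"  # placeholder URL
    try:
        resp.raise_for_status()
    except requests.HTTPError as exc:
        return f"HTTPError: {exc}"
    return "ok"


for code in (200, 204, 404, 503):
    print(code, "->", status_outcome(code))
```

Success codes other than 200, such as 204, pass without an exception; only 4xx and 5xx codes raise.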

Example:

import requests

target_url = "https://www.baidu.com/"
response = requests.get(target_url, params=None)

print("Status code:", response.status_code)
print("Response headers:", response.headers)
print("Request headers:", response.request.headers)
print("Request URL:", response.url)
print("Encoding from headers:", response.encoding)
print("Apparent encoding:", response.apparent_encoding)
print("Binary content:", response.content)
print("Text content:", response.text)
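# raise_for_status() returns None on success and raises HTTPError on a 4xx/5xx status,
# so call it directly instead of printing its result.
response.raise_for_status()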

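The retrieved text may appear garbled, typically because the Content-Type header carries no charset and requests then falls back to ISO-8859-1 for text responses. To fix this, set the encoding before reading response.text: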

response.encoding = response.apparent_encoding

2.2 POST Method

Format:

requests.post(url, data=None, json=None, **kwargs)

Parameters:

url: The URL to request.
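data: A dictionary, list of tuples, byte sequence, or file object sent as the request body.
json: A JSON-serializable object sent as the request body, with Content-Type set to application/json.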
**kwargs: Additional parameters controlling the request (headers, cookies, timeout, proxies, etc.).
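
The difference between data and json can be sketched by preparing two requests locally and comparing them. The endpoint and payload below are placeholders for illustration; nothing is sent over the network.

```python
import requests

url = "https://example.com/login"  # hypothetical endpoint
payload = {"user": "alice", "lang": "en"}  # hypothetical form fields

# data=... is form-encoded; json=... is serialized as a JSON document.
form_req = requests.Request("POST", url, data=payload).prepare()
json_req = requests.Request("POST", url, json=payload).prepare()

print(form_req.headers["Content-Type"], "|", form_req.body)
print(json_req.headers["Content-Type"], "|", json_req.body)
```

The equivalent live calls would be requests.post(url, data=payload) and requests.post(url, json=payload).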

2.3 Custom Headers

In the requests library, you can customize request headers by passing a dictionary, for example one containing a User-Agent string, to the headers parameter of get() or post().
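
A minimal sketch of passing custom headers follows. The User-Agent string, the helper name fetch_with_headers, and the example.com URL are assumptions made for illustration. The prepared request at the end is built locally so the header can be inspected without network access.

```python
import requests

# Hypothetical User-Agent value; substitute whatever your scraper should send.
CUSTOM_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def fetch_with_headers(url, timeout=10):
    """Send a GET request carrying the custom headers."""
    return requests.get(url, headers=CUSTOM_HEADERS, timeout=timeout)


# Build the request locally to inspect the header that would be sent.
prepared = requests.Request(
    "GET", "https://example.com/", headers=CUSTOM_HEADERS
).prepare()
print(prepared.headers["User-Agent"])
```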

2.4 Setting Cookies

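Cookies record information about a user's visit and are stored on the client side. When the user revisits the same site, the cookies are sent back to the server, which lets it maintain state such as a login session. In requests, cookies can be passed as a dictionary to the cookies parameter of get() or post().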
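
As a rough sketch, a cookie dictionary passed through the cookies parameter becomes a Cookie request header. The cookie name and value below are invented for illustration, and the request is only prepared locally, not sent.

```python
import requests

# Hypothetical cookie, e.g. copied from a logged-in browser session.
cookies = {"sessionid": "abc123"}

prepared = requests.Request(
    "GET", "https://example.com/profile", cookies=cookies
).prepare()

# The dictionary is serialized into the Cookie header.
print(prepared.headers["Cookie"])
```

The live equivalent is requests.get(url, cookies=cookies). When cookies set by the server should carry over between requests, a requests.Session() object keeps them automatically.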

2.5 Setting Timeout

When scraping web pages, sometimes the server may not respond promptly, causing the program to wait indefinitely. The requests library solves this by allowing you to set the timeout parameter in get() or post(). The program will stop waiting after the specified number of seconds and raise a Timeout exception.

HTTP requests made with the requests library may raise exceptions such as requests.HTTPError, requests.URLRequired, and requests.Timeout. The Response object's raise_for_status() method raises requests.HTTPError when the status code indicates a client or server error (4xx or 5xx), which you can then catch with try/except.
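
One way to combine a timeout with exception handling is sketched below. The helper name fetch_or_none and the 5-second default are assumptions for illustration, not values from the article.

```python
import requests


def fetch_or_none(url, timeout=5):
    """Return the page text, or None on a timeout or an error status."""
    try:
        # Stop waiting if the server does not respond within `timeout` seconds.
        resp = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx/5xx responses.
        resp.raise_for_status()
    except requests.Timeout:
        print(f"Timed out after {timeout} s: {url}")
        return None
    except requests.HTTPError as exc:
        print(f"Bad status: {exc}")
        return None
    return resp.text
```

A call such as fetch_or_none("https://www.baidu.com/", timeout=3) returns the page text on success and None otherwise, so the caller can decide whether to retry.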

2.6 Downloading Binary Files

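Files such as images, audio, and video are fundamentally composed of binary data with specific storage formats and corresponding decoding methods. To scrape these files, you need to retrieve their binary data, available through the Response object's content attribute, and write it to a file opened in binary mode.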
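
A minimal sketch of saving a binary file is shown below. Instead of reading response.content in one go, it streams the body in chunks with stream=True and iter_content(), which avoids holding a large file in memory. The helper name download_file, the chunk size, and the timeout are assumptions for illustration.

```python
import requests


def download_file(url, dest_path, chunk_size=8192, timeout=10):
    """Stream the body at `url` into `dest_path`; return bytes written."""
    written = 0
    # stream=True defers downloading the body until it is iterated.
    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        # "wb" writes raw bytes without any text decoding.
        with open(dest_path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
                written += len(chunk)
    return written
```

For small files, writing response.content directly to a file opened with "wb" is a simpler alternative.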


Copyright © fadingcoder.top
