Python Web Scraping Basics: The requests Library

1. Installing requests

The requests library provides features including URL retrieval, HTTP persistent connections and connection pooling, browser-style SSL verification, authentication, cookie sessions, chunked file uploads, streaming downloads, HTTP(S) proxy support, and connection timeout handling.

Since requests is not part of Python's standard library, it needs to be installed separately.

Open a terminal or command prompt and run:

pip install requests

2. Request Methods

2.1 GET Method

Format:

requests.get(url, params=None, **kwargs)

Parameters:

  1. url: The URL to request.
  2. params: A dictionary or byte sequence appended to the URL as query parameters.
  3. **kwargs: Additional parameters controlling the request (headers, cookies, timeout, proxies, etc.).
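To make the role of params concrete, here is a minimal sketch that builds a GET request without sending it and prints the resulting URL. The endpoint and the query values are placeholders chosen for illustration, not part of the original article.

```python
import requests

# Hypothetical endpoint and query values, used only for illustration.
base_url = "https://example.com/search"
query = {"q": "python requests", "page": 2}

# Build the request without sending it, so the final URL can be inspected.
prepared = requests.Request("GET", base_url, params=query).prepare()

# The dictionary is URL-encoded and appended after "?".
print(prepared.url)
```

Calling requests.get(base_url, params=query) would send the same URL over the network.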

When you call requests.get() or requests.post(), it returns a Response object. Here are the properties and methods available on the Response object:

  • status_code: HTTP response status code
  • headers: Response headers
  • request.headers: Request headers
  • url: The final URL of the response (after any redirects)
  • encoding: Character encoding inferred from HTTP headers
  • apparent_encoding: Character encoding analyzed from response content
  • content: Binary response content (automatically decodes gzip and deflate encoded responses)
  • text: Text response content
  • json(): Returns JSON parsed data
  • raise_for_status(): Raises requests.HTTPError if the status code indicates a client or server error (4xx or 5xx); otherwise returns None

Example:

import requests

target_url = "https://www.baidu.com/"
response = requests.get(target_url, params=None)

print("Status code:", response.status_code)
print("Response headers:", response.headers)
print("Request headers:", response.request.headers)
print("Request URL:", response.url)
print("Encoding from headers:", response.encoding)
print("Apparent encoding:", response.apparent_encoding)
print("Binary content:", response.content)
print("Text content:", response.text)
# raise_for_status() returns None on success and raises requests.HTTPError on a 4xx/5xx response
response.raise_for_status()

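The text content may appear garbled. This happens when the Content-Type header declares a text type without a charset, because requests then falls back to ISO-8859-1. To fix the encoding, set it from the detected value before reading response.text: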

response.encoding = response.apparent_encoding

2.2 POST Method

Format:

requests.post(url, data=None, json=None, **kwargs)

Parameters:

  1. url: The URL to request.
  2. data: Dictionary, byte sequence, or file object sent as the request body.
  3. json: A JSON-serializable object sent as the request body.
  4. **kwargs: Additional parameters controlling the request (headers, cookies, timeout, proxies, etc.).
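The difference between data and json can be seen by preparing two POST requests without sending them. This is a sketch under assumptions: the endpoint and the field names are hypothetical.

```python
import json

import requests

url = "https://example.com/api/login"  # hypothetical endpoint
payload = {"user": "alice", "lang": "en"}  # hypothetical fields

# data= with a dictionary is sent as a URL-encoded form body.
form_request = requests.Request("POST", url, data=payload).prepare()
print(form_request.headers["Content-Type"])
print(form_request.body)

# json= serializes the object and sets a JSON Content-Type header.
json_request = requests.Request("POST", url, json=payload).prepare()
print(json_request.headers["Content-Type"])
print(json.loads(json_request.body))
```

To actually send either request, call requests.post(url, data=payload) or requests.post(url, json=payload).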

2.3 Custom Headers

To customize request headers, pass a dictionary of header fields (for example, a User-Agent string) to the headers parameter of get() or post().
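A minimal sketch of passing custom headers follows. The header values and the fetch_page helper are illustrative assumptions; pick values that suit the site you are requesting.

```python
import requests

# Hypothetical header values, for illustration only.
custom_headers = {
    "User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)",
    "Accept-Language": "en-US,en;q=0.8",
}


def fetch_page(url, timeout=10):
    """Send a GET request with the custom headers and return the Response."""
    return requests.get(url, headers=custom_headers, timeout=timeout)


# Inspect the headers on a prepared (unsent) request.
prepared = requests.Request(
    "GET", "https://example.com/", headers=custom_headers
).prepare()
print(prepared.headers["User-Agent"])
```

After a real request, response.request.headers shows the headers that were sent, including the defaults that requests adds.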

2.4 Setting Cookies

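Cookies are small pieces of data that a website stores on the client side to record information about a user, such as a session identifier. When the user revisits the same site, the client sends the cookies back, which lets the server maintain the login state. In requests, cookies can be passed through the cookies parameter of get() or post().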
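The sketch below shows two common ways to send cookies: once per request, or through a Session that keeps them for later requests. The URL, the cookie name, and the cookie value are made up for illustration; the requests are prepared but not sent.

```python
import requests

url = "https://example.com/profile"  # hypothetical page that requires login

# Option 1: pass cookies for a single request.
one_off = requests.Request("GET", url, cookies={"sessionid": "abc123"}).prepare()
print(one_off.headers["Cookie"])

# Option 2: store cookies on a Session so every request made through it sends them.
session = requests.Session()
session.cookies.set("sessionid", "abc123")
via_session = session.prepare_request(requests.Request("GET", url))
print(via_session.headers["Cookie"])
```

A Session also stores cookies that the server sets in its responses, so logging in through session.post() and then calling session.get() usually keeps the login state without handling cookies by hand.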

2.5 Setting Timeout

When scraping web pages, the server sometimes does not respond promptly, and without a timeout the program may wait indefinitely. To avoid this, set the timeout parameter in get() or post(). The program stops waiting after the specified number of seconds and raises a requests.Timeout exception.
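One way to handle a timeout is sketched below. The fetch_text helper and its optional session argument are assumptions made for this example, not part of the requests API.

```python
import requests


def fetch_text(url, timeout=(3.05, 10), session=None):
    """Return the page text, or None if the server does not answer in time.

    timeout may be a single number (used for both connecting and reading)
    or a (connect, read) tuple, in seconds.
    """
    # Use the given session if there is one, otherwise the module-level API.
    client = session if session is not None else requests
    try:
        response = client.get(url, timeout=timeout)
    except requests.Timeout:
        print(f"Gave up on {url}: no response within {timeout} seconds")
        return None
    return response.text
```

Note that the timeout limits how long requests waits for the server to respond, not the total time needed to download the whole response.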

HTTP requests made with the requests library may raise exceptions such as requests.HTTPError, requests.URLRequired, and requests.Timeout. The Response object's raise_for_status() method raises requests.HTTPError when the status code indicates a client or server error (4xx or 5xx), so you can catch that error with try/except.
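The following sketch checks which status codes make raise_for_status() raise. It builds Response objects by hand purely for illustration; in real code they come from requests.get() or requests.post(). The status_is_error helper is a hypothetical name.

```python
import requests


def status_is_error(status_code):
    """Report whether raise_for_status() raises for the given status code."""
    # Hand-built Response, for illustration only.
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.HTTPError:
        return True
    return False


for code in (200, 204, 404, 500):
    print(code, status_is_error(code))
```

Because requests.HTTPError and requests.Timeout both derive from requests.RequestException, a final except requests.RequestException clause can catch any remaining request failure.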

2.6 Downloading Binary Files

Files such as images, audio, and video are fundamentally composed of binary data with specific storage formats and corresponding decoding methods. To scrape these files, retrieve their binary data through response.content and write it to a file opened in binary mode.
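For small files, writing response.content to a file opened with "wb" is enough. For larger files, a streaming download avoids holding the whole body in memory. The sketch below takes the streaming approach; the helper names save_response and download_file, the chunk size, and the timeout are assumptions for this example.

```python
import requests


def save_response(response, path, chunk_size=8192):
    """Write the body of a streamed Response to path and return the byte count."""
    written = 0
    with open(path, "wb") as fh:  # "wb": binary mode, no text decoding
        for chunk in response.iter_content(chunk_size=chunk_size):
            fh.write(chunk)
            written += len(chunk)
    return written


def download_file(url, path, timeout=10):
    """Stream a file from url to path; raises requests.HTTPError on 4xx/5xx."""
    # stream=True defers downloading the body until it is iterated over.
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        return save_response(response, path)
```

A call such as download_file("https://example.com/logo.png", "logo.png") would save the image next to the script; the URL here is a placeholder.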
