Fading Coder

One Final Commit for the Last Sprint


Python Web Scraping Tutorial: From Fundamentals to Practical Bilibili Video Metadata Extraction


What Are Web Crawlers

Web crawlers are automated scripts designed to systematically navigate public web pages, retrieve structured and unstructured data, and aggregate information for downstream analysis. All crawler operations must adhere to the target site's robots.txt rules, rate limiting requirements, and data privacy regulations to avoid excessive server load or unauthorized data collection.
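As a quick compliance check, Python's standard library ships `urllib.robotparser`, which can answer whether a given path is allowed for your crawler. The robots.txt body below is a made-up example for illustration; a real crawler would fetch the target site's `/robots.txt` instead:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt body for illustration only
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("MyCrawler/1.0", "https://example.com/public/page"))   # True
print(parser.can_fetch("MyCrawler/1.0", "https://example.com/private/data"))  # False
print(parser.crawl_delay("MyCrawler/1.0"))  # 5
```

The `crawl_delay` value tells you how many seconds to sleep between requests to stay within the site's stated rate limit.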

The standard workflow for a web crawler follows these core steps:

  1. Seed URL initialization: Crawlers start execution from a pre-defined set of entry URLs, typically high-authority pages or domain homepages.
  2. Page content retrieval: The crawler sends standard HTTP/HTTPS requests to target URLs, downloading response data which may be HTML, XML, JSON, or binary assets.
  3. Content parsing: Raw response data is processed to extract target fields, using tools ranging from regex matches to dedicated HTML parsers and NLP models for unstructured text.
  4. Link discovery: Parsed page content is scanned for embedded hyperlinks, which are added to the crawl queue if they fall within the target scope.
  5. Data persistence: Extracted valid data is written to databases, flat files, or data warehouses for later querying and analysis.
  6. Duplication elimination: A dedicated tracking layer ensures each unique URL is only processed once, to avoid redundant resource consumption and duplicate data.
  7. Task scheduling: For large-scale crawl operations, a scheduler module distributes tasks across multiple worker instances, balances load, and handles retries for failed requests.
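The steps above can be sketched as a minimal crawl loop: a FIFO queue for scheduling, a visited set for duplication elimination, and link discovery feeding the queue. To keep the sketch runnable without touching a real site, the fetch step is replaced with a hypothetical in-memory link map:

```python
from collections import deque

# Stand-in for real HTTP fetching: a made-up mini-web mapping
# each URL to its outgoing links, so the loop runs offline.
FAKE_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/"],
}

def crawl(seed_urls, max_pages=10):
    queue = deque(seed_urls)  # step 1: seed URL initialization (FIFO scheduling)
    visited = set()           # step 6: duplication elimination
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = FAKE_WEB.get(url, [])  # steps 2-4: retrieval, parsing, link discovery
        order.append(url)
        for link in links:
            if link not in visited:
                queue.append(link)
    return order

print(crawl(["https://example.com/"]))
# ['https://example.com/', 'https://example.com/a', 'https://example.com/b']
```

A production crawler swaps the fake map for real requests, adds persistence, and distributes the queue across workers, but the control flow stays the same.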

Basic Crawl Implementation

First install the network request library requests:

pip install requests

Basic Request Example

This sample sends a GET request to a public scraping practice site and prints the returned page content:

import requests

# Send GET request to test crawl target
page_resp = requests.get("http://books.toscrape.com/")

# Validate request success
if page_resp.status_code == 200:
    print(page_resp.text)
else:
    print(f"Request failed with status code: {page_resp.status_code}")
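In real crawls it also pays to set a timeout and catch network errors, so a single failed request doesn't hang or crash the whole run. A minimal sketch (the `fetch_page` helper and its defaults are our own choices, not part of requests):

```python
import requests

def fetch_page(url, timeout=10):
    """Fetch a URL with a timeout and basic error handling (a sketch)."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # treat 4xx/5xx responses as failures too
        return resp.text
    except requests.RequestException as exc:
        # Covers timeouts, connection errors, and bad status codes alike
        print(f"Request failed for {url}: {exc}")
        return None

# Usage: html = fetch_page("http://books.toscrape.com/")
```

Returning `None` on failure lets the calling loop skip the page and continue instead of aborting the crawl.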

Mimicking Browser Requests with Custom Headers

Many web services implement bot detection logic that blocks requests whose User-Agent strings identify common scripting libraries. A quick way to check whether direct requests are allowed is to compare status codes: if a page that loads normally in a browser returns 403, 404, or 503 to your script, the server is likely rejecting automated clients.

To bypass basic detection, add a valid browser User-Agent to your request headers. You can retrieve your own browser's User-Agent by opening the browser dev tools (F12 or right-click > Inspect), navigating to the Network tab, selecting any active request, and copying the User-Agent value from the request headers section.

Browser-Mimicking Request Example

import requests

# Define request headers with valid browser User-Agent
req_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}

# Send request to Bilibili homepage
bilibili_resp = requests.get("https://www.bilibili.com/", headers=req_headers)

if bilibili_resp.ok:
    print(bilibili_resp.text)
else:
    print("Crawl request rejected by target server")
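When sending many requests, a `requests.Session` avoids repeating the headers on every call: headers registered on the session are sent with every request made through it, and the session also reuses the underlying connection. A small sketch (the Accept-Language value is just an illustrative extra header, not required by Bilibili):

```python
import requests

session = requests.Session()
# Headers set on the session apply to all requests made through it
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9",  # illustrative extra header
})

# Every call through the session now carries these headers, e.g.:
# resp = session.get("https://www.bilibili.com/")
print(session.headers["User-Agent"].startswith("Mozilla/5.0"))  # True
```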

Parsing HTML Content with Beautiful Soup

Raw HTML response data contains large amounts of irrelevant markup, so we use HTML parsing libraries to extract target fields efficiently. Install the Beautiful Soup 4 library first:

pip install beautifulsoup4

Basic Parsing Example

import requests
from bs4 import BeautifulSoup

req_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}

# Retrieve raw HTML content
raw_html = requests.get("https://www.bilibili.com/", headers=req_headers).text

# Parse HTML with built-in html.parser
parsed_dom = BeautifulSoup(raw_html, "html.parser")

# Extract all h3 heading elements
heading_tags = parsed_dom.find_all("h3")

# Iterate through headings to extract embedded link text
for heading in heading_tags:
    link_element = heading.find("a")
    if link_element and link_element.string:
        print(link_element.string.strip())
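Because the Bilibili homepage markup changes over time, here is the same `find_all` pattern run against a tiny hand-written HTML fragment, so the parsing behavior can be seen deterministically (the fragment and its contents are invented for illustration):

```python
from bs4 import BeautifulSoup

# A hypothetical HTML fragment mimicking the structure parsed above
sample_html = """
<div>
  <h3><a href="/video/1">First Video</a></h3>
  <h3><a href="/video/2">Second Video</a></h3>
  <h3>No link here</h3>
</div>
"""

dom = BeautifulSoup(sample_html, "html.parser")
# Keep only headings that actually contain a link, as in the loop above
titles = [h3.a.string for h3 in dom.find_all("h3") if h3.a]
print(titles)  # ['First Video', 'Second Video']
```

Note that the third `h3` is skipped: `h3.a` is `None` when no `a` tag exists, which is exactly why the guard in the loop above matters.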

Practical Use Case: Extract Bilibili Video Metadata

You can use the element picker in browser dev tools (the cursor icon in the top left of the dev tools panel) to quickly locate the HTML structure of target elements. For Bilibili homepage video cards, each entry is wrapped in a div with class bili-video-card__wrap __scale-wrap. Video links are stored in the href attribute of the child a tag, and cover image URLs are stored in the img tag inside the child picture element with class v-img bili-video-card__cover.

Complete Extraction Code

import requests
from bs4 import BeautifulSoup

req_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}

raw_html = requests.get("https://www.bilibili.com/", headers=req_headers).text
parsed_dom = BeautifulSoup(raw_html, "html.parser")

# Retrieve all video card elements
video_card_list = parsed_dom.find_all("div", class_="bili-video-card__wrap __scale-wrap")

for card in video_card_list:
    # Locate the link element first to avoid an AttributeError on cards without an a tag
    link_element = card.find("a")
    if link_element is None:
        continue

    # Extract and format video link (homepage links are protocol-relative)
    video_link = link_element.get("href", "No link available")
    full_video_link = f"https:{video_link}" if video_link.startswith("//") else video_link
    print(f"Video URL: {full_video_link}")

    # Extract cover image URL
    cover_container = card.find("picture", class_="v-img bili-video-card__cover")
    if cover_container and cover_container.img:
        cover_url = cover_container.img.get("src", "No cover available")
        print(f"Cover Image URL: {cover_url}\n")
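Matching step 5 of the workflow (data persistence), the extracted fields can be written out with the standard csv module. The rows below are hypothetical placeholders for what the extraction loop would produce, and the StringIO buffer stands in for a real file:

```python
import csv
import io

# Hypothetical rows as the extraction loop might produce them
video_rows = [
    {"url": "https://www.bilibili.com/video/BV1xx", "cover": "https://i0.hdslb.com/a.jpg"},
    {"url": "https://www.bilibili.com/video/BV1yy", "cover": "https://i0.hdslb.com/b.jpg"},
]

# Swap the buffer for open("videos.csv", "w", newline="", encoding="utf-8")
# to write an actual file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "cover"])
writer.writeheader()
writer.writerows(video_rows)
print(buffer.getvalue())
```

For larger crawls the same rows would go into a database instead, but CSV is the simplest persistence layer to start with.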
