Python Web Scraping Tutorial: From Fundamentals to Practical Bilibili Video Metadata Extraction
What Are Web Crawlers
Web crawlers are automated scripts designed to systematically navigate public web pages, retrieve structured and unstructured data, and aggregate information for downstream analysis. All crawler operations must adhere to the target site's robots.txt rules, rate limiting requirements, and data privacy regulations to avoid excessive server load or unauthorized data collection.
The standard workflow for a web crawler follows these core steps:
- Seed URL initialization: Crawlers start execution from a pre-defined set of entry URLs, typically high-authority pages or domain homepages.
- Page content retrieval: The crawler sends standard HTTP/HTTPS requests to target URLs, downloading response data which may be HTML, XML, JSON, or binary assets.
- Content parsing: Raw response data is processed to extract target fields, using tools ranging from regex matches to dedicated HTML parsers and NLP models for unstructured text.
- Link discovery: Parsed page content is scanned for embedded hyperlinks, which are added to the crawl queue if they fall within the target scope.
- Data persistence: Extracted valid data is written to databases, flat files, or data warehouses for later querying and analysis.
- Duplication elimination: A dedicated tracking layer ensures each unique URL is only processed once, to avoid redundant resource consumption and duplicate data.
- Task scheduling: For large-scale crawl operations, a scheduler module distributes tasks across multiple worker instances, balances load, and handles retries for failed requests.
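The steps above can be sketched as a minimal breadth-first crawl loop. This is a simplified illustration, not a production scheduler; the fetch and parse_links callables are hypothetical placeholders you would supply:

```python
from collections import deque

def crawl(seed_urls, fetch, parse_links, max_pages=100):
    """Minimal breadth-first crawl loop.

    fetch(url) returns page content; parse_links(content) yields URLs.
    Both are caller-supplied placeholders in this sketch.
    """
    queue = deque(seed_urls)   # task scheduling: FIFO crawl queue
    visited = set(seed_urls)   # duplication elimination: each URL once
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        content = fetch(url)           # page content retrieval
        results[url] = content         # data persistence (in memory here)
        for link in parse_links(content):   # link discovery
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return results
```

A real crawler would add scope filtering, rate limiting, and retry handling around this skeleton, but the queue-plus-visited-set structure stays the same.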
Basic Crawl Implementation
First install the network request library requests:
pip install requests
Basic Request Example
This sample sends a GET request to a public practice site for scrapers and prints the returned page content:
import requests
# Send GET request to test crawl target
page_resp = requests.get("http://books.toscrape.com/")
# Validate request success
if page_resp.status_code == 200:
    print(page_resp.text)
else:
    print(f"Request failed with status code: {page_resp.status_code}")
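In practice you will also want a timeout and explicit error handling around requests.get, so a hung or failing server does not stall the crawler. A minimal sketch, where the fetch_page helper and its session parameter are illustrative conventions (not part of the requests API itself):

```python
import requests

def fetch_page(url, session=None, timeout=10):
    """Fetch a page with a timeout and explicit error handling.

    Returns the body text on success, or None on any network or HTTP
    error. `session` defaults to the requests module itself so a
    test double or requests.Session can be injected.
    """
    http = session or requests
    try:
        resp = http.get(url, timeout=timeout)
        resp.raise_for_status()   # turn 4xx/5xx responses into exceptions
        return resp.text
    except requests.RequestException:
        return None
```

requests.RequestException is the base class of the library's network and HTTP errors, so one except clause covers timeouts, connection failures, and bad status codes alike.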
Mimicking Browser Requests with Custom Headers
Many web services implement bot detection logic that blocks requests carrying the default User-Agent strings of scripting libraries. You can check whether direct requests are allowed by inspecting the response status code: a 403 or 503 response (or sometimes a masked 404) for a page that loads normally in a browser is a strong sign of bot blocking.
To bypass basic detection, add a valid browser User-Agent to your request headers. You can retrieve your own browser's User-Agent by opening the browser dev tools (F12 or right-click > Inspect), navigating to the Network tab, selecting any active request, and copying the User-Agent value from the request headers section.
Browser-Mimicking Request Example
import requests
# Define request headers with valid browser User-Agent
req_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
# Send request to Bilibili homepage
bilibili_resp = requests.get("https://www.bilibili.com/", headers=req_headers)
if bilibili_resp.ok:
    print(bilibili_resp.text)
else:
    print("Crawl request rejected by target server")
Parsing HTML Content with Beautiful Soup
Raw HTML response data contains large amounts of irrelevant markup, so we use HTML parsing libraries to extract target fields efficiently. Install the Beautiful Soup 4 library first:
pip install beautifulsoup4
Basic Parsing Example
import requests
from bs4 import BeautifulSoup
req_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
# Retrieve raw HTML content
raw_html = requests.get("https://www.bilibili.com/", headers=req_headers).text
# Parse HTML with built-in html.parser
parsed_dom = BeautifulSoup(raw_html, "html.parser")
# Extract all h3 heading elements
heading_tags = parsed_dom.find_all("h3")
# Iterate through headings to extract embedded link text
for heading in heading_tags:
    link_element = heading.find("a")
    if link_element and link_element.string:
        print(link_element.string.strip())
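Beautiful Soup also supports CSS selectors through select(), which is often more concise than chained find_all() calls. A small offline sketch using a made-up HTML snippet, so the selector syntax can be shown without a network request:

```python
from bs4 import BeautifulSoup

# A small local snippet standing in for a real page.
sample_html = """
<div>
  <h3 class="title"><a href="/video/1">First video</a></h3>
  <h3 class="title"><a href="/video/2">Second video</a></h3>
</div>
"""

dom = BeautifulSoup(sample_html, "html.parser")

# "h3 a" is a CSS selector: every <a> nested inside an <h3>
titles = [a.get_text(strip=True) for a in dom.select("h3 a")]
links = [a["href"] for a in dom.select("h3 a")]
print(titles)  # ['First video', 'Second video']
print(links)   # ['/video/1', '/video/2']
```

The same selectors you test in the browser dev tools console can usually be pasted into select() unchanged.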
Practical Use Case: Extract Bilibili Video Metadata
You can use the element picker in browser dev tools (the cursor icon in the top left of the dev tools panel) to quickly locate the HTML structure of target elements. For Bilibili homepage video cards, each entry is wrapped in a div with class bili-video-card__wrap __scale-wrap. Video links are stored in the href attribute of the child a tag, and cover image URLs are stored in the img tag inside the child picture element with class v-img bili-video-card__cover.
Complete Extraction Code
import requests
from bs4 import BeautifulSoup
req_headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
raw_html = requests.get("https://www.bilibili.com/", headers=req_headers).text
parsed_dom = BeautifulSoup(raw_html, "html.parser")
# Retrieve all video card elements
video_card_list = parsed_dom.find_all("div", class_="bili-video-card__wrap __scale-wrap")
for card in video_card_list:
    # Extract and format the video link, skipping cards without an <a> tag
    link_tag = card.a
    if link_tag is None:
        continue
    video_link = link_tag.get("href", "No link available")
    full_video_link = f"https:{video_link}" if video_link.startswith("//") else video_link
    print(f"Video URL: {full_video_link}")
    # Extract cover image URL
    cover_container = card.find("picture", class_="v-img bili-video-card__cover")
    if cover_container and cover_container.img:
        cover_url = cover_container.img.get("src", "No cover available")
        print(f"Cover Image URL: {cover_url}\n")
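To keep the extracted metadata beyond a single run (the data persistence step from the workflow above), you can write it to a CSV file with the standard library. A minimal sketch; the save_metadata helper and its field names are illustrative choices, not a fixed convention:

```python
import csv

def save_metadata(rows, path):
    """Write extracted video metadata to a CSV file.

    `rows` is a list of dicts with 'video_url' and 'cover_url' keys,
    matching the fields collected in the extraction loop above.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["video_url", "cover_url"])
        writer.writeheader()
        writer.writerows(rows)
```

Inside the extraction loop, you would append a dict per card to a list instead of printing, then call save_metadata once at the end.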