Fading Coder

One Final Commit for the Last Sprint

Python Web Scraping Essentials: Building a Recipe Discovery Tool from Scratch

Web scraping automates the manual workflow of browsing: transmitting HTTP requests to retrieve documents, navigating link structures, and extracting specific data points from the response. A scraper mimics browser behavior programmatically, enabling automated collection of structured information from HTML documents.

Fetching HTML Documents

The initial step involves executing an HTTP GET request to retrieve the raw HTML content. While numerous HTTP libraries exist for Python, the standard library's urllib module provides sufficient functionality without external dependencies.

import urllib.request
import urllib.parse

BASE_URL = "https://www.meishij.net/?from=space_block"
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'
}

def retrieve_html(target_url):
    """Fetch and decode HTML content from a URL."""
    http_request = urllib.request.Request(target_url, headers=BROWSER_HEADERS)
    with urllib.request.urlopen(http_request) as response:
        return response.read().decode('utf-8')

raw_markup = retrieve_html(BASE_URL)
print(raw_markup[:1000])  # Display first 1000 characters

This returns the complete HTML source, identical to viewing the page source in browser developer tools.

Parsing Document Structure

Raw HTML requires parsing to extract meaningful data. While regular expressions or string manipulation could work, specialized parsing libraries provide robust DOM traversal. BeautifulSoup handles malformed markup gracefully and offers intuitive selection methods.
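To see that tolerance in action, the toy sketch below (a standalone example, not tied to the target site) parses a fragment whose tags are never closed; BeautifulSoup closes them implicitly when building the tree:

```python
from bs4 import BeautifulSoup

# Malformed fragment: neither <p> nor <b> is ever closed
broken_html = "<p>Braised pork ribs <b>serves 4"

soup = BeautifulSoup(broken_html, 'html.parser')

# The parser closes the dangling tags at the end of the document,
# so normal traversal still works
print(soup.b.get_text())  # text of the unclosed <b> element
print(soup.p.get_text())  # text of the whole <p>, including the nested <b>
```

The built-in 'html.parser' backend needs no extra installation; if the lxml package is available, passing 'lxml' instead is generally faster.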

Extracting Trending Dishes

To identify currently popular recipes, we target specific CSS classes within the document structure:

from bs4 import BeautifulSoup

def parse_trending_dishes(html_content):
    """Extract trending recipe titles from the homepage."""
    dom_tree = BeautifulSoup(html_content, 'html.parser')
    trending_container = dom_tree.select('a.sancan_item')
    
    dish_names = []
    for card in trending_container:
        title_tags = card.select('strong.title')
        for tag in title_tags:
            dish_names.append(tag.get_text(strip=True))
    return dish_names

popular_recipes = parse_trending_dishes(raw_markup)
for dish in popular_recipes:
    print(dish)

The process involves visually inspecting the HTML structure (via browser dev tools), identifying the CSS selectors that target desired elements, and using BeautifulSoup's select() method to extract text content.
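That workflow can be rehearsed offline. The self-contained example below uses markup shaped like the recipe cards above (the class names mirror the real selectors; the dish names are made up for illustration):

```python
from bs4 import BeautifulSoup

# A simplified stand-in for the homepage's recipe-card markup
sample = """
<div class="list">
  <a class="sancan_item" href="/recipe/1"><strong class="title">Kung Pao Chicken</strong></a>
  <a class="sancan_item" href="/recipe/2"><strong class="title">Mapo Tofu</strong></a>
</div>
"""
soup = BeautifulSoup(sample, 'html.parser')

# Tag plus class, exactly as in parse_trending_dishes
links = soup.select('a.sancan_item')

# Descendant combinator: <strong class="title"> anywhere under a matching <a>
titles = [t.get_text(strip=True) for t in soup.select('a.sancan_item strong.title')]
print(len(links), titles)
```

Any selector copied from the browser's "Copy selector" context-menu entry can be passed to select() the same way, though hand-trimmed selectors like these tend to survive page redesigns better.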

Randomized Meal Selection

Decision fatigue often complicates meal planning. By aggregating recipes across multiple pagination pages and applying random selection, we can automate the "what to eat" decision:

import random

def gather_recipe_pool(pages_to_scan=3):
    """Collect recipe names from multiple pagination pages."""
    recipe_inventory = []
    
    for page_index in range(1, pages_to_scan + 1):
        paginated_endpoint = f"https://www.meishij.net/chufang/diy/jiangchangcaipu/?&page={page_index}"
        page_markup = retrieve_html(paginated_endpoint)
        parsed_page = BeautifulSoup(page_markup, 'html.parser')
        
        image_tags = parsed_page.find_all('img')
        for img in image_tags:
            alt_text = img.get('alt')
            if alt_text:
                recipe_inventory.append(alt_text)
    
    return recipe_inventory

candidates = gather_recipe_pool()
selected_meal = random.choice(candidates)
print(f"Tonight's selection: {selected_meal}")

Retrieving Cooking Procedures

Once a dish is selected, fetching detailed cooking instructions requires navigating through search results to the specific recipe page:

def fetch_cooking_steps(dish_query):
    """Retrieve step-by-step cooking instructions for a specific dish."""
    # Construct search URL with proper encoding
    encoded_query = urllib.parse.quote(dish_query)
    search_url = f"https://so.meishij.net/index.php?q={encoded_query}"
    
    search_results = retrieve_html(search_url)
    search_soup = BeautifulSoup(search_results, 'html.parser')
    
    # Extract first result link
    result_links = search_soup.select('a.img')
    if not result_links:
        return []
    
    recipe_detail_url = result_links[0].get('href')
    detail_markup = retrieve_html(recipe_detail_url)
    detail_soup = BeautifulSoup(detail_markup, 'html.parser')
    
    # Extract instruction steps
    step_containers = detail_soup.select('div.step_content p')
    instructions = [step.get_text(strip=True) for step in step_containers]
    return instructions

steps = fetch_cooking_steps("红烧排骨")
for idx, instruction in enumerate(steps, 1):
    print(f"{idx}. {instruction}")

Building an Interactive Interface

Consolidating these capabilities into a reusable tool requires wrapping the logic in an interactive command-line interface. This implementation uses colorized output to improve readability and keyboard-driven navigation:

import os
import random
from colorama import init, Fore, Style
from readchar import readkey

class RecipeDiscoveryApp:
    def __init__(self):
        self.trending_cache = []
        self.recipe_database = []
        self.current_dish = None
        self.palette = [Fore.GREEN, Fore.YELLOW, Fore.BLUE, Fore.CYAN, Fore.MAGENTA, Fore.RED]
        init(autoreset=True)
    
    def clear_display(self):
        os.system('cls' if os.name == 'nt' else 'clear')
    
    def colorize(self, text):
        return f"{random.choice(self.palette)}{text}{Style.RESET_ALL}"
    
    def load_trending(self):
        if not self.trending_cache:
            html = retrieve_html(BASE_URL)
            self.trending_cache = parse_trending_dishes(html)
    
    def populate_database(self, page_limit=3):
        if not self.recipe_database:
            self.recipe_database = gather_recipe_pool(page_limit)
    
    def render_menu(self, options_list):
        self.clear_display()
        for index, item in enumerate(options_list[:8]):
            print(self.colorize(f"[{index}] {item}"))
        print(self.colorize("[R] Random Selection"))
        print(self.colorize("[I] Instructions"))
        print(self.colorize("[C] Clear Screen"))
        print(self.colorize("[Q] Quit"))
    
    def run(self):
        self.load_trending()
        random_subset = random.sample(self.trending_cache, min(8, len(self.trending_cache)))
        
        print(self.colorize("Initializing Recipe Discovery System..."))
        self.render_menu(random_subset)
        
        while True:
            key = readkey().lower()
            
            if key == 'q':
                break
            elif key == 'c':
                self.clear_display()
            elif key == 'r':
                self.populate_database()
                self.current_dish = random.choice(self.recipe_database)
                print(self.colorize(f"\nRandomly selected: {self.current_dish}"))
            elif key == 'i' and self.current_dish:
                print(self.colorize(f"\nFetching instructions for {self.current_dish}..."))
                steps = fetch_cooking_steps(self.current_dish)
                for i, step in enumerate(steps, 1):
                    print(self.colorize(f"{i}. {step}"))
            elif key.isdigit() and int(key) < len(random_subset):
                self.current_dish = random_subset[int(key)]
                print(self.colorize(f"\nSelected: {self.current_dish}"))

if __name__ == "__main__":
    app = RecipeDiscoveryApp()
    app.run()

Ethical and Technical Considerations

When implementing web scraping solutions, respect robots.txt directives and implement reasonable rate limiting to avoid overwhelming target servers. Many sites employ anti-scraping measures including CAPTCHA challenges, session validation, and IP blocking. These protection mechanisms indicate the site owner's intent to restrict automated access.
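The standard library's urllib.robotparser can evaluate robots.txt rules before any page is fetched. The sketch below parses the rules locally; the policy shown is hypothetical, not the actual robots.txt of meishij.net or any other site:

```python
import urllib.robotparser

# Hypothetical robots.txt content, parsed locally for illustration
robots_rules = """
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_rules.splitlines())

# Check individual URLs against the parsed rules
print(parser.can_fetch("*", "https://example.com/chufang/diy/"))  # allowed
print(parser.can_fetch("*", "https://example.com/admin/users"))   # disallowed
print(parser.crawl_delay("*"))  # requested delay between requests, in seconds
```

In a real crawler you would call set_url() with the site's /robots.txt address and read() to download it, then consult can_fetch() before each request.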

Start with public information repositories and news aggregation sites, which typically present fewer technical barriers. Avoid attempting to extract sensitive financial data, personal user information, or proprietary content behind authentication barriers. The technical complexity of scraping often scales directly with the sensitivity of the target data, as robust protection mechanisms generally guard high-value information.

For production implementations, consider utilizing official APIs when available, implementing proper request throttling, and maintaining compliance with data protection regulations and terms of service.
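One minimal way to add throttling to the retrieve_html pattern used throughout this article is a module-level timer; REQUEST_INTERVAL below is an arbitrary illustrative value, not a requirement published by any site:

```python
import time
import urllib.request

REQUEST_INTERVAL = 1.0  # seconds between requests; an example value, tune per site
_last_request_time = 0.0

def throttle():
    """Sleep just long enough to keep requests REQUEST_INTERVAL apart."""
    global _last_request_time
    wait = REQUEST_INTERVAL - (time.monotonic() - _last_request_time)
    if wait > 0:
        time.sleep(wait)
    _last_request_time = time.monotonic()

def polite_fetch(url, headers=None):
    """Throttled variant of the retrieve_html helper from earlier."""
    throttle()
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8')
```

Because every fetch passes through throttle(), back-to-back calls are automatically spaced out without each call site needing its own sleep.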
