Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

Building a Multi-threaded Novel Scraper with Python

Tech 1

A straightforward approach to extracting novel content from websites using Python's requests library and lxml for HTML parsing, with multi-threaded download capabilities.

Core Configuration

stop_flag = False
worker_threads = 5
running_state = False
thread_lock = threading.Lock()

Data Model

class Novel:
    title = None
    content = None
    writer = None
    source = None

Thread Worker Function

Each worker thread continuously retrieves chapter URLs from the queue and processes them until the queue is empty or the user interrupts the operation.

def worker():
    global url_queue, status_display, execution_status
    
    while url_queue:
        if stop_flag:
            status_display.insert(END, f"{threading.current_thread().name} terminated due to user cancellation.\n")
            status_display.yview_moveto(1)
            execution_status = False
            return
        
        thread_lock.acquire()
        chapter_url = url_queue.pop()
        thread_lock.release()
        
        fetch_chapter(chapter_url)
    
    status_display.insert(END, f"{threading.current_thread().name} finished processing.\n")
    status_display.yview_moveto(1)
    execution_status = False

Chapter Download Function

The URL parameter contains a pre-processed single chapter link stored in a collection. After fetching the page, parse the HTML and save the content to disk.

def fetch_chapter(chapter_url):
    global progress_label, progress_bar, progress_value, app_window, save_path
    
    client = requests.session()
    client.keep_alive = False
    
    response = client.get(chapter_url)
    tree = etree.HTML(response.text)
    
    novel_title = tree.xpath("/html/body/div[2]/div[3]/div[2]/a[3]/text()")
    chapter_title = tree.xpath("/html/body/div[2]/div[3]/div[3]/div/div[1]/div[2]/div[2]/text()")
    paragraphs = tree.xpath("/html/body/div[2]/div[3]/div[3]/div/div[1]/div[5]/p/text()")
    
    formatted_content = '\n'.join(paragraphs)
    
    output_file = f"{save_path}\\{novel_title}{chapter_title}.txt"
    with open(output_file, "a+", encoding="utf-8") as f:
        f.write(formatted_content)
    
    progress_value += 1
    progress_bar['value'] = progress_value
    app_window.update()
    
    print(formatted_content)
    status_display.insert(END, f"{threading.current_thread().name} completed: {novel_title}{chapter_title}\n")
    status_display.yview_moveto(1)
    
    client.close()

Initialization Logic

def start_crawl():
    global running_state, stop_flag, url_queue
    
    if running_state:
        return
    
    url_queue = list(catalogue_urls)  # Populate from UI
    running_state = True
    stop_flag = False
    
    threads = []
    for _ in range(worker_threads):
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)

Stopping Mechenism

def halt_crawl():
    global stop_flag
    stop_flag = True

The implementation demonstrates several essential patterns: thread-safe queue management using locks to prevent duplicate downloads, session reuse with connection pooling disabled for efficiency, XPath-based HTML traversal for extracting structured content, and progress tracking through UI updates.

Additional considerations for production use include implementing retry logic for failed requests, adding rate limiting to avoid server strain, and incorporating proper errer handling for malformed HTML structures.

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

SBUS Signal Analysis and Communication Implementation Using STM32 with Fus Remote Controller

Overview In a recent project, I utilized the SBUS protocol with the Fus remote controller to control a vehicle's basic operations, including movement, lights, and mode switching. This article is aimed...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.