Fading Coder

One Final Commit for the Last Sprint


Data Extraction in Scrapy: Targeted Parsing and Broad Extraction Patterns


Targeted HTML Extraction with XPath

Scrapy provides built-in selectors that allow precise targeting of DOM elements using XPath or CSS expressions. XPath is particularly effective for navigating complex nested structures or locating nodes based on specific attributes.

import scrapy

class BlogScraper(scrapy.Spider):
    name = 'blog_data_extractor'
    start_urls = ['https://blog-site.org']

    def parse(self, response):
        for entry in response.xpath('//article[contains(@class, "entry")]'): 
            yield {
                'headline': entry.xpath('.//h2/text()').get(),
                'writer': entry.xpath('.//span[@id="creator"]/text()').get(),
                'feedback_count': entry.xpath('.//div[@class="interaction"]/text()').get(),
            }

The .xpath() method selects specific DOM nodes, while .get() returns the first matching value within that scope, or None when nothing matches.

Broad Extraction Strategies with CSS Selectors

When crawling multiple heterogeneous domains, relying on rigid structural selectors becomes impractical. Broad extraction relies on universal HTML elements that exist across most web pages, regardless of their specific layout.

import scrapy

class UniversalTextScraper(scrapy.Spider):
    name = 'multi_domain_text'
    start_urls = ['https://domain-a.com', 'https://domain-b.net', 'https://domain-c.org']

    def parse(self, response):
        text_blocks = response.css('p::text').getall()
        for text in text_blocks:
            yield {'text_content': text}

By targeting common elements like paragraph tags, the scraper adapts to diverse layouts without requiring domain-specific logic. To harvest links universally, anchor tags and their href attributes provide a reliable pattern:

import scrapy

class HrefExtractor(scrapy.Spider):
    name = 'absolute_url_fetcher'
    start_urls = ['https://sample-site.io']

    def parse(self, response):
        relative_paths = response.css('a::attr(href)').getall()
        for path in relative_paths:
            yield {'absolute_path': response.urljoin(path)}

The ::attr(href) pseudo-element captures every anchor's href value, and response.urljoin() resolves relative paths into fully qualified URLs, so the extraction works regardless of how a site structures its links.
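response.urljoin() follows the standard URL resolution rules and behaves like urllib.parse.urljoin from the standard library, so its handling of the different href forms can be sketched standalone (the URLs here are invented for illustration):

```python
from urllib.parse import urljoin

# An invented page URL to resolve hrefs against.
page = 'https://sample-site.io/blog/index.html'

# Relative paths resolve against the page's directory.
print(urljoin(page, 'post-1.html'))  # https://sample-site.io/blog/post-1.html

# Root-relative paths replace everything after the host.
print(urljoin(page, '/about'))  # https://sample-site.io/about

# Absolute URLs pass through unchanged.
print(urljoin(page, 'https://other.net/x'))  # https://other.net/x
```

This is why the spider above can yield every extracted path through urljoin() unconditionally: already-absolute links are left intact while relative ones are completed.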

Tags: scrapy
