
Data Extraction in Scrapy: Targeted Parsing and Broad Extraction Patterns


Targeted HTML Extraction with XPath

Scrapy provides built-in selectors that allow precise targeting of DOM elements using XPath or CSS expressions. XPath is particularly effective for navigating complex nested structures or locating nodes based on specific attributes.

import scrapy

class BlogScraper(scrapy.Spider):
    name = 'blog_data_extractor'
    start_urls = ['https://blog-site.org']

    def parse(self, response):
        for entry in response.xpath('//article[contains(@class, "entry")]'):
            yield {
                'headline': entry.xpath('.//h2/text()').get(),
                'writer': entry.xpath('.//span[@id="creator"]/text()').get(),
                'feedback_count': entry.xpath('.//div[@class="interaction"]/text()').get(),
            }

The .xpath() method selects the matching DOM nodes, and calling it on entry with a leading .// keeps each query relative to that article. .get() returns the first match as a string, or None when nothing matches.

Broad Extraction Strategies with CSS Selectors

When crawling multiple heterogeneous domains, relying on rigid structural selectors becomes impractical. Broad extraction relies on universal HTML elements that exist across most web pages, regardless of their specific layout.

import scrapy

class UniversalTextScraper(scrapy.Spider):
    name = 'multi_domain_text'
    start_urls = ['https://domain-a.com', 'https://domain-b.net', 'https://domain-c.org']

    def parse(self, response):
        text_blocks = response.css('p::text').getall()
        for text in text_blocks:
            yield {'text_content': text}
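One caveat with p::text is that it returns only the text nodes sitting directly inside each paragraph, so the results often include whitespace-only strings and fragments split around inline tags. A small cleanup step, sketched here with a hypothetical helper named clean_text_blocks, could be applied to the list before yielding:

```python
import re


def clean_text_blocks(blocks):
    """Collapse runs of whitespace and drop blocks that end up empty."""
    cleaned = []
    for block in blocks:
        text = re.sub(r"\s+", " ", block).strip()
        if text:
            cleaned.append(text)
    return cleaned


# Sample input resembling what .getall() might return (made-up values).
raw_blocks = ["  First paragraph.\n", "\n   ", "Second\tparagraph,  split   oddly. "]
cleaned_blocks = clean_text_blocks(raw_blocks)
```

In the spider above, this would mean iterating over clean_text_blocks(text_blocks) instead of text_blocks.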

By targeting common elements like paragraph tags, the scraper adapts to diverse layouts without requiring domain-specific logic. To harvest links in the same layout-independent way, anchor tags and their href attributes provide a reliable pattern:

import scrapy

class HrefExtractor(scrapy.Spider):
    name = 'absolute_url_fetcher'
    start_urls = ['https://sample-site.io']

    def parse(self, response):
        relative_paths = response.css('a::attr(href)').getall()
        for path in relative_paths:
            yield {'absolute_path': response.urljoin(path)}

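The .css() call collects every href value on the page, and response.urljoin() resolves each one against the page URL, so relative links come out as absolute URLs. Non-HTTP values such as mailto: or javascript: links pass through unchanged, so filter them out if only crawlable URLs are wanted.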
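The resolution step can be tried without running a crawl. This sketch uses the standard library's urllib.parse.urljoin as a stand-in for response.urljoin() (to my understanding the Scrapy method resolves against the response URL in the same way); the function name resolve_links, the page URL, and the sample hrefs are assumptions for illustration.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(page_url, hrefs):
    """Resolve hrefs against page_url, keep http(s) links, drop duplicates."""
    seen = set()
    resolved = []
    for href in hrefs:
        absolute = urljoin(page_url, href.strip())
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, javascript: and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


# Made-up hrefs of the kind a.css('a::attr(href)').getall() might return.
sample_hrefs = [
    "/about",
    "next",
    "https://other.example/x",
    "mailto:me@sample-site.io",
    "javascript:void(0)",
    "/about",
]
links = resolve_links("https://sample-site.io/blog/post", sample_hrefs)
```

Deduplicating while preserving order keeps the output stable between runs, which helps when comparing crawls of the same page.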
