Data Extraction in Scrapy: Targeted Parsing and Broad Extraction Patterns
Targeted HTML Extraction with XPath
Scrapy provides built-in selectors that allow precise targeting of DOM elements using XPath or CSS expressions. XPath is particularly effective for navigating complex nested structures or locating nodes based on specific attributes.
import scrapy


class BlogScraper(scrapy.Spider):
    name = 'blog_data_extractor'
    start_urls = ['https://blog-site.org']

    def parse(self, response):
        for entry in response.xpath('//article[contains(@class, "entry")]'):
            yield {
                'headline': entry.xpath('.//h2/text()').get(),
                'writer': entry.xpath('.//span[@id="creator"]/text()').get(),
                'feedback_count': entry.xpath('.//div[@class="interaction"]/text()').get(),
            }
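The .xpath() method returns the nodes that match an expression, and the leading .// keeps each inner query relative to the current entry rather than the whole page. The .get() method returns the first match as a string, or None when nothing matches.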
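Because .get() can return None and text nodes often carry stray whitespace, extracted values usually need a small cleanup pass before they are stored. The following is a minimal sketch of that step using only the standard library. The helper names clean_text and parse_count, and the sample values in raw_entry, are hypothetical and not part of Scrapy.

```python
import re
from typing import Optional


def clean_text(value: Optional[str], default: str = "") -> str:
    """Collapse whitespace in an extracted string; use default when value is None."""
    if value is None:
        return default
    return " ".join(value.split())


def parse_count(value: Optional[str]) -> Optional[int]:
    """Return the first run of digits in the text as an int, or None if absent."""
    match = re.search(r"\d+", value or "")
    return int(match.group()) if match else None


# Hypothetical values, shaped like what .get() might return for one entry.
raw_entry = {
    "headline": "\n   Scraping at Scale  ",
    "writer": None,  # the selector matched nothing
    "feedback_count": " 12 comments ",
}

item = {
    "headline": clean_text(raw_entry["headline"]),
    "writer": clean_text(raw_entry["writer"], default="unknown"),
    "feedback_count": parse_count(raw_entry["feedback_count"]),
}
print(item)
# {'headline': 'Scraping at Scale', 'writer': 'unknown', 'feedback_count': 12}
```

In the spider, each .get() call could be wrapped in one of these helpers. Note that parse_count reads only the first run of digits, so a value such as "1,204" would need the separator removed first.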
Broad Extraction Strategies with CSS Selectors
When crawling multiple heterogeneous domains, relying on rigid structural selectors becomes impractical. Broad extraction relies on universal HTML elements that exist across most web pages, regardless of their specific layout.
import scrapy


class UniversalTextScraper(scrapy.Spider):
    name = 'multi_domain_text'
    start_urls = ['https://domain-a.com', 'https://domain-b.net', 'https://domain-c.org']

    def parse(self, response):
        text_blocks = response.css('p::text').getall()
        for text in text_blocks:
            yield {'text_content': text}
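When several domains feed the same output, it can help to tag each text block with the domain it came from and to skip whitespace-only nodes. The sketch below shows one way to do that with the standard library. The function name build_text_items, the source_domain field, and the sample data are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator
from urllib.parse import urlparse


def build_text_items(page_url: str, blocks: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one item per non-empty text block, tagged with its source domain."""
    domain = urlparse(page_url).netloc
    for block in blocks:
        text = " ".join(block.split())  # collapse runs of whitespace
        if text:  # skip nodes that contained only whitespace
            yield {"source_domain": domain, "text_content": text}


# Hypothetical sample standing in for response.url and the getall() result.
sample_blocks = ["  First paragraph.\n", "\n   ", "Second\tparagraph."]
items = list(build_text_items("https://domain-a.com/news/today", sample_blocks))
print(items)
# [{'source_domain': 'domain-a.com', 'text_content': 'First paragraph.'},
#  {'source_domain': 'domain-a.com', 'text_content': 'Second paragraph.'}]
```

Inside parse(), this could be used as: yield from build_text_items(response.url, response.css('p::text').getall()).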
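By targeting common elements like paragraph tags, the scraper adapts to diverse layouts without requiring domain-specific logic. Note that p::text matches only text nodes that are direct children of a paragraph, so text nested in inline elements such as <a> or <em> is not captured. To harvest navigational routes universally, anchor tags and their href attributes provide a reliable pattern: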
import scrapy


class HrefExtractor(scrapy.Spider):
    name = 'absolute_url_fetcher'
    start_urls = ['https://sample-site.io']

    def parse(self, response):
        relative_paths = response.css('a::attr(href)').getall()
        for path in relative_paths:
            yield {'absolute_path': response.urljoin(path)}
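The .css() method captures every href value, and response.urljoin() resolves relative paths against the URL of the current response. Absolute hrefs pass through unchanged, and so do non-HTTP schemes such as mailto: or javascript:, so filter those out if only crawlable URLs are wanted.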
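Resolving a link does not by itself make it crawlable. The sketch below shows one possible post-processing step that keeps only unique HTTP(S) URLs and drops fragments. It uses urllib.parse.urljoin from the standard library as a stand-in for response.urljoin() so that it runs without Scrapy. The function name resolve_links and the sample hrefs are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse


def resolve_links(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against base_url; keep unique http(s) URLs in first-seen order."""
    seen = set()
    resolved = []
    for href in hrefs:
        # Join against the page URL, then drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


# Hypothetical hrefs, as a.getall() might return them for one page.
hrefs = [
    "/about",
    "posts/1#comments",
    "mailto:editor@sample-site.io",
    "javascript:void(0)",
    "https://other.example/page",
    "/about",
]
links = resolve_links("https://sample-site.io/blog/", hrefs)
print(links)
# ['https://sample-site.io/about',
#  'https://sample-site.io/blog/posts/1',
#  'https://other.example/page']
```

Deduplicating inside a single page is a design choice for cleaner output; Scrapy's scheduler applies its own duplicate filtering to requests, which this sketch does not replace.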