Fading Coder

One Final Commit for the Last Sprint


Semantic Data Extraction on the Modern Web: Microformats and HTML Parsing

Tech · May 14

Foundations of Microformats and Semantic Markup

Microformats are lightweight, standards-based conventions that embed structured metadata directly into existing HTML elements. By leveraging familiar attributes such as class and rel, developers can annotate otherwise unstructured content without introducing proprietary markup. This approach maintains backward compatibility with browsers while enabling automated parsers, search engines, and aggregation tools to interpret page semantics accurately.

For instance, a basic hyperlink can be enhanced with a homepage indicator:

<a href="https://example.com">Company Website</a>

Applying microformat conventions transforms it into a semantically rich node:

<a class="url" rel="homepage" href="https://example.com">Company Website</a>

The widely adopted hCard specification maps contact data onto HTML wrappers using vcard parent containers and standardized child classes such as fn (full name), org (organization), tel (telephone), and email. These annotations allow downstream applications to extract relational data reliably.
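As a concrete sketch, those hCard classes can be read back out with BeautifulSoup. The contact details and markup below are invented for illustration; only the class names follow the hCard conventions described above:

```python
from bs4 import BeautifulSoup

# A minimal hCard snippet (hypothetical contact data for illustration).
HCARD_HTML = """
<div class="vcard">
  <span class="fn">Ada Lovelace</span>
  <span class="org">Analytical Engines Ltd.</span>
  <span class="tel">+44 20 7946 0000</span>
  <a class="email" href="mailto:ada@example.com">ada@example.com</a>
</div>
"""

def parse_hcard(html):
    """Extract standard hCard child classes from the first vcard container."""
    card = BeautifulSoup(html, "html.parser").select_one(".vcard")
    if card is None:
        return None
    fields = {}
    for cls in ("fn", "org", "tel", "email"):
        node = card.find(class_=cls)
        if node:
            fields[cls] = node.get_text(strip=True)
    return fields

print(parse_hcard(HCARD_HTML))
```

Because the annotations ride on plain class attributes, the same selector logic works regardless of which tags the publisher chose to wrap the values in.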

Extracting Geospatial Coordinates from Static Pages

Many legacy websites embed location data within invisible or minimally styled elements. Common patterns include concatenated strings separated by semicolons or dedicated latitude/longitude child nodes. The following routine demonstrates a resilient extraction method that adapts to both structures.

import requests
from bs4 import BeautifulSoup

def fetch_geolocation(target_url):
    headers = {"User-Agent": "DataExtractionBot/1.0"}
    response = requests.get(target_url, headers=headers, timeout=10)
    response.raise_for_status()
    
    soup = BeautifulSoup(response.text, "html.parser")
    geo_node = soup.select_one("span.geo")
    if not geo_node:
        return None
        
    raw_text = geo_node.get_text().strip()
    
    # Pattern A: compact "lat; lon" format
    if ";" in raw_text:
        lat_str, lon_str = raw_text.split(";", 1)
        return (lat_str.strip(), lon_str.strip())
        
    # Pattern B: nested <span> children with explicit classes
    lat_el = geo_node.find(class_="latitude")
    lon_el = geo_node.find(class_="longitude")
    if lat_el and lon_el:
        return (lat_el.get_text(strip=True), lon_el.get_text(strip=True))
        
    return None

# Usage example:
coords = fetch_geolocation("https://en.wikipedia.org/wiki/Amsterdam")
print(f"Parsed coordinates: {coords}")
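Because the routine above depends on a live HTTP request, its two parsing branches can also be exercised offline against inline snippets. This sketch duplicates the extraction rules; the coordinates are illustrative values for Amsterdam:

```python
from bs4 import BeautifulSoup

def parse_geo_span(html):
    """Apply the same two-pattern geo extraction to an in-memory snippet."""
    geo_node = BeautifulSoup(html, "html.parser").select_one("span.geo")
    if not geo_node:
        return None
    raw_text = geo_node.get_text().strip()
    # Pattern A: compact "lat; lon" format
    if ";" in raw_text:
        lat_str, lon_str = raw_text.split(";", 1)
        return (lat_str.strip(), lon_str.strip())
    # Pattern B: nested <span> children with explicit classes
    lat_el = geo_node.find(class_="latitude")
    lon_el = geo_node.find(class_="longitude")
    if lat_el and lon_el:
        return (lat_el.get_text(strip=True), lon_el.get_text(strip=True))
    return None

print(parse_geo_span('<span class="geo">52.3676; 4.9041</span>'))
print(parse_geo_span(
    '<span class="geo"><span class="latitude">52.3676</span>'
    '<span class="longitude">4.9041</span></span>'
))
```

Both calls resolve to the same coordinate pair, confirming that the fallback ordering handles either markup style.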

Converting Extracted Points to KML for Map Visualization

Once coordinates are isolated, they can be serialized into Keyhole Markup Language (KML) for integration with geographic platforms. KML relies on hierarchical XML tags rather than inline HTML attributes, which technically places it outside the microformat definition. While microformats require semantics to reside within HTML/CSS structures, KML operates as a standalone document exchange format optimized for spatial rendering.

def generate_kml(coordinates, placename="Extracted Location"):
    lat, lon = coordinates
    kml_template = f'''<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>{placename}</name>
    <Point>
      <coordinates>{lon},{lat},0</coordinates>
    </Point>
  </Placemark>
</kml>'''
    return kml_template

kml_output = generate_kml(coords, "Amsterdam City Center")
print(kml_output[:100] + "...")

To visualize this payload, host the generated string on a publicly accessible endpoint, then pass the resulting URL to a KML overlay layer in mapping libraries. Most GIS frameworks automatically parse the XML tree and render the marker at the specified coordinates.
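Before publishing the payload, it can be sanity-checked by round-tripping it through the standard library's XML parser. A minimal sketch, with sample KML inlined so it runs standalone (note that KML orders coordinates longitude-first):

```python
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"

def validate_kml(kml_string):
    """Parse serialized KML and recover the (lat, lon) tuple from its Point."""
    root = ET.fromstring(kml_string)
    coords_el = root.find(f".//{KML_NS}Point/{KML_NS}coordinates")
    if coords_el is None:
        raise ValueError("No <coordinates> element found")
    lon, lat, _alt = coords_el.text.strip().split(",")
    return float(lat), float(lon)

sample = (
    '<kml xmlns="http://www.opengis.net/kml/2.2">'
    '<Placemark><name>Amsterdam City Center</name>'
    '<Point><coordinates>4.9041,52.3676,0</coordinates></Point>'
    '</Placemark></kml>'
)
print(validate_kml(sample))
```

A failed parse here surfaces malformed XML or a swapped coordinate order before the file ever reaches a rendering layer.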

Structural Parsing of Recipe Aggregators

Cleaning raw HTML into consumption-ready datasets requires targeted DOM navigation. Modern recipe portals frequently employ predictable class hierarchies, enabling deterministic scrapers. The following implementation abstracts common extraction patterns into a single utility function, utilizing CSS selector fallbacks and text normalization.

from bs4 import BeautifulSoup
import requests

def aggregate_recipe_data(page_url):
    resp = requests.get(page_url, timeout=10)
    resp.raise_for_status()
    
    parser = BeautifulSoup(resp.content, "lxml")
    payload = {"title": "", "steps": [], "components": []}
    
    # Resolve title via breadcrumb-style delimiter
    header_block = parser.select_one(".post-header, h1.recipe-title")
    if header_block:
        raw_title = header_block.get_text(separator="|", strip=True)
        payload["title"] = raw_title.split("|")[-1].strip()
        
    # Collect ingredient lists
    ing_container = parser.select_one(".recipe-ingredients, ul.ingredients-list")
    if ing_container:
        payload["components"] = [
            item.get_text(strip=True) 
            for item in ing_container.select("li:not(.skip-this)")
        ]
        
    # Collect procedural steps
    step_container = parser.select_one(".recipe-instructions, ol.instructions")
    if step_container:
        payload["steps"] = [
            step.get_text(strip=True) 
            for step in step_container.select("li.step-item")
        ]
        
    return payload

# Execution:
sample_url = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara/"
recipe_data = aggregate_recipe_data(sample_url)
print(f"Structure extracted: {list(recipe_data.keys())}")

This design isolates boundary conditions, applies selective text stripping, and returns a normalized dictionary. When deployed across multiple domains, swapping the selector tuples accommodates platform-specific DOM variations without altering the core parsing logic.
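One way to realize that selector swap explicitly is a per-domain profile table. The second domain and its selectors below are hypothetical, added purely to show the shape of the mapping:

```python
from urllib.parse import urlparse

# Per-domain selector profiles; real sites need manual DOM inspection first.
SELECTOR_PROFILES = {
    "acouplecooks.com": {
        "title": ".post-header, h1.recipe-title",
        "components": ".recipe-ingredients, ul.ingredients-list",
        "steps": ".recipe-instructions, ol.instructions",
    },
    # Hypothetical second platform with its own class hierarchy.
    "example-recipes.test": {
        "title": "h1.entry-title",
        "components": "div.ingredients li",
        "steps": "div.method li",
    },
}

def profile_for(url):
    """Pick the selector profile matching the page's host."""
    host = urlparse(url).netloc
    for domain, profile in SELECTOR_PROFILES.items():
        if host.endswith(domain):
            return profile
    raise KeyError(f"No selector profile for {host}")

print(profile_for("https://www.acouplecooks.com/some-recipe/")["title"])
```

The parser body then stays constant while new platforms are onboarded by adding one dictionary entry.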

Querying Schema.org Graphs with RDFLib

Embedded microdata adhering to Schema.org vocabularies can be transformed into triplestore-compatible graphs. Libraries like rdflib automate the conversion of inline HTML annotations into Subject-Predicate-Object tuples, enabling SPARQL querying and format serialization.

from itertools import islice

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

def analyze_microdata_graph(endpoint_uri):
    rdf_graph = Graph()
    try:
        # Parsing HTML relies on rdflib's structured-data (RDFa/microdata)
        # plugins; older releases also require html5lib to be installed.
        rdf_graph.parse(source=endpoint_uri, format="html")
    except Exception as e:
        raise ValueError(f"Graph construction failed: {e}")

    schema_ns = Namespace("http://schema.org/")
    # subjects() yields a generator, so count it rather than calling len()
    subject_count = sum(1 for _ in rdf_graph.subjects(RDF.type, schema_ns.Thing))

    print(f"Resolved triples: {len(rdf_graph)}")
    print(f"Inferred entities: {subject_count}")

    # Display the first few node relationships (a Graph is not sliceable)
    for idx, (subj, pred, obj) in enumerate(islice(rdf_graph, 15)):
        print(f"[{idx}] {str(subj)[:40]:<40} | {str(pred)[:30]:<30} | {str(obj)[:20]}")

    return rdf_graph

# Generate and export the derived model
model = analyze_microdata_graph("https://dbpedia.org/resource/Michael_Jackson")
serialized_output = model.serialize(format="json-ld")
print(serialized_output[:150] + "...")

The pipeline above validates the ingestion step, filters top-level type assertions, and exports the graph into JSON-LD. Shifting output formats allows interoperability testing between JSON-based API consumers and traditional XML pipelines. Comparing implementations like Open Graph tags against Schema.org reveals distinct design philosophies: social graph protocols prioritize engagement metrics and share previews, whereas vocabulary-driven schemas emphasize entity resolution, property cardinality, and cross-platform standardization.
