Semantic Data Extraction on the Modern Web: Microformats and HTML Parsing
Foundations of Microformats and Semantic Markup
Microformats are lightweight, standards-based conventions that embed structured metadata directly into existing HTML elements. By reusing familiar attributes such as class and rel, developers can annotate human-readable content without introducing proprietary markup. This approach maintains backward compatibility with browsers while enabling automated parsers, search engines, and aggregation tools to interpret page semantics accurately.
For instance, a basic hyperlink can be enhanced with a homepage indicator:
<a href="https://example.com">Company Website</a>
Applying microformat conventions transforms it into a semantically rich node:
<a class="url" rel="homepage" href="https://example.com">Company Website</a>
The widely adopted hCard specification maps contact data onto ordinary HTML elements: a parent container carries the vcard class, and standardized child classes such as fn (formatted name), org (organization), tel (telephone), and email identify individual fields. These annotations allow downstream applications to extract contact records reliably.
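As an illustration, the short sketch below parses a minimal hCard fragment with BeautifulSoup; the class names follow the specification, while the contact values themselves are hypothetical.
from bs4 import BeautifulSoup

# A minimal, hypothetical hCard fragment using the vcard conventions
hcard_html = """
<div class="vcard">
  <span class="fn">Ada Lovelace</span>
  <span class="org">Analytical Engines Ltd.</span>
  <span class="tel">+44 20 7946 0000</span>
  <a class="email" href="mailto:ada@example.com">ada@example.com</a>
</div>
"""

card = BeautifulSoup(hcard_html, "html.parser").select_one(".vcard")
contact = {
    field: card.select_one(f".{field}").get_text(strip=True)
    for field in ("fn", "org", "tel", "email")
}
print(contact)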
Extracting Geospatial Coordinates from Static Pages
Many legacy websites embed location data within invisible or minimally styled elements. Common patterns include concatenated strings separated by semicolons or dedicated latitude/longitude child nodes. The following routine demonstrates a resilient extraction method that adapts to both structures.
import requests
from bs4 import BeautifulSoup

def fetch_geolocation(target_url):
    headers = {"User-Agent": "DataExtractionBot/1.0"}
    response = requests.get(target_url, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    geo_node = soup.select_one("span.geo")
    if not geo_node:
        return None
    raw_text = geo_node.get_text().strip()
    # Pattern A: compact "lat; lon" format
    if ";" in raw_text:
        lat_str, lon_str = raw_text.split(";", 1)
        return (lat_str.strip(), lon_str.strip())
    # Pattern B: nested <span> children with explicit classes
    lat_el = geo_node.find(class_="latitude")
    lon_el = geo_node.find(class_="longitude")
    if lat_el and lon_el:
        return (lat_el.get_text(strip=True), lon_el.get_text(strip=True))
    return None

# Usage example:
coords = fetch_geolocation("https://en.wikipedia.org/wiki/Amsterdam")
print(f"Parsed coordinates: {coords}")
Converting Extracted Points to KML for Map Visualization
Once coordinates are isolated, they can be serialized into Keyhole Markup Language (KML) for integration with geographic platforms. KML relies on hierarchical XML tags rather than inline HTML attributes, which technically places it outside the microformat definition. While microformats require semantics to reside within HTML/CSS structures, KML operates as a standalone document exchange format optimized for spatial rendering.
def generate_kml(coordinates, placename="Extracted Location"):
    lat, lon = coordinates
    kml_template = f'''<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>{placename}</name>
    <Point>
      <coordinates>{lon},{lat},0</coordinates>
    </Point>
  </Placemark>
</kml>'''
    return kml_template

kml_output = generate_kml(coords, "Amsterdam City Center")
print(kml_output[:100] + "...")
To visualize this payload, host the generated string on a publicly accessible endpoint, then pass the resulting URL to a KML overlay layer in mapping libraries. Most GIS frameworks automatically parse the XML tree and render a marker at the specified coordinates.
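For local experimentation, one minimal approach is to persist the payload to disk and serve its directory with Python's built-in HTTP server; the file name below is illustrative, and hosted map services will need a publicly reachable URL rather than localhost.
from pathlib import Path

# Write the KML document; running `python -m http.server 8000` in the
# same directory then exposes it at
# http://localhost:8000/extracted_location.kml for overlay layers.
Path("extracted_location.kml").write_text(kml_output, encoding="utf-8")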
Structural Parsing of Recipe Aggregators
Cleaning raw HTML into consumption-ready datasets requires targeted DOM navigation. Modern recipe portals frequently employ predictable class hierarchies, enabling deterministic scrapers. The following implementation abstracts common extraction patterns into a single utility function, utilizing CSS selector fallbacks and text normalization.
from bs4 import BeautifulSoup
import requests

def aggregate_recipe_data(page_url):
    resp = requests.get(page_url)
    resp.raise_for_status()
    parser = BeautifulSoup(resp.content, "lxml")
    payload = {"title": "", "steps": [], "components": []}
    # Resolve title via breadcrumb-style delimiter
    header_block = parser.select_one(".post-header, h1.recipe-title")
    if header_block:
        raw_title = header_block.get_text(separator="|", strip=True)
        payload["title"] = raw_title.split("|")[-1].strip()
    # Collect ingredient lists
    ing_container = parser.select_one(".recipe-ingredients, ul.ingredients-list")
    if ing_container:
        payload["components"] = [
            item.get_text(strip=True)
            for item in ing_container.select("li:not(.skip-this)")
        ]
    # Collect procedural steps
    step_container = parser.select_one(".recipe-instructions, ol.instructions")
    if step_container:
        payload["steps"] = [
            step.get_text(strip=True)
            for step in step_container.select("li.step-item")
        ]
    return payload

# Execution:
sample_url = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara/"
recipe_data = aggregate_recipe_data(sample_url)
print(f"Structure extracted: {list(recipe_data.keys())}")
This design guards against missing nodes, applies selective text stripping, and returns a normalized dictionary. When deployed across multiple domains, swapping the selector tuples accommodates platform-specific DOM variations without altering the core parsing logic, as sketched below.
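The mapping below pairs each domain with its selector tuples; the second profile and all of its selectors are hypothetical placeholders, not values verified against a real site.
# Hypothetical per-domain selector profiles: swapping the selector
# strings retargets the scraper without touching the parsing logic.
SELECTOR_PROFILES = {
    "acouplecooks.com": {
        "title": ".post-header, h1.recipe-title",
        "components": ".recipe-ingredients, ul.ingredients-list",
        "steps": ".recipe-instructions, ol.instructions",
    },
    "other-recipes.example": {  # hypothetical second platform
        "title": "h1.entry-title",
        "components": "div.ingredients ul",
        "steps": "div.method ol",
    },
}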
Querying Schema.org Graphs with RDFLib
Embedded metadata adhering to Schema.org vocabularies can be transformed into triplestore-compatible graphs. rdflib ingests standard RDF serializations (RDF/XML, Turtle, JSON-LD) and exposes them as Subject-Predicate-Object triples, enabling SPARQL querying and format conversion. Sources such as DBpedia publish their graphs through content negotiation, so a resource URI can be parsed directly; extracting microdata from arbitrary HTML pages generally requires a dedicated extraction library (extruct is one such tool) first.
from itertools import islice
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

def analyze_microdata_graph(endpoint_uri):
    rdf_graph = Graph()
    try:
        # rdflib content-negotiates an RDF serialization when handed a
        # dereferenceable URI such as a DBpedia resource.
        rdf_graph.parse(source=endpoint_uri)
    except Exception as e:
        raise ValueError(f"Graph construction failed: {e}")
    schema_ns = Namespace("http://schema.org/")
    # subjects() yields lazily, so materialize before counting
    typed_subjects = set(rdf_graph.subjects(RDF.type, schema_ns.Thing))
    print(f"Resolved triples: {len(rdf_graph)}")
    print(f"Inferred entities: {len(typed_subjects)}")
    # Display initial node relationships (Graph is iterable, not sliceable)
    for idx, (subj, pred, obj) in enumerate(islice(rdf_graph, 15)):
        print(f"[{idx}] {str(subj)[:40]:<40} | {str(pred)[:30]:<30} | {str(obj)[:20]}")
    return rdf_graph

# Generate and export the derived model
model = analyze_microdata_graph("https://dbpedia.org/resource/Michael_Jackson")
serialized_output = model.serialize(format="json-ld")
print(serialized_output[:150] + "...")
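With the graph in memory, the SPARQL querying mentioned above becomes available through Graph.query; this is a minimal probe, since which predicates actually appear depends on what the source publishes.
# List a handful of predicate/object pairs from the derived graph
results = model.query("""
    SELECT ?p ?o
    WHERE { ?s ?p ?o . }
    LIMIT 5
""")
for row in results:
    print(row.p, "->", row.o)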
The pipeline above validates the ingestion step, counts top-level type assertions, and exports the graph as JSON-LD. Switching output formats allows interoperability testing between JSON-based API consumers and traditional XML pipelines. Comparing implementations like Open Graph tags against Schema.org reveals distinct design philosophies: social graph protocols prioritize engagement metrics and share previews, whereas vocabulary-driven schemas emphasize entity resolution, property cardinality, and cross-platform standardization.
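To make the contrast concrete, compare a typical Open Graph meta tag with an equivalent Schema.org microdata annotation; the attribute values here are illustrative:
<!-- Open Graph: page-level metadata aimed at share previews -->
<meta property="og:title" content="Michael Jackson" />

<!-- Schema.org microdata: a typed entity with addressable properties -->
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="name">Michael Jackson</span>
</div>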