Building Knowledge Graphs from Open Data: Architecture and Lessons

A knowledge graph is more than a database. It is a representation of reality where entities (people, organizations, documents, events) are connected by typed, directed relationships. Building one from open data sources is an exercise in controlled chaos.

The ingestion challenge

Open data comes in every format imaginable. RSS feeds deliver XML. Government APIs use OData, SPARQL, SRU, or custom REST. Patent offices publish bulk downloads. Court systems use proprietary schemas. News sources require scraping.

The first architectural decision is a connector abstraction layer. Every source, regardless of protocol or format, must produce the same output: a normalized item with title, content, source attribution, timestamp, and URL. The connector handles protocol-specific logic; the pipeline handles everything downstream.

Prioris uses seven connector types: RSS, REST API, web scraper (Playwright), CKAN, SPARQL, OData, and SDMX. Each connector type has a base class with error boundaries, retry logic, rate limiting, and health monitoring. Adding a new source of a supported type means creating a configuration entry, not writing code.
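To make the connector contract concrete, here is a minimal sketch in Python. It is not Prioris code: the names NormalizedItem, Connector, and StaticFeedConnector, the retry defaults, and the record fields (title, summary, published, link) are all assumptions chosen for illustration. Rate limiting is left out to keep the example short.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class NormalizedItem:
    """The single output shape every connector must produce."""
    title: str
    content: str
    source: str
    timestamp: datetime
    url: str


class Connector(ABC):
    """Hypothetical base class: subclasses supply fetch_raw and normalize."""

    def __init__(self, source: str, max_retries: int = 3, backoff_s: float = 1.0):
        self.source = source
        self.max_retries = max_retries
        self.backoff_s = backoff_s
        self.consecutive_failures = 0  # crude health signal

    @abstractmethod
    def fetch_raw(self) -> Iterable[dict]:
        """Protocol-specific retrieval (RSS, REST, SPARQL, ...)."""

    @abstractmethod
    def normalize(self, raw: dict) -> NormalizedItem:
        """Map one raw record to the shared item shape."""

    def _fetch_with_retry(self) -> List[dict]:
        for attempt in range(self.max_retries):
            try:
                records = list(self.fetch_raw())
                self.consecutive_failures = 0
                return records
            except OSError:
                # Network-style failure: count it, back off, try again.
                self.consecutive_failures += 1
                time.sleep(self.backoff_s * 2 ** attempt)
        return []

    def items(self) -> Iterator[NormalizedItem]:
        for raw in self._fetch_with_retry():
            try:
                yield self.normalize(raw)
            except (KeyError, ValueError):
                continue  # error boundary: one bad record does not stop the feed


class StaticFeedConnector(Connector):
    """Stand-in for a real connector; serves records from memory."""

    def __init__(self, source: str, records: List[dict]):
        super().__init__(source)
        self.records = records

    def fetch_raw(self) -> Iterable[dict]:
        return self.records

    def normalize(self, raw: dict) -> NormalizedItem:
        return NormalizedItem(
            title=raw["title"].strip(),
            content=raw.get("summary", ""),
            source=self.source,
            timestamp=datetime.fromisoformat(raw["published"]),
            url=raw["link"],
        )
```

With this shape, the downstream pipeline only ever calls items() and never needs to know which protocol produced a record.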

Deduplication at scale

The same event is often reported by multiple sources. A drug approval appears in the FDA database, the manufacturer's press release, trade publications, and news outlets. Without deduplication, the knowledge graph would be dominated by redundant information.

Prioris uses three-layer deduplication:

URL hash: exact duplicate detection, O(1) lookup
Title fuzzy matching: catches reformulations and translations (threshold: 0.85 similarity)
Semantic embedding similarity: catches articles about the same event with entirely different framing (threshold: 0.92 cosine similarity)
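The three layers can be sketched as a cascade that runs the cheapest check first. This is an illustration under assumptions, not the Prioris implementation: difflib's SequenceMatcher stands in for whatever fuzzy matcher is actually used, embeddings are assumed to arrive as plain lists of floats from an upstream model, and the linear scan over kept items would need an index at real scale.

```python
import hashlib
import math
from difflib import SequenceMatcher
from typing import List, Optional, Sequence, Set, Tuple

TITLE_THRESHOLD = 0.85      # from the text above
EMBEDDING_THRESHOLD = 0.92  # from the text above


def url_key(url: str) -> str:
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def title_similarity(a: str, b: str) -> float:
    # SequenceMatcher is a stand-in; any string similarity in [0, 1] would do.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


class Deduplicator:
    def __init__(self) -> None:
        self.seen_urls: Set[str] = set()
        self.kept: List[Tuple[str, Sequence[float]]] = []

    def check(self, url: str, title: str, embedding: Sequence[float]) -> Optional[str]:
        """Return the layer that flagged a duplicate, or None for a new item."""
        key = url_key(url)
        if key in self.seen_urls:          # layer 1: set lookup
            return "url"
        self.seen_urls.add(key)
        for kept_title, _ in self.kept:    # layer 2: fuzzy title match
            if title_similarity(title, kept_title) >= TITLE_THRESHOLD:
                return "title"
        for _, kept_emb in self.kept:      # layer 3: embedding similarity
            if cosine(embedding, kept_emb) >= EMBEDDING_THRESHOLD:
                return "embedding"
        self.kept.append((title, embedding))
        return None
```

Returning the name of the layer that fired, rather than a bare boolean, makes it easier to audit which threshold is doing the work.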

Entity resolution

The same entity appears in many forms across sources. Named entity recognition (NER) extracts mentions from text. Wikidata provides canonical identifiers (Q-numbers) for disambiguation. The resolution pipeline: extract mentions, search Wikidata, resolve to Q-numbers, merge duplicates.
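A rough sketch of the resolve-and-merge steps follows. It assumes mentions have already been extracted by an NER step, and it replaces the Wikidata lookup with a small in-memory table so the example runs offline. The function names and the table are hypothetical; a real system would query Wikidata's search API and would need disambiguation logic for mentions with several candidates.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# Illustrative stand-in for a Wikidata search; not a real index.
STUB_INDEX = {
    "google": "Q95",
    "google llc": "Q95",
    "douglas adams": "Q42",
    "douglas noel adams": "Q42",
}


def stub_search(mention: str) -> Optional[str]:
    """Normalize case and whitespace, then look up a Q-number."""
    return STUB_INDEX.get(" ".join(mention.lower().split()))


def resolve_mentions(
    mentions: Iterable[str],
    search: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, Set[str]], List[str]]:
    """Group surface forms under their Q-number; collect what could not be resolved."""
    merged: Dict[str, Set[str]] = defaultdict(set)
    unresolved: List[str] = []
    for mention in mentions:
        qid = search(mention)
        if qid is None:
            unresolved.append(mention)
        else:
            merged[qid].add(mention)  # merging: many surface forms, one node
    return dict(merged), unresolved
```

Keeping unresolved mentions in a separate list, instead of dropping them, leaves room for later review or for entities that have no Wikidata entry.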

Entity resolution is the foundation of the knowledge graph. Without it, you have a document database. With it, you can answer questions like "Show me everything related to this organization across all domains."

Relationship extraction

Typed relationships connect entities in the knowledge graph. A court ruling cites a regulation. A patent references a research paper. An organization lobbied for a policy change. A person serves as director of a company.

Relationship extraction uses a combination of rule-based patterns (for structured sources like patent citations) and LLM-based extraction (for unstructured text). Each relationship has a type, direction, confidence score, and provenance chain back to the source item.
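As an illustration of the rule-based side, the sketch below models a relationship record and extracts "cites" edges from a patent's citation field. The Relationship fields mirror the attributes listed above, but the class, the regular expression, and the assumed citation format (identifiers such as US1234567 in a free-text field) are hypothetical. The LLM-based path is not shown; it would emit the same record type with a lower confidence.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Relationship:
    source_entity: str   # the edge points from source_entity ...
    target_entity: str   # ... to target_entity
    rel_type: str
    confidence: float
    provenance: Tuple[str, str]  # (source item id, extraction method)


# Assumed identifier format, for illustration only.
PATENT_ID = re.compile(r"\bUS\d{7,8}\b")


def extract_citations(item_id: str, patent_id: str, citations_field: str) -> List[Relationship]:
    """Rule-based extraction: every patent id in the field becomes a 'cites' edge."""
    return [
        Relationship(
            source_entity=patent_id,
            target_entity=cited,
            rel_type="cites",
            confidence=1.0,  # structured field, so the rule is treated as certain
            provenance=(item_id, "rule:patent_citation"),
        )
        for cited in PATENT_ID.findall(citations_field)
        if cited != patent_id  # ignore self-references
    ]
```

Because provenance travels with every edge, any relationship in the graph can be traced back to the item and the method that produced it.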

Contradiction detection

One of the most valuable features of a cross-domain knowledge graph is contradiction detection. When two high-reliability sources make conflicting claims, that contradiction is itself intelligence.

Prioris uses embedding similarity combined with sentiment analysis to flag potential contradictions. When items are semantically similar but make opposing claims, the system creates a contradiction alert with both items, their sources, and their reliability scores.
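One way to read "semantically similar but opposing" is: high cosine similarity between embeddings, and sentiment scores of opposite sign. The sketch below implements that reading. The thresholds (0.8 similarity, 0.3 minimum polarity), the class names, and the convention that sentiment is a number in [-1, 1] from an upstream model are all assumptions, not values from Prioris.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Item:
    item_id: str
    source: str
    reliability: float
    embedding: Sequence[float]
    sentiment: float  # assumed to lie in [-1, 1]


@dataclass
class ContradictionAlert:
    item_a: Item
    item_b: Item
    similarity: float


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def detect_contradiction(
    a: Item,
    b: Item,
    min_similarity: float = 0.8,  # assumed threshold
    min_polarity: float = 0.3,    # assumed threshold
) -> Optional[ContradictionAlert]:
    """Flag a pair that covers the same subject with opposite polarity."""
    similarity = cosine(a.embedding, b.embedding)
    opposite_sign = a.sentiment * b.sentiment < 0
    both_clear = min(abs(a.sentiment), abs(b.sentiment)) >= min_polarity
    if similarity >= min_similarity and opposite_sign and both_clear:
        return ContradictionAlert(a, b, similarity)
    return None
```

Opposite sentiment is only a proxy for opposing claims, so alerts produced this way are best treated as candidates for review rather than confirmed contradictions.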

Lessons learned

Source reliability varies wildly. Government primary sources are 95%+ reliable. Trade publications are 80-90%. News aggregators can be below 70%. Bayesian priors on source reliability, updated by track record, prevent low-quality sources from polluting the graph.
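A common way to express a prior that is updated by track record is a Beta distribution over the probability that a source's claim holds up. The sketch below assumes that model. The class name, the prior strength of 20 pseudo-observations, and the confirmed/refuted counting scheme are illustrative choices, not details from Prioris.

```python
from dataclasses import dataclass


@dataclass
class SourceReliability:
    """Beta(alpha, beta) belief about how often a source's claims hold up."""
    alpha: float
    beta: float

    @classmethod
    def from_prior(cls, mean: float, strength: float = 20.0) -> "SourceReliability":
        # strength is the number of pseudo-observations behind the prior (assumed).
        return cls(alpha=mean * strength, beta=(1.0 - mean) * strength)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def update(self, confirmed: int, refuted: int) -> None:
        # Conjugate update: each confirmed or refuted claim adds one count.
        self.alpha += confirmed
        self.beta += refuted


government = SourceReliability.from_prior(0.95)
aggregator = SourceReliability.from_prior(0.70)
aggregator.update(confirmed=2, refuted=8)  # a poor track record pulls the estimate down
```

A larger strength makes the prior harder to move, which suits sources with a long history; a smaller one lets a new source earn or lose trust quickly.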

Temporal context matters. A connection between two items published on the same day is more meaningful than one between items published months apart. Time-weighted edge scoring captures this.
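Time-weighted scoring can take many forms. The sketch below uses exponential decay, where an edge loses half its weight for every half-life that separates the two publication dates. The function name and the 30-day half-life are assumptions for illustration; the text does not say which decay function Prioris uses.

```python
from datetime import datetime


def time_weighted_score(
    base_score: float,
    published_a: datetime,
    published_b: datetime,
    half_life_days: float = 30.0,  # assumed; tune per domain
) -> float:
    """Scale an edge score down as the publication gap grows."""
    gap_days = abs((published_a - published_b).total_seconds()) / 86400.0
    return base_score * 0.5 ** (gap_days / half_life_days)
```

Under these assumptions, two items published on the same day keep the full base score, while items published 30 days apart keep half of it.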

Scale is the feature. A knowledge graph with 100 items is a database. A knowledge graph with 100,000 items connected across 26 domains starts revealing patterns that no human analyst could find manually.