Provenance¶
Every edge in Hypabase carries two provenance fields: source and confidence. These are first-class parts of the data model, not bolted-on metadata.
Setting provenance per-edge¶
hb.edge(
["patient_123", "aspirin", "ibuprofen"],
type="drug_interaction",
source="clinical_decision_support_v3",
confidence=0.92,
)
If omitted, source defaults to "unknown" and confidence defaults to 1.0.
Context manager for bulk provenance¶
Use hb.context() to set default provenance for a block of operations:
with hb.context(source="clinical_records", confidence=0.95):
hb.edge(
["dr_smith", "patient_a", "aspirin", "headache", "mercy_hospital"],
type="treatment",
)
hb.edge(
["dr_jones", "patient_b", "ibuprofen", "fever"],
type="treatment",
)
# Both edges get source="clinical_records", confidence=0.95
Override within a context¶
Per-edge values override the context defaults:
with hb.context(source="extraction", confidence=0.8):
hb.edge(["a", "b"], type="x") # confidence=0.8
hb.edge(["c", "d"], type="y", confidence=0.99) # confidence=0.99
Nested contexts¶
Contexts can nest. The innermost context wins:
with hb.context(source="system_a", confidence=0.9):
hb.edge(["a", "b"], type="x") # source="system_a"
with hb.context(source="system_b", confidence=0.7):
hb.edge(["c", "d"], type="y") # source="system_b"
hb.edge(["e", "f"], type="z") # source="system_a" (restored)
Querying by provenance¶
Filter by source¶
Filter by confidence threshold¶
Combine provenance with other filters¶
Aggregating sources¶
The sources() method provides an overview of all provenance sources:
sources = hb.sources()
# [
# {"source": "clinical_records", "edge_count": 2, "avg_confidence": 0.95},
# {"source": "lab_results", "edge_count": 1, "avg_confidence": 0.88},
# ]
Each entry includes:
source— the source stringedge_count— number of edges from this sourceavg_confidence— mean confidence across all edges from this source
Use cases¶
Multi-source knowledge graphs¶
Track which AI model, document, or human produced each fact:
with hb.context(source="gpt-4o_extraction", confidence=0.85):
hb.edge(["transformer", "attention", "nlp"], type="concept_link")
with hb.context(source="manual_review", confidence=0.99):
hb.edge(["transformer", "attention", "nlp"], type="concept_link_verified")
Audit trails¶
Know exactly which source contributed each relationship:
# What did the legal review say?
legal_edges = hb.edges(source="legal_review")
# What do we trust?
trusted = hb.edges(min_confidence=0.85)
# What's unreliable?
all_sources = hb.sources()
low_quality = [s for s in all_sources if s["avg_confidence"] < 0.7]
Confidence-based retrieval¶
In RAG pipelines, retrieve only high-confidence relationships: