RAG Extraction Pipeline¶
Build a knowledge graph from document extractions, storing entities and relationships with per-source confidence scores.
Setup¶
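All examples below assume a client bound to the name `hb`. The package name and constructor in this sketch are assumptions for illustration, not confirmed Hypabase API; adjust them to match your installation:

import hypabase

# Assumed import and constructor; substitute your actual client setup
hb = hypabase.Client()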
Extract and store¶
Simulate extracting facts from three documents with different confidence levels:
# High-quality academic paper
with hb.context(source="doc_arxiv_2401", confidence=0.92):
    hb.edge(["transformer", "attention", "nlp"], type="concept_link")
    hb.edge(["bert", "transformer", "pretraining"], type="builds_on")

# Blog post (lower confidence)
with hb.context(source="doc_blog_post", confidence=0.75):
    hb.edge(["transformer", "gpu", "training"], type="requires")
    hb.edge(["attention", "memory", "scaling"], type="tradeoff")

# Textbook with moderate confidence
with hb.context(source="doc_textbook_ch5", confidence=0.5):
    hb.edge(["rnn", "lstm", "attention"], type="evolution")
Each extraction batch gets its own source and confidence, and the provenance context manager stamps them onto every edge created inside the block.
Query patterns¶
Entity retrieval¶
Find all relationships involving a concept, using the same `containing=` filter the retrieval function at the end of this page relies on:
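# Every edge that mentions "transformer", from any source
transformer_edges = hb.edges(containing=["transformer"])
for e in transformer_edges:
    print(e.type, e.node_ids, e.source, e.confidence)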
Source filtering¶
Retrieve facts from a specific document. Only the write-side `source=` keyword appears above, so treating it as a read filter here is an assumption:
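# Everything extracted from the blog post (source= as a query filter
# is assumed to mirror the write-side context keyword)
blog_facts = hb.edges(source="doc_blog_post")
# Expected: the requires and tradeoff edges stored earlier, both at 0.75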
Confidence-based retrieval¶
Only include high-quality extractions in your RAG context:
high_quality = hb.edges(min_confidence=0.8)
# Returns 2 edges (arxiv paper), excludes blog post and textbook
Multi-hop discovery¶
Find paths between concepts across documents. This page shows no dedicated traversal API, so the sketch below builds multi-hop discovery from the documented `edges(containing=...)` call alone:
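from collections import deque

def find_path(start, goal, hb, max_hops=3):
    """Breadth-first walk from one entity to another across stored edges.

    Uses only hb.edges(containing=...); swap in a native traversal
    call if your Hypabase version provides one."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if len(path) - 1 >= max_hops:
            continue
        for edge in hb.edges(containing=[node]):
            for neighbor in edge.node_ids:
                if neighbor == goal:
                    return path + [neighbor]
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, path + [neighbor]))
    return None

# With the sample data, "rnn" reaches "nlp" through "attention":
# the textbook's evolution edge and the arxiv concept_link share it.
print(find_path("rnn", "nlp", hb))  # ["rnn", "attention", "nlp"]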
N-ary fact preservation¶
A single edge stores the 3-way concept link:
concept_links = hb.edges(type="concept_link")
assert len(concept_links[0].node_ids) == 3
# ["transformer", "attention", "nlp"] — not three separate pairs
Integration with LLM extraction¶
A typical pipeline:
def extract_and_store(document_id, text, hb):
    """Extract facts from text using an LLM and store them in Hypabase."""
    # Your LLM extraction logic here.
    # Returns: [{"entities": [...], "type": "...", "confidence": ...}, ...]
    extractions = llm_extract(text)

    with hb.context(source=document_id, confidence=0.85):
        with hb.batch():  # Single save for all extractions
            for fact in extractions:
                hb.edge(
                    fact["entities"],
                    type=fact["type"],
                    confidence=fact.get("confidence"),  # Per-fact score, if the LLM provides one, overrides the context default
                )
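A call looks like this, assuming you supply `llm_extract` (the document ID and text here are illustrative):

text = "Transformers replace recurrence with attention..."
extract_and_store("doc_example_note", text, hb)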
RAG retrieval function¶
def retrieve_context(query_entities, hb, min_confidence=0.7):
    """Retrieve structured relationships for RAG context."""
    edges = hb.edges(
        containing=query_entities,
        min_confidence=min_confidence,
    )

    # Format for LLM context
    facts = []
    for e in edges:
        facts.append(
            f"{e.type}: {' + '.join(e.node_ids)} "
            f"(source={e.source}, confidence={e.confidence})"
        )
    return "\n".join(facts)
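Wired into a prompt (the template below is illustrative, not Hypabase API):

context = retrieve_context(["transformer", "attention"], hb)
prompt = f"Known relationships:\n{context}\n\nQuestion: How do transformers use attention?"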
This gives your LLM structured, provenance-tracked relationships as context.