RAG Extraction Pipeline

Build a knowledge graph from document extractions, storing entities and relationships with per-source confidence scores.

Setup

from hypabase import Hypabase

hb = Hypabase("knowledge.db")

Extract and store

Simulate extracting facts from three documents with different confidence levels:

# High-quality academic paper
with hb.context(source="doc_arxiv_2401", confidence=0.92):
    hb.edge(["transformer", "attention", "nlp"], type="concept_link")
    hb.edge(["bert", "transformer", "pretraining"], type="builds_on")

# Blog post — lower confidence
with hb.context(source="doc_blog_post", confidence=0.75):
    hb.edge(["transformer", "gpu", "training"], type="requires")
    hb.edge(["attention", "memory", "scaling"], type="tradeoff")

# Textbook with moderate confidence
with hb.context(source="doc_textbook_ch5", confidence=0.5):
    hb.edge(["rnn", "lstm", "attention"], type="evolution")

Each extraction batch gets its own source and confidence. Every edge created inside an hb.context(...) block is tagged with that block's values, so they do not have to be repeated on each hb.edge() call.
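To illustrate how a batch can share one source and confidence, here is a minimal sketch of a context manager of this kind. TinyStore and Edge are hypothetical stand-ins written for this example, not Hypabase classes, and the rule that a missing per-edge confidence falls back to the context default is an assumption rather than documented Hypabase behavior.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edge:
    node_ids: List[str]
    type: str
    source: Optional[str]
    confidence: Optional[float]


class TinyStore:
    """Toy in-memory store (hypothetical; a stand-in, not Hypabase itself)."""

    def __init__(self):
        self.edges = []
        self._defaults = {"source": None, "confidence": None}

    @contextmanager
    def context(self, source, confidence):
        # Swap in the new defaults and restore the previous ones on exit,
        # even if the body raises.
        previous = self._defaults
        self._defaults = {"source": source, "confidence": confidence}
        try:
            yield self
        finally:
            self._defaults = previous

    def edge(self, node_ids, type, confidence=None):
        # Assumption: an explicit per-edge confidence wins over the context default.
        conf = self._defaults["confidence"] if confidence is None else confidence
        new_edge = Edge(list(node_ids), type, self._defaults["source"], conf)
        self.edges.append(new_edge)
        return new_edge


store = TinyStore()
with store.context(source="doc_blog_post", confidence=0.75):
    inherited = store.edge(["transformer", "gpu", "training"], type="requires")
    overridden = store.edge(
        ["attention", "memory", "scaling"], type="tradeoff", confidence=0.6
    )
outside = store.edge(["rnn", "lstm", "attention"], type="evolution")
```

In this sketch, inherited picks up the blog post's source and 0.75 confidence, overridden keeps the source but uses its own 0.6, and outside has no provenance because it was created after the block exited.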

Query patterns

Entity retrieval

Find all relationships involving a concept:

edges = hb.edges(containing=["transformer"])
# Returns 3 edges: concept_link, builds_on, requires

Source filtering

Retrieve facts from a specific document:

edges = hb.edges(source="doc_arxiv_2401")
# Returns 2 edges from the arxiv paper

Confidence-based retrieval

Only include high-quality extractions in your RAG context:

high_quality = hb.edges(min_confidence=0.8)
# Returns 2 edges (arxiv paper), excludes blog post and textbook

Multi-hop discovery

Find paths between concepts across documents:

paths = hb.paths("bert", "nlp")
# bert → transformer → nlp (two hops, via the builds_on and concept_link edges)
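One way to think about multi-hop discovery over hyperedges is breadth-first search in which two nodes count as adjacent whenever they appear in the same edge. The sketch below applies that idea to the five edges stored earlier, using plain tuples. The shortest_path helper is hypothetical and is not a description of how hb.paths is implemented.

```python
from collections import deque

# The five hyperedges from the extraction step, as plain tuples of node ids.
HYPEREDGES = [
    ("transformer", "attention", "nlp"),
    ("bert", "transformer", "pretraining"),
    ("transformer", "gpu", "training"),
    ("attention", "memory", "scaling"),
    ("rnn", "lstm", "attention"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search; nodes are adjacent if they share a hyperedge."""
    neighbors = {}
    for edge in edges:
        for node in edge:
            neighbors.setdefault(node, set()).update(n for n in edge if n != node)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sort for a deterministic visiting order.
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # No route between the two nodes.


print(shortest_path(HYPEREDGES, "bert", "nlp"))
# ['bert', 'transformer', 'nlp']
```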

N-ary fact preservation

A single edge stores the 3-way concept link:

concept_links = hb.edges(type="concept_link")
assert len(concept_links[0].node_ids) == 3
# ["transformer", "attention", "nlp"] — not three separate pairs
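To see why keeping the 3-way link as one edge matters, consider what happens when it is flattened into binary pairs. The sketch below uses only the standard library and a hypothetical pairwise helper. It shows that one joint fact and three unrelated pairwise facts become indistinguishable after flattening.

```python
from itertools import combinations


def pairwise(edges):
    """Flatten hyperedges into the set of unordered node pairs they imply."""
    return {frozenset(pair) for edge in edges for pair in combinations(edge, 2)}


# One extracted fact that links three concepts at once.
one_joint_fact = [("transformer", "attention", "nlp")]

# Three independent binary facts that happen to cover the same pairs.
three_separate_facts = [
    ("transformer", "attention"),
    ("attention", "nlp"),
    ("transformer", "nlp"),
]

# After flattening, the two situations look identical, so the information
# that all three concepts were asserted together is lost.
same_after_flattening = pairwise(one_joint_fact) == pairwise(three_separate_facts)
print(same_after_flattening)  # True
```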

Integration with LLM extraction

A typical pipeline:

import json

def extract_and_store(document_id, text, hb):
    """Extract facts from text using an LLM and store in Hypabase."""
    # Your LLM extraction logic here; llm_extract is a placeholder you supply.
    # Returns: [{"entities": [...], "type": "...", "confidence": ...}, ...]
    extractions = llm_extract(text)

    with hb.context(source=document_id, confidence=0.85):
        with hb.batch():  # Single save for all extractions
            for fact in extractions:
                hb.edge(
                    fact["entities"],
                    type=fact["type"],
                    confidence=fact.get("confidence"),  # Override if LLM provides per-fact score
                )
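LLM output is often malformed, so it can help to validate extractions before they reach the store. The following is a minimal sketch of such a step, assuming the model replies with a JSON array in the shape shown in the comment above. The parse_extractions helper and its rules (at least two entities, confidence clamped to the range 0 to 1) are illustrative choices, not part of Hypabase.

```python
import json


def parse_extractions(raw):
    """Parse an LLM's JSON reply into clean fact dicts, dropping malformed items."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    facts = []
    for item in items:
        if not isinstance(item, dict):
            continue
        entities = item.get("entities")
        edge_type = item.get("type")
        # Require at least two non-empty string entities.
        if not (
            isinstance(entities, list)
            and len(entities) >= 2
            and all(isinstance(e, str) and e for e in entities)
        ):
            continue
        # Require a non-empty string type.
        if not (isinstance(edge_type, str) and edge_type):
            continue
        fact = {"entities": entities, "type": edge_type}
        conf = item.get("confidence")
        # Keep a per-fact score only if it is numeric; clamp it into [0, 1].
        if isinstance(conf, (int, float)) and not isinstance(conf, bool):
            fact["confidence"] = min(1.0, max(0.0, float(conf)))
        facts.append(fact)
    return facts


reply = (
    '[{"entities": ["bert", "transformer"], "type": "builds_on", "confidence": 1.4},'
    ' {"entities": ["solo"], "type": "x"},'
    ' {"type": "missing_entities"}]'
)
clean = parse_extractions(reply)
```

Here only the first item survives, with its out-of-range confidence clamped to 1.0. Facts without a usable score carry no confidence key, so fact.get("confidence") in the loop above returns None for them.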

RAG retrieval function

def retrieve_context(query_entities, hb, min_confidence=0.7):
    """Retrieve structured relationships for RAG context."""
    edges = hb.edges(
        containing=query_entities,
        min_confidence=min_confidence,
    )
    # Format for LLM context
    facts = []
    for e in edges:
        facts.append(
            f"{e.type}: {' + '.join(e.node_ids)} "
            f"(source={e.source}, confidence={e.confidence})"
        )
    return "\n".join(facts)

This gives your LLM structured, provenance-tracked relationships as context.
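When the prompt budget is tight, one option is to rank the retrieved edges by confidence and keep only the best few. The sketch below assumes edge objects expose the same type, node_ids, source, and confidence attributes used in retrieve_context. The format_top_facts helper is hypothetical, and the SimpleNamespace objects are stand-ins for real edges so the example runs without a database.

```python
from types import SimpleNamespace


def format_top_facts(edges, limit=5):
    """Format the `limit` highest-confidence edges, best first."""
    ranked = sorted(edges, key=lambda e: e.confidence, reverse=True)
    return "\n".join(
        f"{e.type}: {' + '.join(e.node_ids)} "
        f"(source={e.source}, confidence={e.confidence})"
        for e in ranked[:limit]
    )


# Stand-in edge objects mirroring two of the edges stored earlier.
sample_edges = [
    SimpleNamespace(
        type="requires",
        node_ids=["transformer", "gpu", "training"],
        source="doc_blog_post",
        confidence=0.75,
    ),
    SimpleNamespace(
        type="concept_link",
        node_ids=["transformer", "attention", "nlp"],
        source="doc_arxiv_2401",
        confidence=0.92,
    ),
]

context = format_top_facts(sample_edges, limit=1)
print(context)
# concept_link: transformer + attention + nlp (source=doc_arxiv_2401, confidence=0.92)
```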