Agentic Data
An enterprise context graph and agentic retrieval planner that turns fragmented institutional knowledge into a temporal, permissioned evidence graph — then assembles exactly what an AI agent needs through an iterative, budget-aware retrieval loop.
Retrieval is not a feature of an AI agent. It is the core product.
The problem
Enterprise knowledge is dying. Every organization generates a continuous stream of institutional knowledge — pull requests, incident timelines, RCA documents, architecture decisions, customer escalations, deployment logs. These artifacts are scattered across GitHub, Jira, Slack, Confluence, PagerDuty, ServiceNow, and dozens more systems.
The connections between them — this PR caused that incident, which led to this RCA, which changed this decision, which was later reverted — live only in people's heads. When those people leave, the connections disappear. The organization retains its documents but loses its understanding.
This is not a search problem. You can find the documents. What you can't find is the meaning between them.
Not RAG
The retrieval layer we have today — embed, top-k, stuff into prompt — is fundamentally insufficient for enterprise knowledge.
- Documents are not the unit of knowledge. Events are. A PR, an incident, a decision — these are episodes that happened at a specific time, involved specific actors, and had specific consequences.
- Similarity is not relevance. When you ask "why did the payment service go down?", the answer is a causal chain through time, not the 10 most semantically similar paragraphs.
- Retrieval is not a single call. It is an iterative control loop that plans what evidence it needs, searches, evaluates what's missing, expands through graph traversal, and stops when sufficient — or admits that it isn't.
Three interlocking ideas
Every piece of enterprise knowledge is normalized into a typed, timestamped, permissioned Context Object (CO) with actors, entities, summaries, provenance, and validity windows. COs are linked by 35 typed edges spanning causal, decisional, lifecycle, structural, and customer relationships.
Raw data enters at full fidelity, then exists at multiple levels of abstraction: immutable originals in object storage, chunked and embedded segments for hybrid search, and structured canonical Notes that represent "what we believe now" — editable, version-tracked, with evidence citations and confidence scores.
Not embed-and-retrieve. An iterative control loop: interpret task, select strategy, hybrid search with RRF fusion, graph expansion via typed edges, five-component sufficiency scoring, budget-aware stop conditions. The retrieval equivalent of a database query planner.
35 typed edges
Each edge is directional and carries a confidence score (0.0–1.0), a provenance method (deterministic rule, LLM extraction, or human verified), evidence spans pointing back to the source text, and a validity window.
Agentic retrieval loop
Instead of embed(query) → top-k → stuff into prompt, the retrieval planner runs an iterative control loop that hunts for a sufficient evidence graph. The loop is budget-aware (max retrieval calls, graph hops, latency, model calls) and stops under any of three conditions: sufficiency threshold reached, budget exhausted, or diminishing returns detected.
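The loop and its three stop conditions can be sketched as follows. `Budget`, `sufficiency`, and `expand` are illustrative names for this sketch, not the shipped planner API:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    max_calls: int = 8   # cap on retrieval iterations
    max_hops: int = 4    # cap on graph expansion depth

def retrieve(query, search, expand, sufficiency, budget=Budget(),
             threshold=0.8, min_gain=0.02):
    """Iterative, budget-aware retrieval: search, score sufficiency,
    expand through the graph, and stop when one of three conditions hits."""
    evidence, score, calls = [], 0.0, 0
    frontier = search(query)                 # initial hybrid search
    while calls < budget.max_calls:
        calls += 1
        evidence.extend(frontier)
        new_score = sufficiency(query, evidence)
        if new_score >= threshold:           # stop 1: sufficient evidence
            return evidence, "sufficient"
        if new_score - score < min_gain:     # stop 2: diminishing returns
            return evidence, "diminishing_returns"
        score = new_score
        frontier = expand(evidence, max_hops=budget.max_hops)  # graph expansion
    return evidence, "budget_exhausted"      # stop 3: budget spent
```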
Intent-driven strategies
The retrieval planner selects a strategy based on the query's intent, determining which edge types to prioritize during graph expansion.
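As an illustration, strategy selection could be a simple intent-to-edge-types table. The intent names and edge types below are invented for the sketch; the document does not name the five strategies:

```python
# Hypothetical intent → prioritized-edge-types table (all names illustrative).
STRATEGIES = {
    "root_cause":         ["CAUSED_BY", "TRIGGERED", "RESOLVED_BY"],
    "current_state":      ["SUPERSEDES", "DEPRECATES"],
    "decision_rationale": ["DECIDED_IN", "MOTIVATED_BY", "REVISITED_BY"],
    "timeline":           ["PRECEDED_BY", "FOLLOWED_BY"],
    "ownership":          ["OWNED_BY", "ESCALATED_TO"],
}

def select_strategy(intent: str) -> list[str]:
    # Unknown intents fall back to unbiased graph expansion.
    return STRATEGIES.get(intent, [])
```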
What is live now
Agentic Data is operational. The following infrastructure is built and tested.
Architecture
Four data stores, each for what it's best at. FastAPI + SQLAlchemy 2 (async) ties the layers together with JWT auth, RBAC, rate limiting, and Docker Compose for local and staged deployments.
- Postgres — 21 tables: COs, entities, edges, ACLs, audit trails, segments, embeddings
- OpenSearch — co_segments_v1 index with KNN vectors (HNSW, cosine), RRF fusion
- Neo4j — CO nodes, Entity nodes, 35 typed edge relationships, Cypher queries
- S3 — original documents, attachments, diffs, content-addressed keys
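Reciprocal Rank Fusion (RRF), used to merge the lexical and KNN result lists, is compact enough to show in full. This is the standard formulation, not the project's exact code:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids.
    score(d) = sum over lists of 1 / (k + rank_of_d_in_that_list);
    k=60 is the conventional damping constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked well by both retrievers float to the top even when neither list agrees on the exact order.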
Graph query API
Beyond flat search, the graph API provides structured traversals for the questions that actually matter in enterprise operations.
- Expand from seed nodes by typed edges with hop depth control
- Extract a subgraph among a node set, preserving typed connections
- Trace a causal chain from an incident up to 4 hops
- Follow a supersession chain to find the current canonical version
- Reconstruct the decision rationale chain for a service or entity
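The first traversal, typed-edge expansion with hop depth control, can be sketched with an in-memory stand-in for the graph store (the real system issues Cypher against Neo4j):

```python
from collections import deque

def expand_by_type(edges, seeds, edge_types, max_hops=4):
    """Breadth-first expansion from seed node ids along the given edge
    types, up to max_hops. `edges` is an iterable of (src, type, dst)
    triples; an illustrative stand-in for the Neo4j traversal."""
    adj = {}
    for src, etype, dst in edges:
        if etype in edge_types:                  # only follow requested types
            adj.setdefault(src, []).append(dst)
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:                     # hop depth control
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return seen
```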
Truth maintenance
A knowledge system that only accumulates and never revises will inevitably hallucinate with confidence. The canonical memory layer is self-correcting.
- Supersession chains — when a new note replaces an old one, the chain is preserved so you can trace how understanding evolved
- Deprecation markers — disputed or outdated notes are flagged, not deleted, maintaining the full audit trail
- Contradiction detection — new evidence is checked against existing canonical notes for conflicts
- Reconsolidation loops — detected contradictions trigger review workflows: suggest update, approve/reject, verify
- Validity windows — notes and COs carry explicit valid_from/valid_to bounds; expired items are automatically caught by maintenance jobs
Connectors
Source system integrations that fetch items updated since a timestamp and ingest them through the full pipeline (persist → extract edges → index → graph sync).
- GitHub — pull requests, commits, issues; fetches items updated since last sync
- Jira — tickets, epics, sprints; normalized into Context Objects with typed edges
- PagerDuty — incidents, alerts, escalations; with causal chain extraction
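The since-timestamp sync loop shared by the connectors might look like this; `fetch_updated` and the pipeline stage names are stand-ins for the real interfaces:

```python
from datetime import datetime

def sync(connector, pipeline, last_sync: datetime) -> datetime:
    """Fetch items updated since last_sync and run each through the full
    ingestion pipeline: persist -> extract edges -> index -> graph sync.
    Returns the new watermark for the next incremental run."""
    newest = last_sync
    for item in connector.fetch_updated(since=last_sync):
        co = pipeline.persist(item)        # immutable original + CO row
        pipeline.extract_edges(co)         # typed edge extraction
        pipeline.index(co)                 # chunk, embed, hybrid index
        pipeline.graph_sync(co)            # mirror nodes/edges to the graph
        newest = max(newest, item["updated_at"])
    return newest
```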
Design philosophy
- Relevance is not similarity. Two documents can be semantically identical and one can be wrong (superseded, reverted). Relevance is defined by causality, temporal validity, authority, and organizational state.
- Permissions are not a bolt-on. ACL filtering happens before any model sees any content. Graph traversals cannot leak restricted nodes. Every piece of evidence carries provenance.
- Don't just retrieve more. If answer quality degrades as you add context, the system collapses under scale. Aggressive reranking, evidence budgets, and sufficiency-based stop conditions are architectural requirements.
- The graph is the product. The context graph is the durable asset that compounds over time. Every ingested document makes it richer. Every query validates or challenges existing connections. This is the flywheel.
Roadmap
- Now — Python/FastAPI with 27 endpoints. 21-table Postgres schema. OpenSearch hybrid indexing. Neo4j graph store. S3 object storage. 35 typed edges. Agentic retrieval planner with 5 strategies. Truth maintenance. JWT/RBAC auth. 315 passing tests. GitHub/Jira/PagerDuty connectors. Docker Compose.
- Next — OpenAI + cross-encoder embedding/reranking providers. Full Alembic migration suite. Expanded connector coverage (Slack, Confluence, ServiceNow). Contradiction detection with LLM backend. Interactive evidence explorer UI.
- Later — Proactive knowledge maintenance (stale note detection, missing rationale surfacing). Multi-tenant deployment. Evaluation framework with golden question sets. CI/CD with full integration tests.
- Future — Agent-facing API for external AI systems. Webhook-based real-time ingestion. Federated graph queries across tenants. Scale to trillions of tokens without degrading retrieval quality.
The graph is not an implementation detail.
It is the durable asset that compounds over time.
Every ingested document makes it richer. Every query validates or challenges existing connections. Every new edge makes future queries more precise. This is the flywheel.
Get involved
Agentic Data is being built in public. The core platform is functional with 315 passing tests and active development toward production hardening and expanded connector coverage.
Similarity is not relevance.
Causality is.