Agentic Data
An enterprise context graph and agentic retrieval planner that turns fragmented institutional knowledge into a temporal, permissioned evidence graph — then assembles exactly what an AI agent needs through an iterative, budget-aware retrieval loop.
Retrieval is not a feature of an AI agent. It is the core product.
The problem
Enterprise knowledge is dying. Every organization generates a continuous stream of institutional knowledge — pull requests, incident timelines, RCA documents, architecture decisions, customer escalations, deployment logs. These artifacts are scattered across GitHub, Jira, Slack, Confluence, PagerDuty, ServiceNow, and dozens more systems.
The connections between them — this PR caused that incident, which led to this RCA, which changed this decision, which was later reverted — live only in people's heads. When those people leave, the connections disappear. The organization retains its documents but loses its understanding.
This is not a search problem. You can find the documents. What you can't find is the meaning between them.
Not RAG
The retrieval layer we have today — embed, top-k, stuff into prompt — is fundamentally insufficient for enterprise knowledge.
- Documents are not the unit of knowledge. Events are. A PR, an incident, a decision — these are episodes that happened at a specific time, involved specific actors, and had specific consequences.
- Similarity is not relevance. When you ask "why did the payment service go down?", the answer is a causal chain through time, not the 10 most semantically similar paragraphs.
- Retrieval is not a single call. It is an iterative control loop that plans what evidence it needs, searches, evaluates what's missing, expands through graph traversal, and stops when sufficient — or admits that it isn't.
Three interlocking ideas
Every piece of enterprise knowledge is normalized into a typed, timestamped, permissioned Context Object (CO) with actors, entities, summaries, provenance, and validity windows. COs are linked by 35 typed edges spanning causal, decisional, lifecycle, structural, and customer relationships.
Raw data enters at full fidelity, then exists at multiple levels of abstraction: immutable originals in object storage, chunked and embedded segments for hybrid search, and structured canonical Notes that represent "what we believe now" — editable, version-tracked, with evidence citations and confidence scores.
Not embed-and-retrieve. An iterative control loop: interpret task, select strategy, hybrid search with RRF fusion, graph expansion via typed edges, five-component sufficiency scoring, budget-aware stop conditions. The retrieval equivalent of a database query planner.
35 typed edges
Each edge is directional and carries a confidence score (0.0–1.0), a provenance method (deterministic rule, LLM extraction, or human verified), evidence spans pointing back to the source text, and a validity window.
Agentic retrieval loop
Instead of embed(query) → top-k → stuff into prompt, the retrieval planner runs an iterative control loop that hunts for a sufficient evidence graph. The loop is budget-aware (max retrieval calls, graph hops, latency, model calls) and stops under any of three conditions: sufficiency threshold reached, budget exhausted, or diminishing returns detected.
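The loop and its three stop conditions can be sketched as follows. `Budget`, `sufficiency`, and `expand` are illustrative names for this sketch, not the shipped planner API:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    max_calls: int = 8   # cap on retrieval iterations
    max_hops: int = 4    # cap on graph expansion depth

def retrieve(query, search, expand, sufficiency, budget=Budget(),
             threshold=0.8, min_gain=0.02):
    """Iterative, budget-aware retrieval: search, score sufficiency,
    expand through the graph, and stop when one of three conditions hits."""
    evidence, score, calls = [], 0.0, 0
    frontier = search(query)                 # initial hybrid search
    while calls < budget.max_calls:
        calls += 1
        evidence.extend(frontier)
        new_score = sufficiency(query, evidence)
        if new_score >= threshold:           # stop 1: sufficient evidence
            return evidence, "sufficient"
        if new_score - score < min_gain:     # stop 2: diminishing returns
            return evidence, "diminishing_returns"
        score = new_score
        frontier = expand(evidence, max_hops=budget.max_hops)  # graph expansion
    return evidence, "budget_exhausted"      # stop 3: budget spent
```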
Intent-driven strategies
The retrieval planner selects a strategy based on the query's intent, determining which edge types to prioritize during graph expansion.
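As an illustration, strategy selection could be a simple intent-to-edge-types table. The intent names and edge types below are invented for the sketch; the document does not name the five strategies:

```python
# Hypothetical intent → prioritized-edge-types table (all names illustrative).
STRATEGIES = {
    "root_cause":         ["CAUSED_BY", "TRIGGERED", "RESOLVED_BY"],
    "current_state":      ["SUPERSEDES", "DEPRECATES"],
    "decision_rationale": ["DECIDED_IN", "MOTIVATED_BY", "REVISITED_BY"],
    "timeline":           ["PRECEDED_BY", "FOLLOWED_BY"],
    "ownership":          ["OWNED_BY", "ESCALATED_TO"],
}

def select_strategy(intent: str) -> list[str]:
    # Unknown intents fall back to unbiased graph expansion.
    return STRATEGIES.get(intent, [])
```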
What is live now
Agentic Data is operational. The following infrastructure is built and tested.
Architecture
Four data stores, each for what it's best at. FastAPI + SQLAlchemy 2 (async) ties the layers together with JWT auth, RBAC, rate limiting, and Docker Compose for local and staged deployments.
- Postgres — 21 tables: COs, entities, edges, ACLs, audit trails, segments, embeddings
- OpenSearch — co_segments_v1 index with KNN vectors (HNSW, cosine), RRF fusion
- Neo4j — CO nodes, Entity nodes, 35 typed edge relationships, Cypher queries
- S3 — original documents, attachments, diffs, content-addressed keys
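Reciprocal Rank Fusion (RRF), used to merge the lexical and KNN result lists, is compact enough to show in full. This is the standard formulation, not the project's exact code:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids.
    score(d) = sum over lists of 1 / (k + rank_of_d_in_that_list);
    k=60 is the conventional damping constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked well by both retrievers float to the top even when neither list agrees on the exact order.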
Graph query API
Beyond flat search, the graph API provides structured traversals for the questions that actually matter in enterprise operations.
- Expand from seed nodes by typed edges with hop depth control
- Extract a subgraph among a node set, preserving typed connections
- Trace a causal chain from an incident up to 4 hops
- Follow a supersession chain to find the current canonical version
- Reconstruct the decision rationale chain for a service or entity
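The first traversal, typed-edge expansion with hop depth control, can be sketched with an in-memory stand-in for the graph store (the real system issues Cypher against Neo4j):

```python
from collections import deque

def expand_by_type(edges, seeds, edge_types, max_hops=4):
    """Breadth-first expansion from seed node ids along the given edge
    types, up to max_hops. `edges` is an iterable of (src, type, dst)
    triples; an illustrative stand-in for the Neo4j traversal."""
    adj = {}
    for src, etype, dst in edges:
        if etype in edge_types:                  # only follow requested types
            adj.setdefault(src, []).append(dst)
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:                     # hop depth control
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return seen
```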
Truth maintenance
A knowledge system that only accumulates and never revises will inevitably hallucinate with confidence. The canonical memory layer is self-correcting.
- Supersession chains — when a new note replaces an old one, the chain is preserved so you can trace how understanding evolved
- Deprecation markers — disputed or outdated notes are flagged, not deleted, maintaining the full audit trail
- Contradiction detection — new evidence is checked against existing canonical notes for conflicts
- Reconsolidation loops — detected contradictions trigger review workflows: suggest update, approve/reject, verify
- Validity windows — notes and COs carry explicit valid_from/valid_to bounds; expired items are automatically caught by maintenance jobs
Connectors
Source system integrations that fetch items updated since a timestamp and ingest them through the full pipeline (persist → extract edges → index → graph sync).
- GitHub — pull requests, commits, issues; fetches items updated since last sync
- Jira — tickets, epics, sprints; normalized into Context Objects with typed edges
- PagerDuty — incidents, alerts, escalations; with causal chain extraction
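The since-timestamp sync loop shared by the connectors might look like this; `fetch_updated` and the pipeline stage names are stand-ins for the real interfaces:

```python
from datetime import datetime

def sync(connector, pipeline, last_sync: datetime) -> datetime:
    """Fetch items updated since last_sync and run each through the full
    ingestion pipeline: persist -> extract edges -> index -> graph sync.
    Returns the new watermark for the next incremental run."""
    newest = last_sync
    for item in connector.fetch_updated(since=last_sync):
        co = pipeline.persist(item)        # immutable original + CO row
        pipeline.extract_edges(co)         # typed edge extraction
        pipeline.index(co)                 # chunk, embed, hybrid index
        pipeline.graph_sync(co)            # mirror nodes/edges to the graph
        newest = max(newest, item["updated_at"])
    return newest
```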
Design philosophy
- Relevance is not similarity. Two documents can be semantically identical and one can be wrong (superseded, reverted). Relevance is defined by causality, temporal validity, authority, and organizational state.
- Permissions are not a bolt-on. ACL filtering happens before any model sees any content. Graph traversals cannot leak restricted nodes. Every piece of evidence carries provenance.
- Don't just retrieve more. If answer quality degrades as you add context, the system collapses under scale. Aggressive reranking, evidence budgets, and sufficiency-based stop conditions are architectural requirements.
- The graph is the product. The context graph is the durable asset that compounds over time. Every ingested document makes it richer. Every query validates or challenges existing connections. This is the flywheel.
Roadmap
- Now — Python/FastAPI with 27 endpoints. 21-table Postgres schema. OpenSearch hybrid indexing. Neo4j graph store. S3 object storage. 35 typed edges. Agentic retrieval planner with 5 strategies. Truth maintenance. JWT/RBAC auth. 315 passing tests. GitHub/Jira/PagerDuty connectors. Docker Compose.
- Next — OpenAI + cross-encoder embedding/reranking providers. Full Alembic migration suite. Expanded connector coverage (Slack, Confluence, ServiceNow). Contradiction detection with LLM backend. Interactive evidence explorer UI.
- Later — Proactive knowledge maintenance (stale note detection, missing rationale surfacing). Multi-tenant deployment. Evaluation framework with golden question sets. CI/CD with full integration tests.
- Future — Agent-facing API for external AI systems. Webhook-based real-time ingestion. Federated graph queries across tenants. Scale to trillions of tokens without degrading retrieval quality.
The graph is not an implementation detail.
It is the durable asset that compounds over time.
Every ingested document makes it richer. Every query validates or challenges existing connections. Every new edge makes future queries more precise. This is the flywheel.
Get involved
Agentic Data is being built in public. The core platform is functional with 315 passing tests and active development toward production hardening and expanded connector coverage.
Similarity is not relevance.
Causality is.