arun.v

Agentic RAG · Production · Dec 2025 to Feb 2026

PolicyMind

Production four-agent RAG platform with hybrid retrieval, NLI-based hallucination verification, and CI-enforced RAGAS gates.

Senior AI Engineer · Architect
AI / ML · Healthcare · Enterprise SaaS
Citation accuracy: 92.5% · p95 latency: < 800ms · Escalation recall: 100%

Problem

Clinical and policy content has two non-negotiables: every claim needs a citation, and every citation has to point to a passage that actually supports the claim. A naive RAG pipeline will hallucinate citations, retrieve the wrong chunk, or quietly degrade as the corpus grows. Worse, it will pass smoke tests while breaking on the long tail.

Approach

Treat the RAG pipeline as a regular software system with regular software guarantees: contracts, tests, and CI gates.

  • Four-agent LangGraph orchestration. Orchestrator plans the run, Domain Router picks the retrieval strategy and tool set, Synthesis produces the cited answer, and Compliance Guard validates citations against the source via NLI before the response leaves the system. State is a typed DiagnosticSession object, transitions are auditable, and retries are conditional rather than unbounded.
  • Citation-first hybrid retrieval. Qdrant for dense recall, BM25 for lexical recall on policy ids and proper nouns, Reciprocal Rank Fusion to combine, and a cross-encoder rerank at the top of the funnel. Each stage is independently swappable and benchmarked; the fusion step itself is sketched after this list.
  • Refusal as a first-class outcome. The compliance-guard agent enforces NLI-based hallucination verification. If retrieved context does not support an answer, the system escalates rather than guesses. Escalation recall is measured at 100% on the held-out set.
  • Replayable, traced runs. Every production session is logged through Langfuse with the prompt version, retrieval candidates, scores, and final answer. Any output can be reconstructed for audit.
  • CI-gated evaluations. RAGAS metrics run on every PR against a frozen gold set. Thresholds: faithfulness ≥ 88%, context recall ≥ 90%, escalation recall = 100%. A regression on any threshold fails CI.
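
The fusion step from the retrieval bullet above is small enough to show. A minimal sketch, assuming ranked id lists from the dense (Qdrant) and lexical (BM25) stages; the helper name and the k=60 constant are illustrative, not the project's exact code:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each inner list is one retriever's ranking (e.g. Qdrant dense order,
    BM25 lexical order). k=60 is the constant from the original RRF paper;
    it damps the dominance of rank-1 hits.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fused candidates then go to the cross-encoder reranker.
dense = ["d3", "d1", "d7"]    # hypothetical Qdrant order
lexical = ["d1", "d9", "d3"]  # hypothetical BM25 order
candidates = reciprocal_rank_fusion([dense, lexical])
```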

Stack

  • API: Python 3.11, FastAPI.
  • Frontend: React.
  • Models: OpenAI GPT-4o for orchestration and synthesis, text-embedding-3-large for retrieval.
  • Index: Qdrant with HNSW.
  • Eval: RAGAS, plus a custom NLI-based faithfulness checker over citation spans (sketched after this list).
  • Observability: Langfuse traces, replayable end-to-end.
  • CI: GitHub Actions, eval job uses a frozen gold set committed alongside code.
  • Deployment: Docker Compose.
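
For the NLI faithfulness check referenced above, the sketch below shows the general shape: run an off-the-shelf MNLI cross-encoder with the cited passage as premise and the generated claim as hypothesis, and treat anything short of confident entailment as unsupported. The model id, threshold, and helper name are assumptions, not the project's actual checker:

```python
from transformers import pipeline

# Any MNLI-style classifier works here; this model id is just an example.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def claim_is_supported(claim: str, cited_passage: str, threshold: float = 0.9) -> bool:
    """True iff the cited passage entails the claim with high confidence.

    Premise = retrieved passage, hypothesis = generated claim. Anything
    other than confident entailment routes the answer to escalation.
    """
    result = nli({"text": cited_passage, "text_pair": claim})[0]
    return result["label"].lower() == "entailment" and result["score"] >= threshold
```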

Outcomes

  • 92.5% citation accuracy on the held-out gold set.
  • p95 end-to-end latency under 800ms with the four-agent pipeline.
  • 100% escalation recall on the held-out set: every query that required escalation was escalated rather than answered.
  • 167 passing tests, including RAGAS evals, all gating CI; a schematic of the gate follows this list.
  • Zero regressions on the citation metric shipped over the measurement window; the CI gates caught them before merge.
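
As a schematic of how a RAGAS gate can fail a PR, the pytest module below asserts the thresholds listed in Approach against scores emitted earlier in the pipeline. The file name, score path, and fixture are hypothetical:

```python
# tests/test_ragas_gates.py - schematic only, not the project's file.
import json

import pytest

THRESHOLDS = {
    "faithfulness": 0.88,
    "context_recall": 0.90,
    "escalation_recall": 1.00,
}

@pytest.fixture(scope="session")
def eval_scores():
    # Written by the RAGAS job earlier in the CI run, against the frozen gold set.
    with open("eval/ragas_scores.json") as f:
        return json.load(f)

@pytest.mark.parametrize("metric,floor", THRESHOLDS.items())
def test_metric_meets_threshold(eval_scores, metric, floor):
    # Any single metric below its floor fails the PR.
    assert eval_scores[metric] >= floor, f"{metric} regressed below {floor}"
```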

Lessons

  • Treat the eval set as production data: version it, review it like code, never overwrite it.
  • Faithfulness is a different metric from "looks right". A model that paraphrases the source confidently can still pass casual review and fail the metric.
  • A refusal path is cheaper than a wrong answer with a citation.
  • Build agentic systems by wiring deterministic state machines around the LLM, not by handing the LLM control of the loop; a minimal shape of that wiring is sketched below.
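
A minimal shape of that wiring in LangGraph, reduced to a two-node slice with a bounded retry edge. The node bodies are stubs and the routing values are illustrative; only the `DiagnosticSession` name comes from the project:

```python
from typing import TypedDict

from langgraph.graph import END, START, StateGraph

class DiagnosticSession(TypedDict):
    query: str
    draft: str
    verified: bool
    retries: int

def route_after_guard(state: DiagnosticSession) -> str:
    # Deterministic routing: the graph, not the LLM, owns the loop.
    if state["verified"]:
        return "done"
    if state["retries"] < 2:   # conditional, bounded retry
        return "retry"
    return "escalate"          # refusal as a first-class outcome

graph = StateGraph(DiagnosticSession)
graph.add_node("synthesis", lambda s: {"draft": f"cited answer to {s['query']}"})
graph.add_node("compliance_guard", lambda s: {"verified": False, "retries": s["retries"] + 1})
graph.add_node("escalate", lambda s: {"draft": "escalated to human review"})
graph.add_edge(START, "synthesis")
graph.add_edge("synthesis", "compliance_guard")
graph.add_conditional_edges("compliance_guard", route_after_guard,
                            {"done": END, "retry": "synthesis", "escalate": "escalate"})
graph.add_edge("escalate", END)
app = graph.compile()

# app.invoke({"query": "...", "draft": "", "verified": False, "retries": 0})
```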

Stack

Python 3.11 · OpenAI GPT-4o · Qdrant · BM25 + RRF · Cross-Encoder Reranking · FastAPI · React · PostgreSQL · Langfuse · Docker Compose · RAGAS · GitHub Actions

Highlights

  • Architected a four-agent LangGraph system (Orchestrator, Domain Router, Synthesis, Compliance Guard) with a typed `DiagnosticSession` state object, ReAct-style tool-calling, conditional retry routing, and full execution auditability.
  • Citation-first hybrid retrieval (Qdrant + BM25 + RRF + cross-encoder rerank) holds 92.5% citation accuracy with p95 latency under 800ms across the production query mix.
  • Compliance-guard agent enforces NLI-based hallucination verification with 100% recall on escalation-critical queries, refusing rather than guessing when the retrieved context does not support an answer.
  • Containerized deployment with Docker, Langfuse tracing for replayable production runs, 167 passing tests, and CI-enforced RAGAS gates: faithfulness ≥ 88%, context recall ≥ 90%, escalation recall = 100%.