arun.v

Healthcare RCM · Agentic AI · 2026

Denial Triage & Appeal Drafting Engine

Agentic RCM reference for podiatry billing: parses 835 ERA, classifies denials, drafts citation-grounded appeals, queues for human coder review.

Self-Directed Reference Build · Solo
AI / ML · Healthcare
Autonomous submission: never (human-in-loop)
Compliance guard: veto on hallucinated citation
Audit: append-only log + replay CLI

Problem

Podiatry-billing denials concentrate in a small number of categories where the recoverable revenue per practice is high enough to justify dedicated tooling. Routine-foot-care denials alone reach into the low five figures per practice per year, often because of missing at-risk Q-modifiers or exceeded frequency limits. Modifier 25 and global-period bundling denials are a second, growing category under payer AI scrutiny in 2026. The appeal-drafting workflow today is manual, repetitive, and slow.

Approach

  • Spec-driven, one spec per EPIC. The 835 ERA parser, denial classifier, recoverability scorer, appeal drafter, Compliance Guard, audit log, and coder review queue each have a markdown spec under specs/. Implementation follows the spec, not the other way around.
  • Deterministic parser, agentic everything else. The X12 5010 835 ERA parser lives in its own package with no model calls, so the wire-level surface is independently testable. The agents operate on the parsed denial set, never on raw 835 lines.
  • Typed denial taxonomy. CARC and RARC reference codes plus a podiatry-specific denial mapping live in a typed package. Mapping changes go through review and version-bump.
  • Hybrid retrieval over policy guidance. Qdrant for dense recall, BM25 for exact-match recall on CARC and RARC identifiers, Reciprocal Rank Fusion to combine. Appeal drafts cite the retrieved evidence; uncited paragraphs do not ship.
  • Compliance Guard with veto power. NLI verification on every cited claim. A draft with a hallucinated citation never reaches the queue.
  • One audited LLM wrapper. Every model call goes through one function that logs prompt, response, model, latency, tokens, and cost. There is no ad-hoc OpenAI client elsewhere in the codebase.
  • Versioned prompts. Prompts live in files under packages/agents/prompts/, never inline. Prompt changes are reviewable, diffable, and revertable.
  • Append-only audit with replay. Every agent decision flows into an append-only log. A replay CLI reconstructs the agent state from the log for any session.
  • No autonomous submission. The system drafts and queues. A human coder submits. There is no payer, clearinghouse, or EHR write integration, and no plan to acquire one.
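The deterministic-parser boundary above can be sketched as pure string handling over X12 delimiters, with no model calls anywhere in the path. The `Denial` record and the element positions below are a simplified illustration, not the production schema: real 835s carry many more CLP/CAS elements, repeated adjustment triples, and configurable delimiters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Denial:
    claim_id: str
    group_code: str   # e.g. CO (contractual obligation)
    carc: str         # Claim Adjustment Reason Code
    amount: float

def parse_835_denials(era: str) -> list[Denial]:
    """Extract CAS adjustments per claim from a raw 835 ERA string.

    Deterministic: pure string splitting on X12 delimiters, so it is
    independently testable with no agent in the loop. Assumes the common
    '~' segment and '*' element separators.
    """
    denials: list[Denial] = []
    claim_id = ""
    for seg in era.split("~"):
        parts = seg.strip().split("*")
        if parts[0] == "CLP":      # claim-level payment information
            claim_id = parts[1]
        elif parts[0] == "CAS":    # claim adjustment: group code, CARC, amount
            denials.append(Denial(claim_id, parts[1], parts[2], float(parts[3])))
    return denials
```

The agents then consume the resulting `list[Denial]`, never raw 835 lines, which is what keeps the wire-level surface replaceable.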
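The fusion step of the hybrid-retrieval bullet is small enough to show in full. This is a minimal Reciprocal Rank Fusion sketch; the document IDs are hypothetical stand-ins for Qdrant (dense) and BM25 (sparse) result lists, and `k = 60` is the conventional RRF constant:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

dense = ["policy_197", "policy_16", "policy_45"]   # hypothetical Qdrant hits
sparse = ["policy_16", "policy_96"]                # hypothetical BM25 hits
fused = rrf_fuse([dense, sparse])                  # policy_16 ranks first
```

BM25 carries the exact-match load for CARC/RARC identifiers, which embeddings tend to blur; RRF lets both signals vote without score calibration.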
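The Compliance Guard's veto reduces to a gate over cited paragraphs. The sketch below stubs the NLI model behind an `entails` callable; the `Paragraph` shape and function names are illustrative, not the production API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Paragraph:
    text: str
    citation_id: Optional[str]  # id of the retrieved evidence chunk, if any

def guard_passes(
    draft: list[Paragraph],
    retrieved_ids: set[str],
    entails: Callable[[str, str], bool],
) -> bool:
    """Veto logic: every paragraph must cite a chunk that was actually
    retrieved, and the NLI check (stubbed via `entails`) must confirm the
    chunk supports the claim. Any failure vetoes the whole draft."""
    for p in draft:
        if p.citation_id is None or p.citation_id not in retrieved_ids:
            return False  # uncited paragraph or hallucinated citation
        if not entails(p.citation_id, p.text):
            return False  # cited chunk does not entail the claim
    return True
```

The important property is that there is no bypass parameter: a failing draft simply never enqueues.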
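The single audited wrapper is mostly plumbing, which is the point: one chokepoint, one log shape. A minimal sketch, with the real client injected as a callable and an in-memory list standing in for the append-only store (names and log fields here are illustrative):

```python
import time
from typing import Callable

def audited_call(
    model: str,
    prompt: str,
    complete: Callable[[str, str], str],
    log: list[dict],
) -> str:
    """Every model call routes through here: the prompt, response, model,
    and latency are recorded before the response is returned. Token and
    cost fields would be appended the same way from the client's usage data."""
    start = time.perf_counter()
    response = complete(model, prompt)
    log.append({
        "model": model,
        "prompt": prompt,
        "response": response,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
    })
    return response

def fake_complete(model: str, prompt: str) -> str:
    return "DRAFT: appeal text (synthetic)"  # canned stand-in response

audit_log: list[dict] = []
out = audited_call("gpt-4o", "Draft an appeal for CARC 50", fake_complete, audit_log)
```

Because the log entries are ordered and append-only, a replay CLI can reconstruct agent state by folding over them.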

Stack

  • Orchestration: LangGraph for the denial-classification and appeal-drafting agents. Deterministic Python for the X12 parser.
  • Models: OpenAI GPT-4o for synthesis, text-embedding-3-large for dense retrieval.
  • Retrieval: Qdrant for dense, BM25 for sparse, Reciprocal Rank Fusion to combine.
  • Storage: PostgreSQL for the review queue and audit log, Qdrant for embeddings.
  • API: FastAPI for the review-queue service.
  • UI: React 18 with Vite and TypeScript for the coder UI.
  • Eval: RAGAS plus classifier-F1 and guard-recall harnesses, PR-blocking in CI.
  • Tooling: ruff, mypy strict, pytest, GitHub Actions, Docker Compose for local Postgres and Qdrant.

Outcomes

This is a working reference implementation for the agentic-RCM pattern: spec-driven layout, deterministic wire-level parsing isolated from the agentic surface, citation-grounded drafting under a Compliance Guard veto, and a human-in-the-loop review queue.

  • All test data is obviously synthetic (e.g. TEST_PATIENT_001). The system has not been validated on real PHI and is not designed for real-PHI processing without a BAA-backed compliance pass.
  • Two highest-recoverable denial categories scoped first; adjacent categories (prior auth missing, eligibility, coding error, documentation insufficient, timely filing) tracked in the taxonomy but deferred.
  • Demo available on request.

Lessons

  • Separating the deterministic 835 parser from the agentic surface buys independent testability and replaceability on each side. Mixing them is a future-rewrite tax.
  • A single audited LLM wrapper is the cheapest observability decision available. The first time a cost or token-budget incident happens, the wrapper pays for itself.
  • A draft that fails the Compliance Guard should not exist on the queue. A bypass flag is technical debt with a clinical-safety price tag.
  • Versioned prompts in files (not inline) make prompt regressions diffable and revertable. Prompt-as-code, not prompt-as-text.

Stack

Python 3.11 · FastAPI · LangGraph · OpenAI GPT-4o · text-embedding-3-large · Qdrant · PostgreSQL · BM25 + Dense Retrieval (RRF) · X12 5010 (835 ERA) · RAGAS · React 18 + Vite + TypeScript · Docker Compose · GitHub Actions · mypy strict · ruff · pytest

Highlights

  • Spec-driven reference build for podiatry-billing RCM: one markdown spec per EPIC, CI-enforced quality gates, append-only audit log with replay CLI, and a single audited LLM wrapper that records prompt, response, model, latency, tokens, and cost on every call.
  • Deterministic 835 X12 5010 parser packaged separately from the agents, so the wire-level parsing surface is independently testable and replaceable. CARC and RARC reference codes plus a domain-specific denial mapping live in a typed taxonomy package.
  • Agents draft appeals; humans submit them. No autonomous submission to any payer or clearinghouse. The Compliance Guard has veto power: a draft with a hallucinated citation never reaches the coder review queue.
  • Two highest-recoverable denial categories scoped first: routine-foot-care denials (often missing at-risk Q-modifiers or exceeding frequency limits) and Modifier 25 / global-period bundling under increasing payer AI scrutiny.