LLM as Lossy Parser: Constrained Decoding Prevents Hallucinations

Treat LLMs solely as schema-conformant parsers for unstructured clinical notes, never as decision-makers. Compile Pydantic models into finite-state machines with Outlines or XGrammar, which mask invalid tokens at every decoding step. Enum fields such as VitalSignCode can then only take legal values (e.g., "8867-4"), so malformed JSON and out-of-vocabulary codes are structurally impossible. Constrained decoding guarantees well-formedness, not truth; the grounding check below catches value-level hallucinations.
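
A minimal sketch of such a schema, using the permissive Optional fields discussed next (field names are illustrative, not necessarily the article's exact RawObservation):

from enum import Enum
from pydantic import BaseModel

class VitalSignCode(str, Enum):
    HEART_RATE = "8867-4"        # LOINC code for heart rate
    BODY_TEMPERATURE = "8310-5"  # LOINC code for body temperature

class RawObservation(BaseModel):
    subject_id: str | None = None        # None when the note gives no usable ID
    vs_code: VitalSignCode | None = None
    value_numeric: float | None = None
    unit: str | None = None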

Make schemas permissive with Optional fields (e.g., subject_id: str | None) so the LLM can emit null for uncertain data. This yields honest extractions: filled fields are guaranteed schema-valid, and nulls route to downstream Python logic or human review. Example:

import outlines
from schemas.observation import RawObservation

# Outlines 0.x API, matching the pinned outlines==0.0.46
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.3")
generator = outlines.generate.json(model, RawObservation, sampler=outlines.samplers.greedy())

# prompt = extraction instructions plus the raw note text (construction not shown)
raw_obs: RawObservation = generator(prompt, max_tokens=512)

Post-extraction, verify grounding by checking if emitted numerics/subject_ids appear as substrings in source text, rejecting ungrounded outputs.
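
A sketch of that grounding check, reusing the illustrative RawObservation above:

def is_grounded(obs: RawObservation, source_text: str) -> bool:
    """Reject extractions whose literal values never appear in the note."""
    if obs.subject_id is not None and obs.subject_id not in source_text:
        return False
    # ":g" renders 72.0 as "72" so integer vitals match the note verbatim
    if obs.value_numeric is not None and f"{obs.value_numeric:g}" not in source_text:
        return False
    return True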

Deterministic Python Core: Compute and Validate Without LLMs

Offload all logic to auditable Python: unit conversions (e.g., Fahrenheit to Celsius via (F - 32) × 5/9), LOINC lookups (plain dicts), plausibility checks (ranges such as heart rate 40-200 bpm), and deduplication (SHA-1 keys). Validators are named functions with stable rule_ids:

@rule("VS-003", FindingSeverity.WARN, "value_numeric", "Heart rate sanity range")
def check_hr_range(obs: Observation, report: ValidationReport) -> None:
    if obs.vs_code == VitalSignCode.HEART_RATE:
        if not (40 <= obs.value_numeric <= 200):
            report.add(ValidationFinding(rule_id="VS-003", ...))
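
The non-validator logic is just as mechanical; a minimal sketch of the conversion, lookup, and dedup helpers named above (the canonical dedup string is an assumption):

import hashlib

LOINC_NAMES = {"8867-4": "Heart rate", "8310-5": "Body temperature"}  # plain dict lookup

def fahrenheit_to_celsius(f: float) -> float:
    return (f - 32) * 5 / 9  # pure function: same input, same output, every run

def dedup_key(subject_id: str, vs_code: str, value: float, ts: str) -> str:
    # SHA-1 over a canonical field string; exact duplicates collapse to one key
    canonical = f"{subject_id}|{vs_code}|{value:g}|{ts}"
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()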

Validators set needs_judge on roughly 15% of records based on accumulated WARN/ERROR findings, and because the core is pure Python, every re-run is bit-identical for audits.
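
The escalation decision itself is one deterministic comparison; a sketch, assuming ValidationReport exposes its accumulated findings:

def needs_judge(report: ValidationReport) -> bool:
    # Escalate only when at least one finding is WARN or worse
    return any(
        f.severity in (FindingSeverity.WARN, FindingSeverity.ERROR)
        for f in report.findings
    )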

Conditional LLM Judge and HITL: Scale Safely at Low Cost

Invoke a cheap judge (e.g., Claude Haiku) only on flagged records, using constrained tool calls: roughly 85% of records skip the judge at $0, and the remaining ~15% cost about $0.001 each, netting ~$0.15 per 1,000 records. Judge outputs must match a JSON schema; low confidence (<0.4) or an explicit human_review flag routes the record to HITL.
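
A sketch of that gate using the Anthropic messages API with a forced tool call (the tool schema, model id, and the render_packet helper are illustrative assumptions):

import anthropic

JUDGE_TOOL = {
    "name": "record_verdict",
    "description": "Structured verdict for one flagged clinical observation.",
    "input_schema": {
        "type": "object",
        "properties": {
            "verdict": {"type": "string", "enum": ["accept", "reject"]},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            "human_review": {"type": "boolean"},
        },
        "required": ["verdict", "confidence", "human_review"],
    },
}

client = anthropic.Anthropic()

def judge_flagged(judge_prompt: str) -> dict:
    # tool_choice forces the reply through the tool, so the output is
    # always an argument dict matching input_schema
    resp = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=256,
        temperature=0.0,
        tools=[JUDGE_TOOL],
        tool_choice={"type": "tool", "name": "record_verdict"},
        messages=[{"role": "user", "content": judge_prompt}],
    )
    tool_use = next(b for b in resp.content if b.type == "tool_use")
    return tool_use.input

if needs_judge(report):                      # ~15% of records reach this point
    verdict = judge_flagged(render_packet(record, report))  # hypothetical renderer
    if verdict["confidence"] < 0.4 or verdict["human_review"]:
        ...  # route to HITL, as described next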

HITL triggers are validator ERRORs (urgent), judge low confidence or unavailability, and explicit judge requests, covering roughly 2% of records. HITL uses append-only JSONL queues of ReviewPackets (input and output side by side, findings, audit chain). Humans approve (ESignature), reject, or amend with controlled reason codes (e.g., transcription_error); originals are preserved via hash-chained Amendments.
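
Appending to the queue needs nothing beyond opening the file in append mode; a minimal sketch, assuming the ReviewPacket is serialized to a dict:

import json
from datetime import datetime, timezone

def enqueue_review_packet(packet: dict, path: str = "review_queue.jsonl") -> None:
    # Append-only: packets are added, never rewritten, so originals survive
    packet["enqueued_at"] = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(packet, sort_keys=True) + "\n")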

Run all LLMs at temperature=0.0 and fixed seed=42 for reproducibility.

Inherent ALCOA++/21 CFR Part 11 Compliance via Data Structures

Every LLM-touched record logs AuditEvents carrying input/output hashes, source excerpts, model snapshots (e.g., mistralai/Mistral-7B-Instruct-v0.3, outlines==0.0.46, prompt_hash), actor, UTC timestamp, and a 7-year retention marker. Events chain via prev_hash/chain_hash for tamper-evident trails; regulators can simply tail the JSONL during audits.
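
The chain itself is a few lines; a sketch, assuming SHA-256 for the digest (the article does not pin the chain algorithm):

import hashlib, json

def append_audit_event(event: dict, log_path: str, prev_hash: str) -> str:
    event["prev_hash"] = prev_hash
    body = json.dumps(event, sort_keys=True)  # canonical form, prev_hash included
    event["chain_hash"] = hashlib.sha256(body.encode("utf-8")).hexdigest()
    # Editing any earlier line changes its chain_hash, which then fails to
    # match the prev_hash recorded by the line that follows it
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")
    return event["chain_hash"]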

Amendments link back via prev_chain_hash, and e-signatures bind full ReviewPackets. This satisfies ALCOA++ (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available, and Traceable) and 21 CFR Part 11 (§11.10 validation, §11.10(e) audit trails) in ~250 lines of Python, making traceability a hashed event stream rather than a stack of documents.

The approach rejects agents for regulated domains: LLMs as components under Python and human authority, not drivers.