The Three-Part Self-Improvement Loop
To move beyond slow, manual iteration cycles, teams can build agents that learn from production data by implementing a loop with three parts:
- Practitioner-Led Steering: Experts in the field (e.g., accountants) provide the intuition needed to distinguish between actual system errors, user preferences, and expected workflow noise. Their corrections serve as the primary signal for system improvement.
- Production Traces as Evidence: The system must capture the full lifecycle of a task, from raw source material and extraction through final submission and expert modification. These traces allow engineers to pinpoint the stage at which a failure occurred.
- Codex-Driven Iteration: Once failures are identified and grouped, they are converted into structured evaluation targets. Codex then investigates the root cause, proposes code changes, validates them against regression suites, and generates candidate pull requests for human review.
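A minimal sketch of a trace that covers the lifecycle described above and locates the stage where a value first went wrong. The stage names, the Trace and TraceEvent classes, and the field name "wages" are all hypothetical, and the sketch assumes each stage records a normalized value per field; the original does not describe its trace format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical stage names, ordered to follow the lifecycle in the text.
STAGES = ("source", "extraction", "submission", "expert_modification")


@dataclass
class TraceEvent:
    stage: str              # one of STAGES
    field_name: str         # illustrative, e.g. "wages"
    value: Optional[str]    # normalized value recorded at this stage


@dataclass
class Trace:
    task_id: str
    events: List[TraceEvent] = field(default_factory=list)

    def record(self, stage: str, field_name: str, value: Optional[str]) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.events.append(TraceEvent(stage, field_name, value))

    def value_at(self, stage: str, field_name: str) -> Optional[str]:
        # Return the most recent value recorded for this field at this stage.
        for event in reversed(self.events):
            if event.stage == stage and event.field_name == field_name:
                return event.value
        return None

    def first_divergence(self, field_name: str) -> Optional[str]:
        """Earliest stage whose value differs from the expert's final value."""
        final = self.value_at("expert_modification", field_name)
        if final is None:
            return None  # no expert correction recorded for this field
        for stage in STAGES[:-1]:
            if self.value_at(stage, field_name) != final:
                return stage
        return None


trace = Trace("task-1")
trace.record("source", "wages", "1200")
trace.record("extraction", "wages", "120")
trace.record("submission", "wages", "120")
trace.record("expert_modification", "wages", "1200")
print(trace.first_divergence("wages"))
```

Here the source and the expert agree on "1200", so the first stage that disagrees with the expert is extraction. A real trace would likely carry richer payloads (documents, timestamps, model versions); this only illustrates why capturing every stage makes the failure point recoverable.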
Turning Failures into Engineering Tasks
Rather than forcing every correction through an automated loop, the system uses a tiered approach to ensure safety and quality. When practitioners correct an agent's output, the system compares the predicted value against the final filed value to generate "review rows."
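The comparison step can be sketched as a field-by-field diff between what the agent predicted and what was finally filed. The ReviewRow shape, the build_review_rows function, and the field names are assumptions made for illustration; the original does not specify what a review row contains.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ReviewRow:
    task_id: str
    field_name: str
    predicted: Optional[str]  # None if the agent produced no value
    filed: Optional[str]      # None if the practitioner removed the field


def build_review_rows(
    task_id: str,
    predicted: Dict[str, str],
    filed: Dict[str, str],
) -> List[ReviewRow]:
    """Emit one row per field where the prediction differs from the filed value."""
    rows = []
    for name in sorted(set(predicted) | set(filed)):
        p = predicted.get(name)
        f = filed.get(name)
        if p != f:
            rows.append(ReviewRow(task_id, name, p, f))
    return rows


# Hypothetical example: the agent missed one field and got another wrong.
predicted = {"wages": "1200", "federal_tax": "300"}
filed = {"wages": "1200", "federal_tax": "310", "state_tax": "50"}
for row in build_review_rows("task-1", predicted, filed):
    print(row)
```

Matching fields produce no row, so only practitioner corrections flow downstream. Whether a given row reflects a system error, a preference, or workflow noise is still a judgment the practitioner signal has to supply; a plain diff cannot tell them apart.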
These rows are grouped to identify recurring patterns (e.g., consistently missing a specific tax field). Once a pattern is established, it becomes a "hill to climb" for Codex. Codex operates in a bounded environment that separates the writable product code from read-only production context, allowing it to:
- Analyze source packages and extraction schemas.
- Update mappers or graders to account for workflow noise.
- Run targeted evals to validate fixes before they reach production.
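As a rough illustration of the grouping and targeted-eval steps, the sketch below buckets review rows by field and mismatch type, keeps only buckets that recur, and scores a candidate fix against one bucket. Every name here (ReviewRow, mismatch_kind, group_rows, targeted_eval, the min_count threshold, and the candidate callable that maps a task ID to a predicted value) is hypothetical and not taken from the original system.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class ReviewRow(NamedTuple):
    task_id: str
    field_name: str
    predicted: Optional[str]
    filed: Optional[str]


def mismatch_kind(row: ReviewRow) -> str:
    if row.predicted is None:
        return "missing"      # agent produced nothing for this field
    if row.filed is None:
        return "spurious"     # agent produced a value the expert removed
    return "wrong_value"


def group_rows(
    rows: List[ReviewRow], min_count: int = 2
) -> Dict[Tuple[str, str], List[ReviewRow]]:
    """Keep only (field, mismatch kind) groups that recur at least min_count times."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row.field_name, mismatch_kind(row))].append(row)
    return {key: grp for key, grp in groups.items() if len(grp) >= min_count}


def targeted_eval(
    rows: List[ReviewRow], candidate: Callable[[str], Optional[str]]
) -> float:
    """Fraction of rows where the candidate's value matches the filed value."""
    if not rows:
        return 0.0
    passed = sum(1 for row in rows if candidate(row.task_id) == row.filed)
    return passed / len(rows)


rows = [
    ReviewRow("t1", "state_tax", None, "50"),
    ReviewRow("t2", "state_tax", None, "75"),
    ReviewRow("t3", "wages", "100", "1000"),
]
patterns = group_rows(rows)

# Stand-in for re-running a patched extractor on each task (assumed outputs).
candidate_outputs = {"t1": "50", "t2": "70"}
score = targeted_eval(patterns[("state_tax", "missing")], candidate_outputs.get)
print(sorted(patterns), score)
```

The single wages mismatch is dropped because it does not recur, while the two missing state_tax rows form a pattern that a candidate fix can be scored against. In practice the candidate would be the patched product code run inside the bounded environment, and the score would be checked alongside the regression suite before a pull request is proposed.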
This process ensures that Codex works on scoped, evidence-backed tasks rather than vague alerts. Human engineers stay in control of architecture and product strategy, while the development of complex capabilities accelerates.