Standalone Agents Are Not Reliable Enough for Interruptions

Pure agent setups for recurring tasks such as security vulnerability alerts often underperform because these tasks demand near-perfect accuracy, which agents alone do not deliver. In one case, an agent processed GitHub Dependabot webhooks, filtered them to critical issues, used GitHub APIs (checking CODEOWNERS files and recent commits) to assign owners, and piped the results to Slack. Despite prompt iterations with "CRITICAL: you must..." directives and model upgrades from GPT-4o-mini ("GPT 4.1") to GPT-4o ("GPT 5.4"), it still included high and medium severity alerts roughly 20-30% of the time. That error rate blocked rollout: false positives would spam teams, and unlike a human, the agent is not deterred by feedback over DM. Verification evals (e.g., checking the output for non-critical mentions) add cost without fully solving the problem, especially for unstructured tasks that have no tests or linting to check against.
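As an illustration of the kind of verification eval mentioned above, here is a minimal sketch that flags an agent-written message if it mentions a non-critical severity. The function name, the regular expression, and the list of severity words are assumptions made for this example; the source does not describe how its check was built.

```python
import re

# Hypothetical pattern: a non-critical severity word next to the word "severity".
# Free-form agent output can phrase this in many other ways, which is one reason
# such checks add cost without guaranteeing a clean result.
_NON_CRITICAL = re.compile(
    r"\b(?:high|medium|moderate|low)\b[\s-]*severity"
    r"|severity\s*[:=]?\s*(?:high|medium|moderate|low)\b",
    re.IGNORECASE,
)


def mentions_non_critical(message: str) -> bool:
    """Return True if the message appears to report a non-critical alert."""
    return bool(_NON_CRITICAL.search(message))
```

A check like this only catches the phrasings it anticipates, so it narrows the failure rate without bringing it to zero.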

Code-Driven Scaffolding Delivers 100% Reliability

Shift to code-first workflows in which software manages flow control by default and invokes agents only for non-deterministic steps such as ownership inference. Feed Dependabot webhooks into a scripted pipeline: code deterministically filters to critical severity only, then calls the agent solely for owner lookup via GitHub APIs. After the agent returns, code validates the result and posts to Slack. Because the severity filter is plain code, non-critical alerts can no longer slip through; the hybrid runs correctly every time, which enables aggressive rollout without babysitting. The same split extends to weekly pings on open critical alerts. The next step is auto-generating fixes for human review and merge, building on the existing Dependabot PR auto-merge (which solves a subset of issues).
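The split described above can be sketched roughly as follows. This is an illustrative outline, not the pipeline from the source: `handle_webhook`, `infer_owner`, `known_owners`, `post`, and `FALLBACK_OWNER` are hypothetical names, and the payload fields are assumed to follow GitHub's `dependabot_alert` webhook shape, which should be checked against the current GitHub documentation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set

FALLBACK_OWNER = "security-team"  # hypothetical default assignee


@dataclass
class Notification:
    repo: str
    package: str
    owner: str
    url: str


def is_critical(payload: dict) -> bool:
    """Deterministic severity gate: no model is involved in this decision."""
    advisory = payload.get("alert", {}).get("security_advisory", {})
    return str(advisory.get("severity", "")).lower() == "critical"


def handle_webhook(
    payload: dict,
    infer_owner: Callable[[str, str], str],  # hypothetical wrapper around the agent
    known_owners: Set[str],                  # e.g. handles parsed from CODEOWNERS
    post: Callable[[str], None],             # hypothetical wrapper around Slack
) -> Optional[Notification]:
    # Step 1 (code): drop everything that is not critical; the agent is never called.
    if not is_critical(payload):
        return None

    alert = payload["alert"]
    repo = payload["repository"]["full_name"]
    package = alert["dependency"]["package"]["name"]

    # Step 2 (agent): the only judgment call is who should own the alert.
    suggested = infer_owner(repo, package)

    # Step 3 (code): validate the agent's answer before anything is posted.
    owner = suggested if suggested in known_owners else FALLBACK_OWNER

    note = Notification(repo, package, owner, alert.get("html_url", ""))
    post(f"Critical alert in {repo}: {package}. Owner: {owner}. {note.url}")
    return note
```

Passing the agent and the Slack client in as plain callables keeps the control flow testable without network access, and an agent mistake can at worst produce a fallback owner, never a non-critical alert in the channel.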

Pattern Scales Across Recurring Tasks

The pattern generalizes as: (1) collect logs of recent human and agent runs, along with their prompts and instructions; (2) prompt Claude or Codex for a code-scaffolded replacement (a prototype takes minutes); (3) use deterministic code for all verifiable steps (filtering, sequencing); (4) reserve agents for judgment calls (e.g., ownership). Benefits: 100% reliability on the deterministic steps instead of agent flakiness, lower cost (fewer tokens), and easier maintenance and debugging. Applied 5-6 months after the initial agent (1 month after the model upgrade), the approach eliminated manual dashboard checks and minimized human handoffs for this process.
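One way to express steps (3) and (4) in reusable form is a small wrapper that lets code decide whether an agent's answer is acceptable. The sketch below is an assumption-laden illustration: `judgment_step`, its parameters, and the bounded-retry behavior are hypothetical and are not described in the source.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def judgment_step(
    ask_agent: Callable[[], T],     # hypothetical: one agent call, returns its answer
    is_valid: Callable[[T], bool],  # deterministic check owned by the code
    fallback: T,                    # value used when no answer passes the check
    max_attempts: int = 2,          # assumed retry budget
) -> T:
    """Run an agent call under code control and never return an unchecked answer."""
    for _ in range(max_attempts):
        try:
            answer = ask_agent()
        except Exception:
            # Treat an agent or network failure like an invalid answer and retry.
            continue
        if is_valid(answer):
            return answer
    return fallback
```

Everything outside `ask_agent` is ordinary code, so sequencing, retries, and the final fallback behave the same way on every run.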