Trace, Eval, Prompt Iterate: Jira Bot to Prod Agent in 2 Weeks
Instrument prototypes with tracing day one to expose issues, write binary evals for failure modes before fixes, manage prompts remotely to iterate without redeploys—turning vibe-coded bots into reliable agents via the Agent Development Flywheel.
Instrument Agents Early for Precise Diagnosis
Tracing from day one via OpenTelemetry and Arthur Engine revealed the vibe-coded Jira bot's single-shot LLM-to-JSON limitations: hardcoded logic, no tool use or reasoning. This exposed three key failure modes without guesswork—ADF formatting errors (Markdown rendered as raw text in Jira), priority over-assignment (dev bugs tagged high like outages), and incomplete tickets missing repro steps, impact, environment details. Early visibility, as in Arthur's Part 1 best practices, enables confident shipping by showing exactly what agents do.
Target Failure Modes with Binary Evals Before Changes
Before prompt tweaks, define evals mapping to requirements: one verifies ADF in descriptions, another checks priority justification from Slack context, third confirms presence of repro steps, impact, environment. Keep evals binary pass/fail for objective measurement against real traces. This pre-change baseline, per Part 3 practices, prevents unverified fixes and catches regressions—e.g., post-refactor evals flagged forgotten ADF instructions and missing priority logic, fixed via prompt adds like "reserve high priority for high-impact issues."
Refactor to Tools and Remote Prompts for Fast Cycles
Shift from one-shot prompts to agentic flow: system prompt for ticket structure, editable tool descriptions (e.g., for Jira API calls), no code redeploys needed. Arthur Engine's prompt management versions changes, decoupling iteration from releases (Part 2 principle). Post-refactor, agent reasons over tools, asks clarifying questions for complete tickets—saving hours weekly while evals (Part 4) validate improvements instantly.
Agent Development Flywheel Scales Any Use Case
Cycle: Instrument → Write evals → Iterate prompts remotely → Validate with evals. Applied to simple Slack-to-Jira bot, it produced production-grade tracing, continuous checks, versioned prompts in two weeks. Handles internal tools or customer agents equally, moving beyond vibe-coding guesswork.