Trace, Eval, Prompt Iterate: Jira Bot to Prod Agent in 2 Weeks

Instrument prototypes with tracing from day one to expose issues, write binary evals for failure modes before fixing them, and manage prompts remotely to iterate without redeploys, turning vibe-coded bots into reliable agents via the Agent Development Flywheel.

Instrument Agents Early for Precise Diagnosis

Tracing from day one via OpenTelemetry and Arthur Engine revealed the vibe-coded Jira bot's single-shot LLM-to-JSON limitations: hardcoded logic, no tool use, no reasoning. It exposed three key failure modes without guesswork: ADF formatting errors (Markdown rendered as raw text in Jira), priority over-assignment (dev bugs tagged High like outages), and incomplete tickets missing repro steps, impact, and environment details. Early visibility, per the best practices in Arthur's Part 1, enables confident shipping by showing exactly what the agent does.
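To make "tracing from day one" concrete, here is a minimal, dependency-free stand-in for the spans such instrumentation records per agent call. In the actual bot these would be OpenTelemetry spans exported to Arthur Engine; `create_ticket` and its placeholder LLM output are hypothetical illustrations, not the post's code.

```python
import time
from contextlib import contextmanager

TRACE: list[dict] = []  # collected spans, newest last

@contextmanager
def span(name: str, **attributes):
    # Record a named span with attributes and wall-clock duration,
    # mimicking what an OTel span would capture for later inspection.
    record = {"name": name, "attributes": dict(attributes)}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        TRACE.append(record)

def create_ticket(slack_message: str) -> dict:
    with span("create_ticket", slack_message=slack_message) as s:
        # Placeholder for the single-shot LLM-to-JSON call.
        ticket = {"summary": slack_message[:60], "priority": "Medium"}
        s["attributes"]["priority"] = ticket["priority"]
        return ticket
```

With every call recorded in `TRACE`, failure modes like priority over-assignment show up in the span attributes instead of requiring guesswork.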

Target Failure Modes with Binary Evals Before Changes

Before any prompt tweaks, define evals that map to requirements: one verifies ADF in descriptions, another checks that priority is justified by the Slack context, and a third confirms the presence of repro steps, impact, and environment. Keep evals binary pass/fail for objective measurement against real traces. This pre-change baseline, per Part 3's practices, prevents unverified fixes and catches regressions; for example, post-refactor evals flagged forgotten ADF instructions and missing priority logic, fixed via prompt additions such as "reserve high priority for high-impact issues."
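The three binary evals can be sketched as simple pass/fail checks over a traced ticket. The field names (`description`, `priority`, `priority_justification`, `description_text`) are assumptions about the bot's output schema, not Arthur's actual eval API.

```python
REQUIRED_SECTIONS = ("Steps to Reproduce", "Impact", "Environment")

def eval_adf(ticket: dict) -> bool:
    # ADF descriptions are JSON documents (type "doc"), not raw Markdown strings.
    desc = ticket.get("description")
    return isinstance(desc, dict) and desc.get("type") == "doc"

def eval_priority(ticket: dict) -> bool:
    # High priority must carry an explicit justification from Slack context.
    if ticket.get("priority") != "High":
        return True
    return bool(ticket.get("priority_justification"))

def eval_completeness(ticket: dict) -> bool:
    # All required sections must appear in the rendered description text.
    text = ticket.get("description_text", "")
    return all(section in text for section in REQUIRED_SECTIONS)
```

Because each check returns only True or False, running the suite over real traces yields an unambiguous pass rate to compare before and after every prompt change.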

Refactor to Tools and Remote Prompts for Fast Cycles

Shift from one-shot prompts to an agentic flow: a system prompt for ticket structure, editable tool descriptions (e.g., for Jira API calls), and no code redeploys needed. Arthur Engine's prompt management versions every change, decoupling iteration from releases (a Part 2 principle). Post-refactor, the agent reasons over tools and asks clarifying questions to produce complete tickets, saving hours weekly while evals (Part 4) validate improvements instantly.
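Decoupling prompts from code might look like the sketch below: the system prompt and tool descriptions are fetched at request time from a versioned store, so editing them requires no redeploy. The store is stubbed as a dict here; in the post this role is played by Arthur Engine's prompt management, and the names and versions are illustrative assumptions.

```python
# Stand-in for a remote, versioned prompt service.
PROMPT_STORE = {
    ("jira-bot/system", "v2"):
        "Create complete tickets with repro steps, impact, and environment.",
    ("jira-bot/tool.create_issue", "v2"):
        "Create a Jira issue. Reserve High priority for high-impact issues.",
}

def fetch_prompt(name: str, version: str = "v2") -> str:
    # In production this would be a network call to the prompt service.
    return PROMPT_STORE[(name, version)]

def build_agent_config() -> dict:
    # Assembled per request, so prompt edits take effect immediately
    # without shipping new code.
    return {
        "system": fetch_prompt("jira-bot/system"),
        "tools": [{
            "name": "create_issue",
            "description": fetch_prompt("jira-bot/tool.create_issue"),
        }],
    }
```

Bumping the version key (e.g., publishing a "v3" system prompt) changes agent behavior on the next request, which is what makes iteration cycles fast enough to run against evals continuously.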

Agent Development Flywheel Scales Any Use Case

The cycle: Instrument → Write evals → Iterate prompts remotely → Validate with evals. Applied to a simple Slack-to-Jira bot, it produced production-grade tracing, continuous checks, and versioned prompts in two weeks. It handles internal tools and customer-facing agents equally, moving beyond vibe-coding guesswork.
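The flywheel loop itself can be sketched as a driver that runs the agent, scores the resulting traces with binary evals, and iterates prompts on any failures until everything passes. The `agent`, `evals`, and `improve` callables are hypothetical stand-ins for the instrumented bot, the eval suite, and a remote prompt edit.

```python
def run_flywheel(agent, inputs, evals, prompts, improve, max_iters=5):
    """One flywheel turn per loop iteration:
    run agent -> score traces with binary evals -> iterate prompts on failures."""
    failing = []
    for _ in range(max_iters):
        traces = [agent(prompts, x) for x in inputs]
        failing = sorted({name for name, check in evals.items()
                          if not all(check(t) for t in traces)})
        if not failing:
            return prompts, traces  # all evals pass; ship
        prompts = improve(prompts, failing)  # e.g., a remote prompt edit
    raise RuntimeError(f"still failing after {max_iters} turns: {failing}")
```

Because `improve` only touches prompts and `evals` only reads traces, each turn of the loop maps directly onto the four flywheel stages, and a passing run is itself the validation step.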

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge