Agents Are Workflows: Build Reliable AI Like Louisa

True agents let LLMs decide their own steps; most automation needs are better served by code-controlled workflows with observability, strong prompts, and evaluations. Non-engineers can build them quickly using Claude Code, as the open-source Louisa tool for automating release notes shows.

Workflows Beat Autonomous Agents for Predictable Tasks

Anthropic defines workflows as code-predefined step sequences, versus agents, where LLMs choose their own actions and ordering. Louisa, an open-source tool for GitHub/GitLab release notes, exemplifies a workflow: a webhook fires on tag push, the tool fetches commits and PRs, prompts Claude to generate user-benefit-focused notes grouped by product area (filtering noise like CI updates), then publishes to Releases and Slack. This predictability enables debugging and consistency, unlike agents, which risk non-determinism. Workflows deliver value without autonomy hype, turning repetitive tasks (e.g., manual changelogs across languages and formats) into zero-touch automation that saves hours weekly.
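The fixed step order described above can be sketched in code. This is a hypothetical illustration, not Louisa's actual source: the step names, the `Commit` shape, and the noise-filter pattern are assumptions. The point is that the pipeline is hard-coded; the LLM only fills in one step.

```typescript
// Hypothetical sketch of a code-controlled workflow: the step order is
// fixed in code; the LLM is invoked for exactly one step (drafting notes).
type Commit = { sha: string; message: string };

// Filter out noise like CI/chore commits before prompting the model.
function filterNoise(commits: Commit[]): Commit[] {
  const noise = /^(ci|chore|build)(\(.*\))?:/i;
  return commits.filter((c) => !noise.test(c.message));
}

// The workflow's fixed sequence: fetch -> filter -> draft -> publish.
async function runReleaseNotesWorkflow(
  fetchCommits: () => Promise<Commit[]>,
  draftNotes: (commits: Commit[]) => Promise<string>, // the only LLM step
  publish: (notes: string) => Promise<void>,
): Promise<string> {
  const commits = filterNoise(await fetchCommits());
  const notes = await draftNotes(commits);
  await publish(notes);
  return notes;
}
```

Because every step's order and inputs are deterministic, a failure is always attributable to a specific step, which is what makes workflows debuggable in a way autonomous agents are not.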

Trade-off: less flexibility than agents, but superior reliability in production. Instrument from day one to trace inputs (prompt context), LLM outputs, token counts, and errors end-to-end; traces reveal failures like weak prompts or API issues.
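End-to-end instrumentation can be as simple as wrapping each step so its input, output, duration, and any error are recorded. This is a minimal sketch of the idea, not Arthur's actual tracing API; the `TraceEvent` shape and `traced` helper are illustrative.

```typescript
// Minimal tracing sketch (not Arthur's API): wrap each workflow step so
// inputs, outputs, timing, and errors are captured end-to-end.
type TraceEvent = {
  step: string;
  input: unknown;
  output?: unknown;
  error?: string;
  ms: number;
};

const trace: TraceEvent[] = [];

async function traced<I, O>(
  step: string,
  input: I,
  fn: (input: I) => Promise<O>,
): Promise<O> {
  const start = Date.now();
  try {
    const output = await fn(input);
    trace.push({ step, input, output, ms: Date.now() - start });
    return output;
  } catch (e) {
    // Record the failure with its input before re-throwing, so a bad
    // prompt or API error is visible in the trace, not just the logs.
    trace.push({ step, input, error: String(e), ms: Date.now() - start });
    throw e;
  }
}
```

In a real deployment the events would be shipped to an observability backend rather than held in memory, but the wrapper pattern is the same.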

Non-Engineers Build Production AI with Simple Stacks

Product managers can ship without deep coding skills: describe what you need to Claude Code in plain English and iterate rapidly. Louisa's stack, a Node.js webhook listener, a Claude LLM call, Arthur Engine for observability, and a Vercel deploy, requires only a git clone, environment variables (API keys), and webhook setup. No manual steps remain post-deploy; fork and adapt it for tasks like status reports, support summaries, or deployment checklists, wherever inputs are API-accessible and the desired output is clear.
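The setup described above boils down to two small pieces of glue: reacting only to tag pushes and reading keys from the environment. A sketch of both, with illustrative names (the env-var name and payload shape follow GitHub's push-event convention, not necessarily Louisa's exact code):

```typescript
// Hypothetical entry-point helpers for the webhook listener.
// GitHub push events set "ref" to "refs/tags/v1.2.0" on a tag push.
function isTagPush(payload: { ref?: string }): boolean {
  return typeof payload.ref === "string" && payload.ref.startsWith("refs/tags/");
}

// Fail fast at startup if a required key is missing, rather than
// failing mid-run on the first LLM call.
function requiredEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required env var: ${name}`);
  return value;
}

// Illustrative usage (variable name is an assumption):
// const apiKey = requiredEnv("ANTHROPIC_API_KEY");
```

Checking env vars at boot is what makes the "no manual steps post-deploy" promise hold: misconfiguration surfaces immediately instead of on the first release.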

Prompts drive quality: specify user benefits first ("improves X for you"), group by product area, and exclude irrelevancies. Keep prompts modular and outside the code so you can iterate without redeploying. The outcome: polished notes that humans struggle to write consistently.
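Keeping the prompt outside the code can mean as little as a template file with placeholders that the workflow fills at runtime. The template path, placeholder syntax, and wording below are illustrative assumptions, not Louisa's actual prompt:

```typescript
// Sketch of an externalized prompt: editing the template file changes
// behavior without touching or redeploying the workflow logic.
function fillTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_, key) => vars[key] ?? "");
}

// In production this would be read from a file or prompt store, e.g.:
// const template = fs.readFileSync("prompts/release-notes.txt", "utf8");
const template =
  "Write release notes for {{product}}. Lead with the user benefit, " +
  "group by product area, and skip CI or chore changes:\n{{commits}}";

const prompt = fillTemplate(template, {
  product: "Louisa",
  commits: "- feat: add Slack publishing",
});
```

With the template external, prompt tweaks become content edits that can be A/B tested, rather than code changes that need a release.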

Reliability Loop: Observe, Evaluate, Experiment

AI non-determinism demands continuous checks: trace every run (e.g., the Arthur Toolkit shows full Louisa traces), define "good" via evals (no hallucinations, correct grouping), and A/B test prompt versions. Arthur's series emphasizes this loop: observability catches issues before users do; prompt management enables safe tweaks; evals detect data shifts; experiments ensure fixes don't regress.
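Defining "good" can start with deterministic checks run over every traced output. The checks below are a hypothetical sketch of the two criteria named above (no hallucinations, correct grouping); the heuristics and names are assumptions, not Arthur's eval suite:

```typescript
// Hypothetical eval sketch: cheap deterministic checks that encode
// "good" for a generated release note.
type EvalResult = { name: string; pass: boolean };

function evaluateNotes(notes: string, knownShas: string[]): EvalResult[] {
  // Any commit SHA the notes mention must exist in the release
  // (a crude no-hallucination check).
  const mentioned = notes.match(/\b[0-9a-f]{7,40}\b/g) ?? [];
  const noHallucinatedShas = mentioned.every((sha) =>
    knownShas.some((known) => known.startsWith(sha)),
  );
  // Grouping check: notes should contain at least one section heading.
  const hasGrouping = /^#+\s|\*\*.+\*\*/m.test(notes);
  return [
    { name: "no_hallucinated_shas", pass: noHallucinatedShas },
    { name: "grouped_by_area", pass: hasGrouping },
  ];
}
```

Running checks like these on both arms of a prompt A/B test is what turns "the new prompt feels better" into a measurable comparison.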

Start small: pick one manual task you resent, prototype imperfectly, and iterate via traces. Even an imperfect tool like a v1 Louisa outperforms manual work and builds AI intuition for coming workforce shifts.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge