Agentic Manual Testing: Verify AI Code Beyond Unit Tests

Execute Generated Code to Confirm It Works

Never trust LLM-generated code until it has been executed. Agents excel here because they can run the code directly and iterate when it fails. For Python libraries, use python -c "...code..." to import modules and exercise snippets; agents often discover this unprompted but respond well to reminders. For other languages, agents can write temporary files in /tmp (so nothing gets committed to the repo), then compile and run them. For web apps with JSON APIs, prompt agents to "explore" with curl, which uncovers edge cases across endpoints. Fix any failures with red/green TDD so that each one leaves a permanent test behind. This catches crashes, missing UI elements, and other details that pass unit tests but fail in reality, so features are confirmed to work as intended before release.
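
As a rough illustration of the execute-then-inspect loop described above, the sketch below runs a snippet in a fresh interpreter (the equivalent of python -c) and runs a second snippet from a throwaway file in a temporary directory. The helper names run_snippet and run_temp_script are hypothetical, chosen for this example only; an agent would normally just issue the shell commands directly.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_snippet(code, timeout=10):
    """Run `code` in a fresh interpreter, like `python -c "..."` (hypothetical helper)."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def run_temp_script(code, timeout=10):
    """Write `code` to a throwaway file outside the repo and run it (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "check.py"
        script.write_text(code)
        return subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )


# A snippet that works: exit code 0 and the expected output on stdout.
ok = run_snippet("import json; print(json.dumps({'a': 1}))")
print(ok.returncode, ok.stdout.strip())

# A snippet that crashes: non-zero exit code and a traceback on stderr.
bad = run_temp_script("import json\njson.loads('not json')\n")
print(bad.returncode, bad.stderr.strip().splitlines()[-1])
```

The point is the signal, not the wrapper: a non-zero exit code or a traceback on stderr is concrete evidence of a failure, which the agent can then reproduce as a failing test before fixing it.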

Automate Browser Testing for Realistic UI Validation

Web UIs demand browser automation because unit tests can't replicate real interactions. Prompt agents with "test that with Playwright": they will pick a language binding (Python or another) or playwright-cli and automate Chrome, Firefox, or Safari to expose issues in a live environment. Alternatively, use CLIs such as Vercel's agent-browser or Simon Willison's Rodney (run uvx rodney --help to auto-install it and print full usage docs). Rodney supports screenshots (for analysis by agent vision), JavaScript execution, scrolling, clicking, typing, and reading the accessibility tree. Example prompt: "Use uvx rodney to manually test the UI at http://localhost:8000, look at screenshots, confirm it works." Issues found this way should be codified into automated e2e tests. Agents can maintain those tests as the HTML changes, which counters the flakiness that led teams to avoid browser tests in the past.
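
For a sense of what an agent might write when asked to "test that with Playwright", here is a minimal sketch using Playwright's Python sync API. It assumes the playwright package and a Chromium build are installed; the function name smoke_test_ui, the expected text, and the screenshot path are all hypothetical placeholders, not part of any tool mentioned above.

```python
def smoke_test_ui(url, expected_text, screenshot_path="ui.png"):
    """Load `url` in headless Chromium, check for visible text, save a screenshot.

    Hypothetical helper for illustration. The import is deferred so that the
    function can be defined even where Playwright is not installed.
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page()
        page.goto(url)

        # Capture what a user would see, so an agent (or human) can inspect it.
        page.screenshot(path=screenshot_path, full_page=True)

        result = {
            "title": page.title(),
            "text_visible": page.get_by_text(expected_text).first.is_visible(),
            "screenshot": screenshot_path,
        }
        browser.close()
    return result
```

With a dev server running, a call such as smoke_test_ui("http://localhost:8000", "Welcome") would return the page title, whether the text was visible, and the screenshot path. Once a check like this catches a real bug, it is a natural candidate to promote into the permanent e2e suite.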

Document Agent Work with Showboat for Transparency

Capture testing flows as artifacts using Showboat (uvx showboat --help teaches agents its API). Key commands: note adds Markdown notes, exec runs a command and records its output (which prevents faked results), and image embeds screenshots (pairs well with Rodney). Prompt: "Use showboat note, exec, image to document your testing." The result is a demo document that shows what was actually verified, preserves what the agent learned for future reference, and builds trust in the solution.