Verifiability: Why Code Agents Came First and Non-Code Agents Lag
Agents excel in verifiable domains like code (does it run?), which is why Anthropic's Claude Co-Work, Cursor, and Google's early agent efforts started there. Non-code "outcome agents" (e.g., automating knowledge work) struggle because success is harder to prove. Co-Work sparked a $285B SaaS stock sell-off by demoing autonomous file and app work without coding, threatening expensive SaaS suites like Salesforce. Yet Co-Work remains a research preview: it sleeps if you close your laptop, lacks always-on reliability, and demands obsessive per-session prompting. Wall Street panicked over tangible artifacts (e.g., Excel outputs via the Microsoft partnership), yet even this leader scores poorly on agent fundamentals.
"Code is something where it's easy to tell if it's good or not. It's what we call a verifiable domain. Do you know how you verify it? Does it run?"—Nate Jones explains why code agents matured first, setting the bar for all others.
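The quote's test ("does it run?") can be made literal. A minimal sketch of such a verifier, not anything from the talk: it executes candidate code against test cases, with `verify` and the `solve` entry point as hypothetical names.

```python
# Toy "does it run?" check: code is verifiable because you can execute the
# agent's output against tests, unlike most knowledge-work artifacts.
def verify(candidate_source: str, tests: list[tuple[int, int]]) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # step 1: does it even run?
        fn = namespace["solve"]            # assumed entry-point name
        # step 2: does it produce the expected outputs?
        return all(fn(x) == expected for x, expected in tests)
    except Exception:
        return False

good = "def solve(x):\n    return x * 2"
bad = "def solve(x):\n    return x + 2"
print(verify(good, [(1, 2), (3, 6)]))  # True
print(verify(bad, [(1, 2), (3, 6)]))   # False
```

Non-code outcomes (a report, a prioritized inbox) have no equivalent of this automatic pass/fail loop, which is the whole verifiability gap.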
The 3-Question Framework: Test for Real Outcomes
To cut through demos, evaluate agents on:
- Persistent memory? Sessions shouldn't reset to zero; recall past context reliably.
- Editable artifacts? Outputs must be inspectable/buildable, not opaque black boxes.
- Compounding context? Architecture improves with use; patterns emerge over time.

All three must be "yes" for compounding value. Co-Work scores 1.25/3: half-yes on memory (improving but prompt-dependent), strong artifacts (Excel prowess), no compounding.
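The framework reduces to a simple rubric. A sketch, with illustrative names; the memory/artifact split for Co-Work is an assumption chosen to match the 1.25/3 total stated above:

```python
from dataclasses import dataclass

@dataclass
class AgentScore:
    """Scores on the three-question framework, each in [0, 1]."""
    persistent_memory: float    # do sessions recall past context?
    editable_artifacts: float   # are outputs inspectable and buildable?
    compounding_context: float  # does the architecture improve with use?

    def total(self) -> float:
        return (self.persistent_memory + self.editable_artifacts
                + self.compounding_context)

    def compounds(self) -> bool:
        # Compounding value requires a full "yes" on all three questions.
        return min(self.persistent_memory, self.editable_artifacts,
                   self.compounding_context) >= 1.0

# Co-Work as scored above: half-yes memory, strong artifacts, no compounding.
cowork = AgentScore(persistent_memory=0.5, editable_artifacts=0.75,
                    compounding_context=0.0)
print(cowork.total())      # 1.25
print(cowork.compounds())  # False
```

The point of `compounds()` being a minimum, not a sum: one hard "no" (Co-Work's lack of compounding context) caps long-run value no matter how strong the artifacts are.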
This framework cuts through hype: even a partial win like Co-Work's shows enough product-market fit that demand forced Anthropic to hike usage limits.
Agent Reviews: Strengths, Failures, Trade-offs
Lindy (executive automation): Targets busy execs, turning natural language into agentic flows (vs. old Zapier-style wiring). Persistent memory: qualified yes (remembers queries and adjustments). Artifacts: no; outputs are opaque and hard to edit or debug. Compounding: unclear; texted tweaks often fail yet still burn credits. Trustpilot: 2.4/5, with complaints about runaway costs. Niche win: an easier Zapier alternative for small annoyances, but not deep outcomes. Trade-off: a simple signup and UI for execs sacrifices debuggability.
Sauna (ex-Wordware, $30M raised): Pivoted from an AI IDE after realizing users want outcomes, not building automations. "Cursor for knowledge work" with memory as a substrate (foundational, not a toggle), persistent browser logins, and strong orchestration. Key insight: knowledge workers can spec their work clearly; no programming needed. Scores promising but unproven: memory yes (in theory), artifacts maybe, compounding yes (the aim). Buzzy demos leave production doubts. Trade-off: early-stage ambition vs. real delivery.
"We're not really going to ask our knowledge workers to become programmers in the AI future. Instead, we're going to recognize that our knowledge workers need to be clear enough about their work that they can write good spec"—Jones on Sauna's durable thesis for non-coders.
Google Opal (free Labs tool): Prompt-to-workflow with Gemini 1.5 Flash; self-corrects, routes tools, and remixes public workflows (e.g., meeting prep). Zero barrier accelerates experimentation and an open ethos. Memory: a simplistic spreadsheet, not durable. Artifacts: limited. Compounding: basic. Trade-offs: free but fragile (Google kills experiments), data lock-in, lightweight use only.
Obvious (AI workspace): The most ambitious: workbooks (SQL/charts), docs, presentations, Kanban, custom apps, and cross-artifact links (slides reference spreadsheets). Pitches outcomes directly. The transcript cuts off here, but Obvious positions itself as a full replacement with editable, relational outputs. Potentially high scores; least known.
"Even if the answer to these three hard questions is like one and a half or one and a quarter out of three for Co-Work, which is like the most mature version of these agents, you still jump on it"—Jones on why imperfect Co-Work drives massive adoption.
No agent nails all three yet; demos fool, production exposes gaps.
Build vs Buy: 3-Layer Architecture for Control
Buy if a niche fits (Lindy for exec tasks, Opal for free starts). Build for control: a three-layer stack (details in the full Substack). Leverage verifiability, a memory substrate, and artifact focus. Avoid hype; demand ROI via the 3 questions. Future: compounding agents replace SaaS for knowledge work, but only if the foundations hold.
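The layering named later in the takeaways (memory substrate, orchestration, specs from non-coders) can be sketched; the full stack details are in the Substack post, so every class and method here is a hypothetical illustration, not the author's design:

```python
class MemorySubstrate:
    """Layer 1: memory as a substrate, not a toggle; persists across runs."""
    def __init__(self) -> None:
        self._facts: list[str] = []

    def remember(self, fact: str) -> None:
        self._facts.append(fact)

    def recall(self) -> list[str]:
        return list(self._facts)

class Orchestrator:
    """Layer 2: routes a spec to tools, carrying memory into every run."""
    def __init__(self, memory: MemorySubstrate) -> None:
        self.memory = memory

    def run(self, spec: str) -> str:
        context = "; ".join(self.memory.recall())
        result = f"executed '{spec}' with context [{context}]"
        self.memory.remember(spec)  # context compounds with each run
        return result

# Layer 3: the spec interface; knowledge workers write specs, not code.
memory = MemorySubstrate()
agent = Orchestrator(memory)
print(agent.run("summarize Q3 pipeline"))
print(agent.run("draft follow-up emails"))  # second run sees the first
```

The design choice worth copying: memory sits below orchestration, so every run reads and writes it automatically rather than opting in per session.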
"Memory as a substrate, not as a toggle. Compounding context"—Sauna's founder Philip Kireyev (via Jones), core to long-running agents.
Key Takeaways
- Test every agent on persistent memory, editable artifacts, compounding context—demand yes across all three.
- Start with code agents for verifiability lessons; apply to non-code (e.g., Co-Work's artifact strength).
- Lindy suits exec micro-tasks but debug opacity burns credits (2.4/5 Trustpilot).
- Sauna's memory-first pivot nails theory; watch for production proof post-$30M raise.
- Google Opal: free remixable workflows beat paid hype for prototyping.
- Build your stack: memory substrate + orchestration + specs from non-coders.
- Ignore demos; prioritize inspectability to avoid $285B-style overreactions.
- Compounding context turns one-shots into outcomes; Sauna and Obvious lead here.
- Trade-off always: ease (Lindy/Opal) vs. depth (custom builds).