Notion's 5 Agent Rebuilds to Software Factories

Notion rebuilt Custom Agents 4-5 times over 3.5 years, mastering model timing, product intuition, and evals to pioneer agentic enterprise workflows and, eventually, software factories.

Mastering Model Timing and Multiple Rebuilds

Sarah Sachs and Simon Last explain how Notion's Custom Agents took 3.5 years and 4-5 full rebuilds to launch reliably. Early 2022 attempts failed because there were no tool-calling standards, context windows were short (pre-GPT-4), and the models were unreliable. Simon Last recounts partnering with frontier labs, including OpenAI, to fine-tune function calling on Notion's tools, but the models were "too dumb." Glimmers of success kept them going, yet production robustness eluded them until Sonnet 3.5/3.7 unlocked their first agent ship last year, with Custom Agents refined further for reliable background execution.

Sarah emphasizes the intuition to avoid "swimming upstream" against model limits: pivot quickly from futile fine-tunes to building product infrastructure. They balanced shipping useful features for their massive user base with pursuing "AGI-pilled" bets. "We try to take a portfolio approach," Simon says: maintain shipped products, iterate on winners, and chase crazy moonshots like coding agents as the "kernel of AGI."

A live demo showcases a Custom Agent triaging coworking tenant emails: it enriches applicants via web search, structures data into Notion databases, and runs autonomously. This highlights progressive tool disclosure—hiding complexity until needed—and agent self-setup, where agents inspect failures and edit instructions within permission guardrails.
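
A minimal sketch of that progressive-disclosure idea, assuming a hypothetical tool registry; Notion's actual implementation is not shown in the talk, so the tool names and the `visible_tools` helper below are illustrative only:

```python
# Hypothetical sketch of progressive tool disclosure: the agent starts with a
# small core toolset and only sees specialized tools after it asks for a
# capability group, keeping the prompt lean until complexity is needed.

CORE_TOOLS = {
    "search_web": "Search the web and return top results",
    "read_page": "Read a Notion page by id",
    "list_tool_groups": "List additional tool groups that can be enabled",
}

EXTENDED_TOOL_GROUPS = {
    "databases": {
        "query_database": "Run a filtered query against a database",
        "insert_row": "Insert a structured row into a database",
    },
    "email": {
        "read_inbox": "Read new tenant application emails",
        "draft_reply": "Draft a reply for human review",
    },
}

def visible_tools(enabled_groups: set) -> dict:
    """Return the tool definitions the model is allowed to see this turn."""
    tools = dict(CORE_TOOLS)
    for group in enabled_groups:
        tools.update(EXTENDED_TOOL_GROUPS.get(group, {}))
    return tools

# The agent begins with core tools only; when it calls list_tool_groups and
# asks for "databases", the harness re-renders the tool list for the next turn.
enabled = set()
print(sorted(visible_tools(enabled)))   # core tools only
enabled.add("databases")
print(sorted(visible_tools(enabled)))   # core + database tools
```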

Agent Lab Thesis: Product Systems Over Wrappers

Notion embodies the "Agent Lab" playbook (one Sarah cites repeatedly in interviews): don't just wrap models; build collaboration systems around frontier capabilities. Sarah likens Notion to Datadog on AWS: leveraging LLMs as infrastructure while excelling at user journeys like email triage or PDF exports that demand sandboxed code execution.

Being horizontal like Notion demands edge expertise: decompose broad customer asks into reusable primitives (shared databases as memory, pages for state). Agents compose via manager agents overseeing dozens of specialists, invoking each other seamlessly. Simon's "Simon Vortex" (hackathons that pull in security early) fuels prototypes. Everyone uses Notion daily, so "demos over memos" accelerates validation; prototypes now come together faster as models mature.
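
A rough sketch of that manager/specialist composition under assumed names (`ManagerAgent` and `SharedDatabase` are stand-ins, not Notion internals), with a database acting as shared memory between agents:

```python
# Illustrative composition pattern only: a manager agent routes work to
# specialist agents, and a shared database stands in for agent memory/state.

from dataclasses import dataclass, field

@dataclass
class SharedDatabase:
    """Stand-in for a Notion database used as shared memory between agents."""
    rows: list = field(default_factory=list)

    def insert(self, row: dict) -> None:
        self.rows.append(row)

@dataclass
class ManagerAgent:
    specialists: dict          # name -> callable(task, memory) -> result
    memory: SharedDatabase

    def run(self, task: str) -> str:
        # Naive keyword routing; a real manager would ask a model to route.
        name = "enrich" if "applicant" in task else "triage"
        result = self.specialists[name](task, self.memory)
        self.memory.insert({"task": task, "specialist": name, "result": result})
        return result

manager = ManagerAgent(
    specialists={
        "triage": lambda task, db: f"triaged: {task}",
        "enrich": lambda task, db: f"enriched: {task}",
    },
    memory=SharedDatabase(),
)
print(manager.run("new applicant email from a coworking tenant"))
print(manager.memory.rows)
```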

"Notion is about being the best place for you to collaborate," Sarah says. They focus user journeys (e.g., P99 token-heavy transcripts dissected Fridays) over cool tools, ensuring agents handle real work like Meeting Notes' growth loop: transcription captures high-signal data for search, agents, and workflows.

Low-Ego AI Engineering and Org Design

Sarah runs "Token Town" (her X notes on AI leadership) with objective-setting over idea ownership: low-ego teams delete their own work and swarm fast-changing opportunities. There is no single idea person; collective intuition spots where the model "river" is flowing. Simon's list of internal agents ("no humans ever read it") shows the scale: agents for everything from specs to PRs.

The org splits into core AI infra/packaging, product teams, and a company-wide mandate: every surface must work for humans and agents. Model Behavior Engineers (a new role) write evals, analyze failures, and understand model behaviors, a discipline distinct from software engineering. Evals include regression/launch-quality tests and "frontier/headroom" suites that pass at only ~30% today, used to track model progress (e.g., Notion's Last Exam).
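
One way the two eval tiers could be wired up, sketched with invented suite names, thresholds, and toy cases rather than Notion's actual harness:

```python
# Hypothetical eval runner separating regression/launch-quality suites (gated,
# must stay near-perfect) from frontier/headroom suites (expected to pass only
# ~30% today and tracked over time as models improve).

def pass_rate(cases, run_case):
    return sum(run_case(c) for c in cases) / len(cases)

def evaluate(suites, run_case):
    for name, suite in suites.items():
        rate = pass_rate(suite["cases"], run_case)
        if suite["kind"] == "regression":
            status = "ok" if rate >= suite["threshold"] else "block launch"
        else:  # "headroom": no gate, just a model-progress signal
            status = "headroom signal, expected to rise with better models"
        print(f"{name}: {rate:.0%} -> {status}")

# Toy cases stand in for real transcripts and tasks.
suites = {
    "launch-quality": {
        "kind": "regression", "threshold": 0.95,
        "cases": [{"ok": True}] * 19 + [{"ok": False}],
    },
    "frontier-headroom": {
        "kind": "headroom",
        "cases": [{"ok": True}] * 3 + [{"ok": False}] * 7,
    },
}
evaluate(suites, run_case=lambda case: case["ok"])
```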

Software engineers' roles evolve: less typing code, more supervising agent loops (specs → self-verification → bug flows → subagents). Simon is bullish on CLI tools over MCP for self-debugging; Sarah weighs determinism, permissions, and pricing. MCP suits native integrations (e.g., Gmail); custom tooling handles power use cases.

Evals as Agent Harnesses and Retrieval Focus

Evals double as harnesses that test agent reliability end-to-end. The harness history runs from JS coding agents → custom XML → Markdown/SQL abstractions → tool definitions with short system prompts. The harnesses address the model like a "top of the class" power user, exposing capability without over-abstraction.
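
A sketch of where that harness evolution lands, using generic field names rather than any real provider API: a short system prompt plus plain tool definitions, with the same request shape reusable by the eval harness:

```python
# Illustrative endpoint of the harness evolution described above: no custom
# XML or Markdown/SQL layer, just plain tool definitions and a short system
# prompt. The field names are provider-agnostic placeholders, not a real API.

SYSTEM_PROMPT = (
    "You are an agent working inside a Notion workspace. "
    "Use the tools below and ask for more context only when you need it."
)

TOOL_DEFS = [
    {
        "name": "query_database",
        "description": "Run a filtered query against a database and return rows.",
        "parameters": {"database_id": "string", "filter": "object"},
    },
    {
        "name": "update_page",
        "description": "Edit a page, respecting permission guardrails.",
        "parameters": {"page_id": "string", "patch": "object"},
    },
]

def build_request(user_message: str) -> dict:
    """Assemble a request body; the same shape can double as an eval harness input."""
    return {
        "system": SYSTEM_PROMPT,
        "tools": TOOL_DEFS,
        "messages": [{"role": "user", "content": user_message}],
    }

print(build_request("Triage this week's tenant applications."))
```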

There is no rush to train frontier models; fine-tuned optimizations suffice. The big bet: agent-native retrieval and ranking, as searches shift from humans to agents. Meeting Notes excels by structuring collaboration data, fueling agents rather than hardware plays; Notion remains the system of record, open to wearables.

Pricing: credits abstract away tokens, search, and future sandbox costs; usage-based billing kicks in after free trials (their most successful launch). An "Auto" mode matches models to tasks.
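
A toy model of that credit abstraction, with invented conversion rates (the real pricing is not disclosed in this summary):

```python
# Invented rates for illustration: a single credit meter abstracts tokens,
# searches, and sandbox time so users see one usage number, not line items.

CREDIT_RATES = {
    "input_tokens": 0.000001,    # credits per input token (made-up)
    "output_tokens": 0.000004,   # credits per output token (made-up)
    "web_search": 0.02,          # credits per search call (made-up)
    "sandbox_seconds": 0.001,    # credits per second of sandboxed execution
}

def credits_used(usage: dict) -> float:
    """Convert a run's raw resource usage into a single credit total."""
    return sum(CREDIT_RATES[kind] * amount for kind, amount in usage.items())

run = {"input_tokens": 120_000, "output_tokens": 8_000, "web_search": 3, "sandbox_seconds": 45}
print(f"{credits_used(run):.3f} credits")
```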

"Coding agents feel like the kernel of AGI," Simon notes, teasing software factories: agents spec, code, test, debug, review codebases with minimal humans preserving invariants.

Key Takeaways

  • Rebuild ruthlessly: Kill efforts swimming upstream against model limits; portfolio-balance maintenance, iterations, and moonshots.
  • Build Agent Labs: Product intuition + primitives (databases/pages) around LLMs for collaboration, not wrappers.
  • Evals first: Regression, launch, and 30% frontier tests; Model Behavior Engineers analyze failures.
  • Compose agents: Managers + specialists + shared state; CLI for debug, MCP for integrations.
  • Price for reality: Usage-based credits covering tokens/tools; free trials convert.
  • Future-proof: Software factories via coding agents; retrieval for agent searches; data capture loops like Meeting Notes.
  • Low-ego teams: Objectives > ownership; demos > memos; security early.
  • User journeys guide: Triangulate P99 failures; primitives from real needs (e.g., PDF → sandbox).
  • Horizontal scale: Edge expertise via reusable blocks; every surface agent/human-ready.
  • Intuition skill: Spot where the model river flows and build ahead; e.g., the agent work was ready when Sonnet unlocked it.

Notable quotes:

"Coding agents are the kernel of AGI... your agent can bootstrap its own software and capabilities." — Simon Last, on moonshot directions.

"The trick is to not fine-tune futilely for too long, but realize there was something there... not swimming upstream." — Sarah Sachs, on timing rebuilds.

"Demos over memos changes product development inside a tool the whole company already uses." — Simon Last, on internal velocity.

"We’re experts in understanding how people wanna collaborate, regardless of the tools." — Sarah Sachs, Datadog analogy.

"No humans ever read the agent list... it’s super exciting." — Simon Last, on internal agent scale.

Summarized by x-ai/grok-4.1-fast via openrouter

© 2026 Edge