Vertical Models Beat Frontiers via Experience Data
Post-training open-weight models on proprietary interaction data (Intercom's Apex for customer service, Cursor's Composer 2 for coding) outperforms frontier LLMs on speed, cost, and accuracy, signaling durable moats at the model layer.
Domain Post-Training Closes Performance Gaps
Specialized post-training on high-quality, proprietary interaction data vaults open-weight base models past frontier LLMs on vertical tasks. Intercom's Apex, built for customer service, achieves 28% higher resolution rates, 65% fewer hallucinations, faster inference, and lower costs than GPT-4.5 or Claude-3.5 Opus, leveraging billions of human-agent interactions for evals and fine-tuning. Cursor's Composer 2 starts from the open-source Qwen 2.5, applies reinforcement learning with 75% of compute spent on post-training, and beats Opus 4.6 on coding benchmarks while costing less to run. Decagon routes 80% of traffic through a network of in-house specialized models for detection, orchestration, response, and evaluation, optimizing each layer independently for speed and quality. This flywheel (usage data refines evals, evals improve models, better models generate more data) creates compounding edges unavailable to generalists.
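The layered detection/orchestration/response/evaluation pipeline described above can be sketched as a minimal router. This is a hypothetical illustration, not Decagon's implementation: every function below is a stub standing in for a separately fine-tuned model, and all names and heuristics are assumptions.

```python
from typing import Callable, Dict

def detect_intent(query: str) -> str:
    # Detection model: classify the request (stubbed with a keyword check).
    return "billing" if "refund" in query.lower() else "general"

def orchestrate(intent: str) -> str:
    # Orchestration model: pick which specialist serves this intent.
    return {"billing": "billing_specialist", "general": "generalist"}[intent]

RESPONDERS: Dict[str, Callable[[str], str]] = {
    # Response models: one small model per vertical slice of traffic.
    "billing_specialist": lambda q: "Refund initiated per policy.",
    "generalist": lambda q: "Here is some general help.",
}

def evaluate(answer: str) -> float:
    # Evaluation model: score the draft before it ships (stubbed heuristic).
    return 1.0 if answer.endswith(".") else 0.0

def handle(query: str) -> tuple[str, float]:
    # Each stage is a separate model optimized independently; swapping
    # one out does not disturb the others.
    intent = detect_intent(query)
    answer = RESPONDERS[orchestrate(intent)](query)
    return answer, evaluate(answer)

print(handle("I want a refund for last month"))
# -> ('Refund initiated per policy.', 1.0)
```

The design point the sketch makes: because each stage is its own model, usage data and evals can improve one layer at a time, which is the flywheel described above.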
Bitter Lesson Evolves: Experience Trumps Brute-Force Scale
Rich Sutton's 2019 Bitter Lesson holds: general methods that scale with compute and data beat hand-encoded human knowledge across chess, Go, vision, speech, and language. BloombergGPT (50B parameters, trained from scratch on finance data) lost to larger generalist models, showing that specialization alone fails without scale. But as pre-training data runs out, emphasis shifts to post-training, where vertical firms' 'last-mile' experience data (millions of real interactions) acts as scalable fuel that is not hand-encoded human knowledge. Sutton himself predicted this on Dwarkesh Patel's podcast: systems that learn from experience will supersede human-injected knowledge, extending the lesson rather than contradicting it. Unlike early domain hacks, this approach applies brute-force learning to experiential data, aligning with Sutton's thesis while enabling smaller, task-specialized models (per Andrej Karpathy's analogy to the diversity of animal brains).
Disruption Forces Full-Stack AI and Model Moats
API reliance erodes as vertical SaaS firms (Pinterest, Airbnb, Notion, Cursor, Intercom, and hundreds more) train in-house open models that are better, faster, and cheaper than vendor APIs, echoing the earlier shift away from paying cloud-provider markups. Durable differentiation migrates down the stack: the app layer is easily cloned, but proprietary evals and data lock in model superiority. Frontier labs (OpenAI, Anthropic) over-serve with general intelligence unnecessary for niches like coding or customer service; open weights suffice as bases. Labs must counter via specialized models, data partnerships, or M&A for evals. Not every firm will succeed (post-training expertise is scarce), but those with scale and motive (e.g., Cursor's financial need to cut API burn) will experiment aggressively. Result: hyperspecific providers compete with labs head-on, and the majority of workflows move in-house or to open source rather than APIs.
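The "cut API burn" motive reduces to simple break-even arithmetic. A minimal sketch, using entirely hypothetical prices and a made-up training budget (not real vendor rates or any firm's actual costs):

```python
# Assumed numbers for illustration only.
api_cost_per_m_tokens = 10.0        # $ per million tokens via a frontier API
inhouse_cost_per_m_tokens = 1.5     # $ per million tokens serving an open-weight model
post_training_investment = 2_000_000.0  # one-time $ for data, evals, RL runs

def breakeven_tokens(api: float, inhouse: float, fixed: float) -> float:
    """Monthly token volume (millions) at which going in-house pays back in a year."""
    monthly_savings_per_m = api - inhouse
    return fixed / (monthly_savings_per_m * 12)

volume = breakeven_tokens(
    api_cost_per_m_tokens, inhouse_cost_per_m_tokens, post_training_investment
)
print(f"{volume:,.0f} million tokens/month")  # ~19,608 under these assumptions
```

Under these assumed prices, only firms pushing roughly 20 billion tokens a month recoup the investment within a year, which is why the thesis predicts experimentation from high-volume players like Cursor rather than from every SaaS firm.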