AI Agents Reshape Work via Exponential Gains

Working with AI has shifted from co-intelligence (back-and-forth collaboration) to managing autonomous agents that compress hours of human work into minutes. This enables radical experiments such as human-free code factories, while exponential capability curves and recursive self-improvement (RSI) promise even steeper acceleration.

Exponential AI Progress Powers Autonomous Agents

AI capabilities follow exponential curves across benchmarks, enabling agents to replace hours of human work with minutes of output. The Otter Test illustrates the pace: the prompt 'otter on a plane using wifi' produced incoherent images in 2022 and near-perfect renders by 2025, and ByteDance's unreleased video model now produces a full documentary of otters critiquing the test itself, complete with human-like expressions and accurate narration (a single pronunciation error). METR's Long Tasks benchmark tracks the same trend: the length of tasks agents can complete autonomously and reliably keeps growing. Four diverse tests confirm the pattern: Google-Proof Q&A (graduate students score 34% outside their field and 70% inside it; top AIs hit 94%), GDPval (industry experts judge AI against humans on complex professional tasks; the latest AIs match top humans 82% of the time), Humanity's Last Exam (hard problems written by professors), and PPBench puzzles. Performance remains jagged, with high skill in some areas and failures in others, but it is now good enough to delegate whole tasks to agents like Claude Code, OpenAI Codex, or OpenClaw, moving from back-and-forth prompting (co-intelligence) to oversight (managing AIs).
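The exponential framing above can be made concrete with a small projection. METR has reported that the length of tasks agents can complete at moderate reliability has been doubling roughly every seven months; a minimal sketch under that assumption (the doubling period and function name here are illustrative, not METR's methodology):

```python
# Assumption: the task horizon agents can handle reliably grows
# exponentially, doubling roughly every 7 months (METR's reported trend).
DOUBLING_MONTHS = 7.0

def projected_horizon_minutes(base_minutes: float, months_elapsed: float,
                              doubling_months: float = DOUBLING_MONTHS) -> float:
    """Project the autonomous-task horizon after `months_elapsed` months,
    assuming exponential growth with the given doubling period."""
    return base_minutes * 2 ** (months_elapsed / doubling_months)

# Example: starting from a 60-minute horizon, 28 months is 4 doublings,
# so the projected horizon is 60 * 2**4 = 960 minutes (16 hours).
print(projected_horizon_minutes(60, 28))  # 960.0
```

The point of the sketch is the shape, not the constants: any steady doubling period turns minutes-long tasks into hours-long tasks within a couple of years.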

Software Factories Demonstrate Radical Work Redesign

Organizations are experimenting with AI to eliminate human coding roles, using agents for end-to-end production. StrongDM's three-person team built a Software Factory: humans write roadmaps; coding agents build the software; testing agents simulate customer environments; and feedback loops iterate until the AI deems the product ready. The key rule: no human-written or human-reviewed code. Each engineer spends roughly $1,000 per day on AI tokens, comparable to a salary. Finished products ship to customers without any human ever seeing the code. Details like Slack twins for agent coordination make the setup viable; observers such as Simon Willison and Dan Shapiro note both strengths (speed) and weaknesses (edge cases). This works because agents have crossed production-quality thresholds, forcing a reevaluation of team structure: prioritize roadmap thinkers over coders.

Rolling Disruptions and RSI Amplify Instability

Threshold-crossing capabilities trigger sudden shifts in markets, jobs, and policy. One February week previewed this: Citrini Research's fictional 2028 AI scenario shook stocks; Block announced 40% layoffs (AI was cited, likely as cover); and the Pentagon clashed with Anthropic over Claude's government-use rules. AI labs are pursuing recursive self-improvement (RSI): Anthropic engineers rarely code manually; OpenAI calls Codex 'instrumental in creating itself'; Google DeepMind is closing the loop despite the risks. RSI could steepen the exponentials, but it faces compute, data, and research bottlenecks, or possible ceilings inherent to LLMs. Act now: experiment with agents to set precedents, because early choices shape AI's integration into work, education, and governance before instability peaks.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge