Inference Inflection: AI Compute Demand Explodes 10,000x
AI has reached the inference inflection: token-generation compute is up 10,000x and total demand 1,000,000x, sparking CPU shortages from refresh cycles plus agent/RL workloads, GPU prefill/decode disaggregation, and harness engineering that lifts Terminal-Bench from 69.7% to 77%.
Inference Compute Becomes Strategic Bottleneck
NVIDIA CEO Jensen Huang identifies the 'inference inflection': AI now reasons, acts, and generates via inference, multiplying compute needs 10,000x for token generation alone and 1,000,000x overall within two years. This flywheel (more capacity drives revenue, which funds smarter AI) explains the capacity crunches at OpenAI and Anthropic. Noam Brown calls inference compute 'strategic and undervalued'; Sam Altman positions OpenAI as an 'AI inference company.'
CPUs face shortages: Intel's CEO reports surging non-GPU demand just as the 5-6 year post-COVID refresh cycle comes due, after AI budgets starved maintenance CapEx. Builders diverted funds to GPUs, but agents (Claude Code, production agents, RL simulations like RL gyms and OpenClaw) demand CPUs, creating a demand up-slope after two years of underinvestment.
GPU deployments shift to prefill/decode disaggregation (NVIDIA-Groq, Intel-SambaNova, Amazon-Cerebras partnerships), optimizing inference workloads as training yields to production token generation.
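A conceptual sketch of the disaggregation idea, not any vendor's API: the compute-bound prompt pass (prefill) and the memory-bound token loop (decode) run on separate worker pools, shipping the KV cache between them. All classes and names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class KVCache:            # stand-in for attention key/value state
    prompt: str
    state: list

def prefill_worker(prompt: str) -> KVCache:
    # Compute-bound: one big parallel pass over the whole prompt.
    return KVCache(prompt=prompt, state=[f"kv({tok})" for tok in prompt.split()])

def decode_worker(cache: KVCache, max_new_tokens: int) -> list[str]:
    # Memory-bound: sequential, one token per step, reusing the shipped cache.
    out = []
    for i in range(max_new_tokens):
        out.append(f"tok{i}")            # placeholder for a sampled token
        cache.state.append(f"kv(tok{i})")
    return out

cache = prefill_worker("Summarize the Q3 capacity plan")  # prefill pool
print(decode_worker(cache, max_new_tokens=4))             # decode pool
```

Splitting the phases lets each pool be sized and batched for its own bottleneck, which is the point of the vendor pairings above.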
Build impact: Prioritize inference capacity planning; expect CPU constraints for agent sims/RL; disaggregate workloads for 10,000x scale without proportional hardware buys.
Agent Harnesses Unlock Production Gains Over Models
Raw model intelligence now matters less than harness quality: memory, retrieval, tools, orchestration. OpenAI's Codex evolves into a persistent-context platform for code, research, spreadsheets, and decisions: a WebSocket Responses API cuts agent-loop overhead 40% via warm state; $0 seats for Business/Enterprise; integrations (Supabase, Figma→FigJam).
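A minimal sketch of the warm-state idea using the documented HTTP Responses API and its previous_response_id chaining; the WebSocket transport mentioned above keeps a socket open, but the state-chaining concept is the same. The model name is illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prev_id = None
for step_input in ["Outline the fix for the failing test.", "Now write the patch."]:
    resp = client.responses.create(
        model="gpt-4.1",               # illustrative model name
        input=step_input,
        previous_response_id=prev_id,  # None on the first turn
    )
    prev_id = resp.id                  # chain the next turn onto server-held state
    print(resp.output_text)
```

Because the server retains conversation state between turns, the client stops resending the full transcript each loop iteration, which is where the claimed overhead savings come from.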
The Cursor SDK exposes its runtime and harnesses for CI/CD and embedded agents, shifting Cursor toward usage-based programmable infrastructure. VS Code adds semantic indexing, cross-repo search, chat insights, and skill context.
Agentic Harness Engineering iterates revertible components with evidence and predictions: Terminal-Bench pass@1 jumps 69.7%→77% (beating the human-built Codex-CLI harness at 71.9%), transfers across models, and cuts SWE-bench tokens 12%. HALO self-patches via trace analysis: AppWorld 73.7→89.5 on Sonnet 4.6.
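A toy sketch of that loop under assumed interfaces: propose one revertible harness change with a prediction, gather evidence from an eval run, keep only what improves pass@1. All names (HarnessConfig, run_eval, engineer) are hypothetical, and run_eval is a stand-in scorer so the sketch executes.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class HarnessConfig:
    system_prompt: str
    max_tool_calls: int

def run_eval(cfg: HarnessConfig) -> float:
    # Hypothetical stand-in for a Terminal-Bench-style run returning pass@1.
    return 0.4 + min(len(cfg.system_prompt), 40) / 400 + (0.2 - abs(cfg.max_tool_calls - 8) / 40)

def engineer(cfg: HarnessConfig, proposals) -> HarnessConfig:
    best = run_eval(cfg)
    for prediction, change in proposals:      # each change is one revertible component
        candidate = replace(cfg, **change)    # apply the change
        score = run_eval(candidate)           # collect evidence
        print(f"{prediction}: {best:.3f} -> {score:.3f}")
        if score > best:                      # prediction confirmed: keep
            cfg, best = candidate, score      # else discard (i.e., revert)
    return cfg

base = HarnessConfig(system_prompt="You are a careful shell agent.", max_tool_calls=4)
tuned = engineer(base, [
    ("raising tool budget lifts pass@1", {"max_tool_calls": 8}),
    ("test-driven prompt helps", {"system_prompt": "You are a careful, test-driven shell agent."}),
])
print(tuned)
```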
LangChain's Harness Profiles version model-specific prompts/tools; DeepAgents Deploy uses markdown/config + LangSmith tracing. Cloudflare exposes agent-accessible business flows (accounts/domains/paid plans).
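To make the profile idea concrete, a hypothetical shape for a versioned, model-specific harness profile; field names are illustrative, not LangChain's actual schema.

```python
# Hypothetical harness-profile registry: pin prompts/tools per model so a
# model swap doesn't silently change harness behavior.
PROFILES = {
    "claude-sonnet-4.6": {
        "version": "2025.11.1",
        "system_prompt": "prompts/sonnet_agent.md",   # markdown-as-config
        "tools": ["bash", "str_replace_editor", "web_search"],
        "parallel_tool_calls": True,
    },
    "gpt-5": {
        "version": "2025.11.1",
        "system_prompt": "prompts/gpt_agent.md",
        "tools": ["bash", "apply_patch"],
        "parallel_tool_calls": False,                 # model-specific quirk
    },
}

def profile_for(model: str) -> dict:
    # Fail loudly on an unpinned model rather than falling back to a default.
    return PROFILES[model]
```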
Build impact: Engineer harnesses first (evals show 7-15% lifts vs. model swaps); use OSS harnesses and profiles to control cost; deploy headless runtimes for automations.
Open Models + Kernels Push Edge/Enterprise Inference
Open-weight pressure: Mistral Medium 3.5 (dense 128B, vision reasoning, runs locally in 64GB RAM, 128K context) bets on enterprise reliability over benchmarks. IBM Granite 4.1 (30B/8B/3B, Apache 2.0) excels in openness and token efficiency: the 8B spends 4M output tokens (vs. 78M for Qwen3.5 9B) to reach an AA Intelligence Index of 61. Ant Ling-2.6 (~107B MoE, MIT, 61.2 SWE-bench); Tencent Hy-MT1.5-1.8B-1.25bit (a 440MB on-phone translator, 33 languages/1,056 directions, matching 235B-class quality via quantization). Pricing crashes: Qwen 3.5 Plus at $3/M output tokens; MiMo-V2.5 Pro shifts the Code Arena Pareto frontier at $1/$3 per M input/output tokens.
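The arithmetic behind that token-efficiency claim, assuming a flat $3 per million output tokens (the Qwen 3.5 Plus output price quoted above, applied to both models purely for comparison):

```python
PRICE_PER_M_OUTPUT = 3.00  # USD per 1M output tokens (assumed flat rate)

runs = {
    "Granite 4.1 8B": 4_000_000,    # output tokens for the eval, per the article
    "Qwen3.5 9B":     78_000_000,
}

for model, tokens in runs.items():
    cost = tokens / 1_000_000 * PRICE_PER_M_OUTPUT
    print(f"{model}: {tokens/1e6:.0f}M tokens -> ${cost:.2f}")
# At the same per-token price, the 19.5x token gap is a 19.5x cost gap.
```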
Kernel work compounds: Qwen's FlashQLA delivers 2-3x forward and 2x backward speedups on long-context workloads (TileLang, gate-CP, fused kernels) for edge agents. vLLM on Blackwell: DeepSeek V3.2 ranks #1 at 230 tok/s with 0.96s TTFT; Qwen 3.5 397B runs via NVFP4 and EAGLE3. torch.compile documentation details its FX passes; a GLM-5 postmortem fixes KV-cache races, HiCache, and LayerSplit for 132% prefill throughput.
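To make the FX-pass mention concrete: torch.compile accepts a custom backend callable that receives the traced torch.fx.GraphModule, the object those compiler passes rewrite. A minimal runnable sketch (model and shapes are toys):

```python
import torch
import torch.nn as nn

def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    # Print the FX nodes the compiler passes see before lowering.
    for node in gm.graph.nodes:
        print(node.op, node.target)
    return gm.forward  # run eagerly; a real backend would lower this graph

model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 8))
compiled = torch.compile(model, backend=inspect_backend)
out = compiled(torch.randn(4, 16))  # first call triggers tracing + backend
```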
IKP probes leak model scale: factual accuracy across 1,400 questions, 188 models, and 27 vendors correlates with parameter count (135M-1.6T) at R²=0.917, undercutting 'knowledge compression' claims.
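The fit implied there is an ordinary regression of accuracy on log parameter count; a sketch with synthetic placeholder points (the study's dataset is not reproduced here):

```python
import numpy as np

params = np.array([135e6, 1e9, 8e9, 70e9, 1.6e12])   # synthetic model sizes
accuracy = np.array([0.12, 0.25, 0.41, 0.58, 0.79])  # synthetic accuracies

x = np.log10(params)
slope, intercept = np.polyfit(x, accuracy, 1)        # least-squares line
pred = slope * x + intercept
r2 = 1 - np.sum((accuracy - pred) ** 2) / np.sum((accuracy - accuracy.mean()) ** 2)
print(f"accuracy ~ {slope:.3f}*log10(params) + {intercept:.3f}, R^2 = {r2:.3f}")
# An R^2 near 0.92 on real data means a factual-accuracy probe alone
# recovers a black-box model's rough parameter count.
```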
Build impact: Run open models locally/edge for cost (e.g., 64GB RAM vision); fuse kernels for 2x+ speed on long-context agents; probe black-boxes for hidden scaling.