GPT-5.4 + Autoresearch Signal AI Self-Improvement
OpenAI's GPT-5.4 lifts workplace agent performance to 83% on GDPval (up from GPT-5.2's 70.9%), while Karpathy's autoresearch agents autonomously cut training time 11%, kickstarting closed-loop AI progress.
GPT-5.4 Prioritizes Reliable Workplace Agents Over Raw Intelligence
OpenAI's GPT-5.4, released March 5, integrates GPT-5.3-Codex's coding strengths with native computer use, tool search, an opt-in 1M-token context (272K default), and native compaction, enabling hour-long tasks without token bloat. Pricing rose to $2.50/$15 per million input/output tokens (Pro: $30/$180), but roughly 70% token savings and efficiency gains offset the increase: Mainstay tests show 95% first-try success on 30K portals at 3x the speed.

On benchmarks, GPT-5.4 ties Gemini 3.1 Pro at 57 on the Intelligence Index and leads on ProofBench, IOI, and Vibe Code Bench, but trails on GPQA, MMLU Pro, and LiveCodeBench. Workplace evals are where it shines: 83% on GDPval (vs. 70.9% for GPT-5.2 across 44 occupations), 87.3% on spreadsheets, 75% on OSWorld-Verified (beating the 72.4% human baseline), 82.7% on BrowseComp (89.3% for Pro), 91% on Harvey BigLaw, and 33% fewer hallucinations.

Monthly releases (GPT-5.2 in December; 5.3-Codex, Spark, and Instant in March) shift gains toward post-training, evals, and tools. Use steerable preambles in ChatGPT to redirect agents mid-task, but pair the model with Excel integrations for non-developers, since OpenAI still lags Claude Cowork's cross-app delegation. Microsoft counters by embedding Cowork tech into 365 Copilot for Word, Excel, and Teams, leveraging its distribution but risking execution lag.
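The claim that token savings offset the rate increase can be checked with simple arithmetic. A minimal sketch, assuming the reported ~70% reduction in tokens consumed per task; the per-task token counts and the predecessor's rates are illustrative placeholders, not published figures:

```python
def task_cost(in_tokens, out_tokens, in_rate, out_rate):
    """Cost in dollars for one task; rates are per million tokens."""
    return (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000

# Hypothetical hour-long agent task: 400K input / 60K output tokens on the
# older model; native compaction cuts both by 70% on the new one.
old = task_cost(400_000, 60_000, in_rate=1.25, out_rate=10.00)  # assumed old rates
new = task_cost(400_000 * 0.3, 60_000 * 0.3, in_rate=2.50, out_rate=15.00)

print(f"old: ${old:.2f}  new: ${new:.2f}")  # → old: $1.10  new: $0.57
```

Even with input/output rates doubled or raised 50%, the 70% token reduction leaves the per-task cost lower than before.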
Agent Swarms Unlock Economic Self-Improvement
Andrej Karpathy's autoresearch agent, run for two days on a nanochat proxy, discovered 20 code changes that transfer from depth-12 to depth-24 models, cutting "Time to GPT-2" from 2.02 to 1.80 hours (an 11% gain) via optimizer tweaks, attention changes, regularization, and data mixtures. This goes far beyond hyperparameter tuning: it automates bounded search over research ideas in a loop of propose edits, run cheap proxies, validate, iterate. Labs will allocate GPU budgets for swarms testing thousands of experiments on mechanisms, optimizers, and curricula, promoting winners to scale, while humans oversee metrics and scale-ups. OpenAI notes that GPT-5.3-Codex already self-debugged training runs and deployments; GPT-5.4 falls short of "high" self-improvement (mid-career-engineer level) but clears the lower economic bar of writing prompts, evals, and scaffolds, compounding progress. Expect year-end models where agents shape a meaningful fraction of tweaks, data, and post-training, with humans as architects.
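The propose/proxy/validate/promote loop described above can be sketched generically. Everything here is an illustrative assumption, not Karpathy's actual harness: the candidate edits are stand-in learning-rate multipliers, and the proxy/validation scorers are toy functions:

```python
import random

def autoresearch_loop(baseline_score, propose, run_proxy, validate, rounds=100):
    """Bounded search: propose a code edit, score it on a cheap proxy,
    validate survivors at target scale, keep the best configuration."""
    best_edit, best_score = None, baseline_score
    for _ in range(rounds):
        edit = propose()                 # e.g. optimizer tweak, data-mixture change
        proxy_score = run_proxy(edit)    # cheap small-model run (depth-12 proxy)
        if proxy_score <= best_score:
            continue                     # prune losers before spending full compute
        full_score = validate(edit)      # expensive check at scale (depth-24)
        if full_score > best_score:
            best_edit, best_score = edit, full_score
    return best_edit, best_score

# Toy stand-ins: edits are LR multipliers; true quality peaks near 1.3,
# and the proxy adds small noise to mimic small-scale measurement error.
random.seed(0)
propose   = lambda: random.uniform(0.5, 2.0)
run_proxy = lambda e: 1.0 - abs(e - 1.3) + random.gauss(0, 0.02)
validate  = lambda e: 1.0 - abs(e - 1.3)

edit, score = autoresearch_loop(0.5, propose, run_proxy, validate)
```

The key economics are in the early `continue`: most candidates die on the cheap proxy, so the expensive validation budget is spent only on likely winners.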
Actionable Releases for AI Builders
Google's Gemini 3.1 Flash-Lite ($0.25/$1.50 per million input/output tokens) offers selectable thinking levels (Minimal to High), trading reasoning depth for 2.5x faster time-to-first-token and 45% faster output in high-throughput multimodal tasks. Alibaba's Qwen 3.5 Small (0.8B-9B) targets the edge: the 9B model uses scaled RL for reasoning that rivals models 5-10x larger. Microsoft's open-weights Phi-4-Reasoning-Vision-15B mid-fuses SigLIP-2 for visual math, science, and GUI agents. Anthropic adds voice to Claude Code (/voice) for flow-state coding steers. Mistral's finance suite runs on-prem for compliance and search. Repos: Karpathy/autoresearch for agentic ML experiments; Google Workspace CLI (40+ skills). Papers: Bayesian teaching boosts LLM belief updating; the SkillNet ontology lifts agent rewards 40% and cuts steps 30%; SoT prompting plus T2S-Bench for text-to-structure. Build agent pipelines with these, and focus on GDPval-like evals that measure real work replacement, not MMLU.
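What distinguishes a GDPval-like eval from MMLU is that each item is a work deliverable graded against a rubric, not a multiple-choice answer. A minimal harness sketch, where the task format, the naive substring grader, and the pass threshold are all assumptions for illustration (GDPval itself uses expert human graders):

```python
from dataclasses import dataclass

@dataclass
class WorkTask:
    occupation: str
    prompt: str   # a real deliverable request, e.g. "summarize Q3 revenue"
    rubric: list  # criteria a grader checks in the output

def grade(output: str, rubric: list) -> float:
    """Fraction of rubric criteria satisfied (naive substring check)."""
    hits = sum(1 for criterion in rubric if criterion.lower() in output.lower())
    return hits / len(rubric)

def evaluate(agent, tasks, pass_threshold=0.8):
    """Pass rate across tasks: a deliverable passes if it meets at
    least pass_threshold of its rubric."""
    passes = sum(grade(agent(t.prompt), t.rubric) >= pass_threshold for t in tasks)
    return passes / len(tasks)

# Toy usage with a stub agent standing in for a model call.
tasks = [
    WorkTask("analyst", "summarize Q3 revenue", ["revenue", "Q3", "summary"]),
    WorkTask("paralegal", "draft an NDA clause", ["confidential", "term"]),
]
stub_agent = lambda prompt: "Q3 revenue summary: revenue up. Confidential term attached."
print(evaluate(stub_agent, tasks))  # → 1.0
```

Swapping the stub for an API call and the substring grader for a rubric-following judge gives the "real work replacement" signal the section recommends over MMLU-style accuracy.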