AI News: Spud, Conway Agent, Cursor 3, Gemma 4 Drops
OpenAI's Spud (GPT-6?) eyes a spring 2026 release with stronger reasoning; Anthropic's Conway enables always-on browser automation; Cursor 3 runs multiple agents across environments; Qwen 3.6-Plus hits a 1M-token context; Gemma 4's 2B variant runs on iPhone at ~40k tok/s.
Frontier Closed Models Push Reasoning and Multimodality
OpenAI's Spud, internally labeled GPT-5.5 or potentially GPT-6, prioritizes raw intelligence over projects like Sora and targets a spring 2026 release. Greg Brockman describes it as having "big model smell": more intuitive adaptation to user intent, complex long-term reasoning, and flexible handling of tasks beyond fine-tunes. A trusted source notes improvements over GPT-5.4 but says it does not yet match top-tier models such as Anthropic's. Separately, OpenAI's GPT-Image-2 checkpoint on Arena (under the codenames masking tape alpha, gaffer tape alpha, and packing tape alpha) excels at world knowledge, near-perfect text rendering, and replicating specifics such as doctor's notes or company logos; it is testable now in Arena's battle mode.
Anthropic's Conway is an always-on agent that runs in its own UI instance and automates browsers via connectors and Claude Code, with webhook triggers and extensibility through the upcoming CNW ZIP format for custom tools, UI tabs, and context handlers. Claude Code's new /ultraplan (invoked via command, prompt, or web refinement) shifts detailed planning to browser-based cloud execution for better design alignment before local implementation, adds browser reviews for readability, and allows flexible remote or local execution; it is currently a research preview. Anthropic is also integrating Deepgram Nova-3 for voice, expanding Claude to multimodal speech understanding and generation, likely in upcoming releases such as Mythos or Sonnet 5.
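Conway's trigger API is not public, but the always-on pattern it describes, a webhook event waking an agent and dispatching a task, can be sketched generically. Everything here (the `AGENT_TASKS` registry, `handle_webhook`, the event names) is hypothetical and illustrative, not Conway's actual interface.

```python
import json

# Hypothetical registry mapping webhook event types to agent tasks.
# These names and event types are illustrative, not Conway's API.
AGENT_TASKS = {
    "pr.opened": lambda data: f"review PR #{data['number']}",
    "issue.created": lambda data: f"triage issue: {data['title']}",
}

def handle_webhook(body: str) -> str:
    """Parse a webhook payload and dispatch the matching agent task."""
    event = json.loads(body)
    task = AGENT_TASKS.get(event.get("type"))
    if task is None:
        return "ignored"  # unknown events are dropped, keeping the agent idle
    return task(event.get("data", {}))

# Example: a pull-request event wakes the agent and triggers a review task.
print(handle_webhook(json.dumps({"type": "pr.opened", "data": {"number": 42}})))
# → review PR #42
```

The appeal of the design is that the agent stays dormant until an external system fires an event, rather than polling or requiring a user in the loop.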
Coding IDEs and Agent Workflows Evolve
Cursor 3 is redesigned for agent-heavy coding: it runs multiple agents locally, over SSH, or in the cloud, with a separate window that surfaces editor features contextually to complement full IDEs. Anthropic's Pro/Max plans ($20-$200/month) stop covering third-party tools (e.g., OpenClaw) from April 4; using them will require separate billing. Affected users get a one-time credit equal to their subscription value plus pre-purchase discounts. The change ends an arbitrage in which $200 plans ran thousands of dollars' worth of workloads.
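Cursor 3's internals are not public, but the local/SSH/cloud multi-agent setup it describes is essentially a fan-out: the same task dispatched to several execution environments in parallel. A minimal sketch of that pattern, with `run_agent`, `fan_out`, and the environment names all invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-in for launching one agent in a given environment;
# a real implementation would shell out locally, over SSH, or to a cloud sandbox.
def run_agent(env: str, task: str) -> str:
    return f"[{env}] done: {task}"

ENVS = ["local", "ssh:build-box", "cloud:sandbox-1"]

def fan_out(task: str) -> list[str]:
    """Run the same task in several environments concurrently, collecting results."""
    with ThreadPoolExecutor(max_workers=len(ENVS)) as pool:
        return list(pool.map(lambda env: run_agent(env, task), ENVS))

print(fan_out("fix failing test"))
# → ['[local] done: fix failing test', '[ssh:build-box] done: fix failing test', '[cloud:sandbox-1] done: fix failing test']
```

Threads suffice here because the agents are assumed to be I/O-bound (waiting on remote processes), which is the typical case for orchestrating external environments.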
Open Models Excel in Context, Speed, and Benchmarks
Alibaba's Qwen 3.6-Plus delivers a 1M-token context and scores 78.8 on SWE-bench (vs. 80.9 for Claude 3 Opus), while outperforming Opus on most other benchmarks with stronger coding, cheaper pricing, image and screen understanding that navigates like a real user, and better reliability on real-world tasks. Google's Gemma 4 family (Apache 2.0), built on Gemini 3 research, supports multimodal inputs (text, images, audio, video), long context, and reasoning and coding; it ranks #3 on Arena. The 2B variant runs on an iPhone 17 Pro at roughly 40k tokens/sec via MLX optimization, enabling on-device multimodal AI. DeepSeek V4 launches within weeks as the first frontier Chinese model running natively on Huawei Ascend chips; Alibaba, ByteDance, and Tencent are ordering thousands of them, and prices are up 20%. The move signals China's reduced dependency on NVIDIA, with domestic compute stacks now viable at scale.
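The throughput figure above can't be independently verified, but a back-of-envelope memory check shows why a 2B-parameter model is plausible on a phone at all. The sketch below is a rough estimate of weight memory only (it ignores the KV cache and activations), with the parameter count and quantization levels chosen for illustration:

```python
def model_memory_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: params * bits / 8 bits-per-byte / 1e9."""
    return params * bits_per_weight / 8 / 1e9

# A 2B-parameter model at common precision/quantization levels.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {model_memory_gb(2e9, bits):.1f} GB")
# → 16-bit: 4.0 GB
# → 8-bit: 2.0 GB
# → 4-bit: 1.0 GB
```

At 4-bit quantization the weights fit in about 1 GB, comfortably within a modern phone's unified memory, which is what makes frameworks like MLX viable for on-device inference.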
These updates highlight accelerating agent automation, on-device feasibility, and hardware diversification, with specific benchmarks and access points for immediate testing.