OpenAI's GPT-OSS: Open-Weight MoE Models for Local Agents
OpenAI releases Apache 2.0 gpt-oss-120B/20B MoE models (2.1M H100 GPU-hours of training), runnable on 60GB desktop or 12GB phone-class GPUs with o4-mini-level reasoning; Anthropic's Claude 4.1 Opus tops coding benchmarks; DeepMind's Genie 3 simulates real-time worlds with consistency beyond one minute.
GPT-OSS Delivers Agentic Reasoning to Consumer Hardware
OpenAI's first open-weight models since GPT-2, gpt-oss-120B and gpt-oss-20B, use a Mixture-of-Experts (MoE) architecture with a wide-rather-than-deep design, bias units in attention, and a custom SwiGLU variant. Trained on 2.1 million H100 GPU-hours, they match o4-mini-level reasoning on agentic tasks and ship under the Apache 2.0 license. The 120B runs on 60GB GPUs (desktops) and the 20B on 12GB (phone-class hardware); community tests report 30 tokens/s on gpt-oss-120B, 17 tokens/s quantized, and >45 tokens/s via llama.cpp's --n-cpu-moe offload on GGUF quants. Try them in the gpt-oss.com playground. The new Harmony format (open-sourced on GitHub) updates ChatML with message channels for structured outputs.
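The gated feed-forward inside each expert is worth seeing concretely. Below is a minimal sketch of a standard SwiGLU feed-forward in NumPy; gpt-oss uses a custom variant (the exact clamping and scaling differ from this textbook form), so treat the details as illustrative, not the released architecture.

```python
import numpy as np

def swish(x):
    # Swish/SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Standard SwiGLU feed-forward: the gate branch passes through swish,
    # is multiplied elementwise with the linear "up" branch, then projected
    # back to model width. gpt-oss's released variant modifies this form.
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32          # toy sizes; real expert widths are far larger
x = rng.standard_normal((4, d_model))
y = swiglu_ffn(x,
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_ff, d_model)))
print(y.shape)  # (4, 8)
```

In an MoE block, a router selects a few experts per token and each expert is one such feed-forward, which is why only a small fraction of the total parameters is active per token.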
Trade-offs: strong on reasoning but prone to hallucination, per evaluations from @sama, @rasbt, and @SebastienBubeck. The model card lists ~117B total parameters for the 120B variant, with only a small fraction (~5.1B) active per token.
Claude 4.1 Opus and Genie 3 Push Coding and Simulation Limits
Anthropic's Claude 4.1 Opus, leaked ahead of its announcement, claims the world's best coding performance, with community benchmarks confirming its lead (e.g., LMArena discussions). DeepMind's Genie 3 generates real-time world simulations with navigation and consistency beyond one minute, though demos may be cherry-picked.
Builders gain local coding agents without API costs and simulation tools for game/embodiment prototypes, but should run their own evaluations before relying on either in production.
Community Benchmarks: Run Local, Eval Hard
Reddit/Discord recaps (r/LocalLlama, Unsloth, LM Studio, etc.) focus on quantization (GGUF Q5_K_M), inference speeds (45 s/image for Qwen-Image at 1024 resolution, CFG 4, 20 steps), and integrations (llama.cpp MoE offload at 45 tokens/s). Discussions flag OpenAI OSS limits vs. closed models, GPU needs (NVMe offload, multi-GPU CUDA), and tooling such as LM Studio 0.3.21, which supports gpt-oss. EleutherAI notes scaling insights; Hugging Face covers HF CLI accelerations. Key takeaway: prioritize GGUF for desktop deployment and benchmark your own agent workflows; for example, --n-cpu-moe 2 boosted CPU MoE offload to 45 tokens/s in community tests.
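For readers wanting to reproduce the MoE-offload setup above, here is a hedged config sketch of a llama.cpp server invocation: the flag spellings match recent llama.cpp builds as discussed in the community threads, while the model filename, context size, and offload count are placeholders to tune for your hardware.

```shell
# Serve a GGUF quant of gpt-oss-120B, keeping some MoE expert tensors on CPU.
# --n-cpu-moe N leaves the expert weights of the first N layers in system RAM,
# freeing VRAM for attention and KV cache; raise N if you run out of VRAM.
llama-server \
  -m gpt-oss-120b-Q5_K_M.gguf \   # placeholder filename for your quant
  --n-gpu-layers 999 \            # offload all non-expert layers to GPU
  --n-cpu-moe 2 \                 # value cited in community benchmarks
  -c 8192                         # context size; adjust to taste
```

The interesting trade-off here is that expert weights dominate an MoE model's memory footprint but only a few experts fire per token, so paging them through CPU RAM costs far less throughput than offloading dense layers would.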