MiniMax M2.7 Self-Evolves to Rival Closed Coding Models
Open-source MiniMax M2.7 uses MoE and self-evolution to hit 56.2% on SWE-Pro, matching GPT-4o Codex on engineering tasks while also handling office work and multi-agent flows; its autonomous self-tuning yields a 30% internal boost.
Self-Evolution Unlocks Elite Coding and Debugging
MiniMax M2.7, an open-source Mixture-of-Experts (MoE) model, activates only a subset of experts per query for efficiency and targets software engineering, office tasks, and multi-agent coordination. It scores 56.22% on SWE-Pro (GPT-4o Codex level for log analysis, bug triage, security reviews, and ML fixes), 57.0% on Terminal Bench 2, 39.8% on NL2 Repo (full-codebase understanding), and 55.6% on Vibe Pro (repo-level generation across web/Android/iOS/simulations, near Claude 3.5 Sonnet). Multilingual engineering hits 76.5% on SWE-Multilingual and 52.7% on MultiSWE-Bench.
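MiniMax hasn't published M2.7's router, but the mechanism described here is standard top-k MoE gating: a router scores every expert per token and only the k highest-scoring experts run, so compute scales with k rather than the total expert count. A minimal PyTorch sketch (dimensions and expert counts illustrative, not M2.7's actual configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k MoE layer; a sketch, not MiniMax's implementation."""
    def __init__(self, d_model=512, n_experts=64, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # per-token gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)       # (tokens, n_experts)
        weights, idx = gate.topk(self.k, dim=-1)       # keep k best experts
        weights = weights / weights.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # run only chosen experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[int(e)](x[mask])
        return out

y = TopKMoE()(torch.randn(10, 512))   # only 2 of 64 experts fire per token
```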
In production debugging, it correlates monitoring spikes with deployments, analyzes traces and databases, spots issues such as missing indexes, and suggests fixes, cutting recovery to under 3 minutes and mimicking SRE behavior. Self-evolution shines: over 100 autonomous rounds, it analyzes its own failures and tunes its scaffold (e.g., temperature and penalties, cross-file bug checks, loop detection), yielding 30% internal performance gains. On MLE-Bench Light (22 ML competitions on an A30 GPU), with self-feedback and optimization over 24-hour runs, it earns 9 golds, 5 silvers, and 1 bronze (66.6% average medal rate, tying Gemini 2.0 Flash). Internally, it automates 30-50% of the RL team's workflows.
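The exact scaffold and failure taxonomy aren't public; below is a hypothetical Python sketch of the analyze-failures-then-tune loop as described, with a stubbed task suite and stand-in failure labels:

```python
import random

def run_suite(cfg):
    # Stand-in for running the agent over a benchmark suite; returns a score
    # and failure labels. The real loop would execute actual tasks.
    score = 0.5 + 0.1 * cfg["cross_file_check"] + 0.05 * cfg["loop_detection"]
    failures = random.sample(["repetitive_output", "missed_cross_file_bug",
                              "unstable_patch"], k=random.randint(0, 3))
    return score, failures

scaffold = {"temperature": 0.7, "repetition_penalty": 1.0,
            "cross_file_check": False, "loop_detection": False}
best_score, best_cfg = -1.0, dict(scaffold)

for round_i in range(100):                      # ~100 autonomous rounds
    score, failures = run_suite(scaffold)
    if score > best_score:                      # keep improvements...
        best_score, best_cfg = score, dict(scaffold)
    else:                                       # ...revert regressions
        scaffold = dict(best_cfg)
    if "repetitive_output" in failures:         # generations stuck in loops
        scaffold["loop_detection"] = True
        scaffold["repetition_penalty"] += 0.05
    if "missed_cross_file_bug" in failures:     # bug spans several files
        scaffold["cross_file_check"] = True
    if "unstable_patch" in failures:            # over-creative, flaky edits
        scaffold["temperature"] = max(0.2, scaffold["temperature"] - 0.1)
```

The keep-or-revert step is the key design choice: each round either improves on the best known scaffold or rolls back, so 100 rounds can compound into the reported 30% gain without drifting.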
For office work, it posts 1,495 ELO on GDP-Val AA (top open model, behind only Claude 4o and GPT-4.1), 46.3% on Toolathon, and 97% skill compliance with 62.7% accuracy on MM-Claw (40+ skills, each >2k tokens), plus finance tasks like report analysis, forecasting, and PowerPoint generation.
Delegated Agents Shift AI to Background Execution
Runnable's RunClaw embeds cloud agents in Slack/Telegram/Discord for delegated tasks: it clarifies intent up front, plans, then executes without back-and-forth iteration. Built on Runnable's platform (which generates sites, videos, and decks with DB/Stripe/SEO/analytics/AI voice agents; integrates Google, Slack, Notion, GitHub, and Shopify; and keeps memory of user styles), it signals a maturing agent race; Runnable hit $2M ARR with daily updates.
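RunClaw's internals aren't public; a toy sketch of the clarify-plan-execute flow described above, with every function a hypothetical stand-in:

```python
def is_ambiguous(text: str) -> bool:
    return "?" not in text and len(text.split()) < 4       # toy heuristic

def clarify(text: str) -> str:
    return text + " (clarified with one follow-up question)"

def make_plan(request: str) -> list[str]:
    return [f"research: {request}", f"draft: {request}", f"deliver: {request}"]

def execute(step: str) -> str:
    return f"done: {step}"          # a real agent would call tools here

def handle_message(text: str) -> list[str]:
    if is_ambiguous(text):            # 1. clarify intent up front, so the
        text = clarify(text)          #    execution phase needs no back-and-forth
    plan = make_plan(text)            # 2. plan the whole job
    return [execute(s) for s in plan] # 3. execute unattended, then post results

print(handle_message("landing page for my bakery"))
```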
OpenAI's unified Codex app merges ChatGPT, Atlas, and coding into one surface, with Scratchpad for parallel tasks and managed agents (background multi-step execution via heartbeat persistence, like o1). It reduces tool-switching for end-to-end goals.
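OpenAI hasn't detailed heartbeat persistence; one plausible reading is periodic durable checkpointing so a background multi-step task survives interruption and resumes where it left off, sketched here with a hypothetical state file:

```python
import json, pathlib, time

STATE = pathlib.Path("agent_state.json")     # hypothetical checkpoint file

def load_state():
    return json.loads(STATE.read_text()) if STATE.exists() else {"step": 0}

def heartbeat(state):
    STATE.write_text(json.dumps(state))      # durable checkpoint each beat

def run_background_task(steps):
    state = load_state()                     # resume from the last checkpoint
    for i in range(state["step"], len(steps)):
        steps[i]()                           # one unit of multi-step work
        state["step"] = i + 1
        heartbeat(state)                     # persist progress
        time.sleep(0.1)                      # real agents beat on a timer

run_background_task([lambda n=n: print(f"step {n}") for n in range(3)])
```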
Multimodal Parallel Reasoning and Voice Workspaces
Meta's native multimodal Muse Spark (from Super Intelligence Labs) excels via pre-training (10x more compute-efficient than Llama 3.1), RL (steady pass@1/16 gains), and test-time reasoning with thought compression (fewer tokens, higher performance). Contemplating mode runs parallel agents for refinement, scoring 58.4% on Humanity's Last Exam (with tools), 38.3% on Frontier Science (beats GPT-4.1 Pro), and 42.8% on Health Bench Hard (beats Claude 4o Max's 14.8%; 1,000+ physician-curated data points). UI detection: 72.2%/84.1% on Screen Spot Pro (beats Claude/GPT). Coding: 77.4% on SWE-Bench Verified; weaker on ARC-AGI at 42.5%.
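Contemplating mode's internals aren't published; a generic sketch of parallel reasoning with thought compression (sample N chains concurrently, digest each chain to a token budget, refine a final answer from the digests), with `llm` a stand-in model call:

```python
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    return f"[answer derived from: {prompt[:48]}...]"  # stand-in for a model call

def compress(chain: str, budget: int = 200) -> str:
    # "Thought compression": carry a token-budgeted digest, not the full trace.
    return llm(f"Summarize this reasoning in under {budget} tokens:\n{chain}")

def contemplate(question: str, n: int = 4) -> str:
    prompts = [f"Think step by step: {question}"] * n
    with ThreadPoolExecutor(max_workers=n) as pool:    # parallel reasoning agents
        chains = list(pool.map(llm, prompts))
    digests = [compress(c) for c in chains]            # fewer tokens per chain
    return llm("Refine one best answer from:\n" + "\n---\n".join(digests))

print(contemplate("What limits ARC-AGI performance?"))
```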
Google Mixboard evolves into a voice-controlled workspace (notes, stickers, shapes, and markers plus images, like Miro/FigJam): full speech control for generation and rearranging (via Stitch infrastructure), with PDF export for brainstorming-to-docs. It's experimental, with possible Gemini/Workspace integration at I/O (May 19-20).