Codex Targets Knowledge Work, Claude Goes Creative, Agents Evolve
Codex upgrades let non-coders automate computer tasks via a 42% faster computer-use agent, dynamic UI, and enterprise integrations; Claude adds creative-app support (Blender, Adobe); GPT-5.5 closes the cyber-eval gap with a 71.4% pass rate vs Claude Mythos' 68.6%, signaling agent capabilities maturing across domains.
Agents Expand Beyond Coding into Workflows
OpenAI's Codex shifts from code to general knowledge work, pitched as a "SuperApp" for any computer task via role-based onboarding, Microsoft/Google/Salesforce integrations, and a planning UI resembling Cowork. Key upgrades include a 42% faster Computer Use Agent (CUA), responsive browser control, /chronicle for task history, /goal for Ralph-loop planning, dynamic task-specific UI (rejecting fixed toggles), and in-app MS Office editing. Sam Altman urges trying it for non-coding work, emphasizing productized agent UX over raw model power: non-coders can automate docs, slides, spreadsheets, and research without clunky handoffs.
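The Ralph-loop pattern behind /goal amounts to re-running an agent against the same objective until a completion check passes. A minimal sketch, where run_agent and goal_met are hypothetical stand-ins rather than real Codex APIs:

```python
# Minimal Ralph-loop sketch: re-run the agent on the same goal until a
# completion check passes or the iteration cap is hit.
# run_agent / goal_met are hypothetical stand-ins, not real Codex APIs.

def ralph_loop(goal, run_agent, goal_met, max_iters=10):
    history = []
    for i in range(max_iters):
        result = run_agent(goal, history)  # fresh attempt with prior context
        history.append(result)
        if goal_met(result):               # stop as soon as the goal checks out
            return result, i + 1
    return None, max_iters                 # cap reached without success
```

The cap matters: without it, a goal check that never passes loops forever, which is the classic failure mode of this pattern.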
Anthropic counters with Claude Security, a repo vulnerability scanner that uses Opus 4.7 to validate issues and suggest fixes, while Cursor ships a parallel Security Review for PRs and codebase scans. Claude also integrates creative tools (Blender, Autodesk, Adobe CC, Ableton, Canva), positioning it for creative workflows amid rising security risks.
Tradeoffs: dynamic UI routes experiences intelligently but risks inconsistency; integrations boost enterprise fit but widen the attack surface that cyber evals are now quantifying.
Benchmarks Show Efficiency Gains, Closing Capability Gaps
GPT-5.5 reaches cyber parity with Claude Mythos Preview on UK AI Safety Institute evals: a 71.4% average pass rate (vs 68.6%), solving the TLO chain in 2/10 attempts (vs 3/10), with performance still improving past 100M inference tokens without saturating. It sets SOTA on CritPt at ~60% lower cost and token use than GPT-5.4 Pro, prioritizing reliability and efficiency on high-value tasks over raw intelligence jumps.
Open weights advance: Qwen3.6 27B leads under 150B params (Intelligence Index 46, Apache 2.0, 262K context, multimodal, fits on a single H100) but costs 21x more than Gemma 4 31B to run due to 144M output tokens. Tencent Hy3-preview (295B/21B-active MoE, index 42) excels on CritPt (4.6%); Grok 4.3 jumps to index 53/1500 Elo on GDPval-AA at 40-60% lower prices; Ling 2.6 1T (index 34, $95 per run) trades reliability (a 92% hallucination rate) for cost.
Impact: Use Qwen3.6 for size-efficient open agents; GPT-5.5/Grok for production cyber/workflows where cost matters more than frontier scores.
Infra Shifts to Harness Engineering and Multi-Agent Systems
Agent building converges on harness engineering over model choice: Cursor details runtime evals, degradation fixes, model-specific prompts and tools, mixed offline/online testing, and dogfooding across context windows. LangChain's DeepAgents deploy offers config-driven (deepagents.toml) cloud infra for multi-tenant agents with sandboxing, RBAC, and data isolation.
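A config-driven deploy of that shape might look like the following; every field name here is an illustrative assumption, not DeepAgents' actual schema:

```toml
# Hypothetical deepagents.toml sketch -- field names are illustrative,
# not the real DeepAgents schema.
[agent]
name = "research-assistant"
model = "gpt-5.5"

[sandbox]
enabled = true            # isolate tool execution per tenant

[rbac]
roles = ["admin", "analyst"]

[data]
isolation = "per-tenant"  # no cross-tenant reads
```

The point of the pattern is that tenancy, permissions, and isolation live in declarative config rather than agent code, so ops can audit them without reading the harness.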
Collaborative agents emerge via Agent Collabs (Hugging Face buckets/Spaces for message and artifact sharing), letting weak agents validate while strong ones experiment. DeepSeek's vision work ties into computer use via bounding boxes and UI primitives for GUI/browser tasks.
Security risks spike: PyPI's 'lightning' 2.6.2/2.6.3 releases were compromised (credential theft via a Bun/JS payload); Anthropic's 1M-conversation study informs Opus 4.7/Mythos training against sycophancy. Qwen-Scope provides open interpretability tooling for steering and debugging.
Practical shift: Build agents with bespoke harnesses for reliability; deploy multi-tenant via LangChain for enterprise; monitor supply-chain attacks to protect pipelines.
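A minimal pipeline guard against compromised releases like the 'lightning' incident is to scan pinned requirements against an advisory list. The advisory dict below is hand-maintained for illustration; production pipelines should lean on pip-audit or pip's hash-checking mode instead:

```python
# Flag pinned requirements that match known-compromised releases.
# BAD_RELEASES is hand-maintained for illustration; real pipelines
# should use pip-audit or `pip install --require-hashes` instead.
BAD_RELEASES = {"lightning": {"2.6.2", "2.6.3"}}

def flag_compromised(requirements_lines):
    hits = []
    for line in requirements_lines:
        line = line.strip()
        if "==" not in line or line.startswith("#"):
            continue  # only exact pins can be matched against advisories
        name, version = line.split("==", 1)
        if version in BAD_RELEASES.get(name.lower(), set()):
            hits.append((name, version))
    return hits
```

Run it over requirements.txt in CI and fail the build on any hit; unpinned lines are skipped here, which is itself a reason to pin everything.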