Claude Opus Tops GPT-5.4 for Reliable Coding

GPT-5.4 boosts context to 1M tokens and matches Sonnet pricing at $2.50/M input/$15/M output, but trails Opus 4.6 in agentic tasks, writes messy code, and lacks Claude's consistent behavior—stick with Anthropic for production.

GPT-5.4 Gains in Specs but Pricing Climbs

OpenAI's GPT-5.4 unifies the prior Codex and general models into one API-available option, replacing GPT-5.2, with a Pro variant. It supports a 1 million token context window for extended workflows, but beyond 272K tokens costs rise to $5/M input and $22.50/M output due to attention processing demands, mirroring Claude and Gemini norms. Base pricing is $2.50 per million input tokens and $15 per million output tokens, equal to Sonnet but part of an upward pricing trend that pushes devs toward subscriptions.
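The tiered pricing above can be sketched as a small cost estimator. The rates ($2.50/$15 base, $5/$22.50 beyond the 272K-token threshold) come from the article; the exact billing granularity (whole-request vs. per-token-band) is an assumption for illustration.

```python
# Illustrative GPT-5.4 cost estimate using the article's figures.
# Assumption: the higher rate applies to the whole request once the
# prompt exceeds the long-context threshold.

BASE_INPUT = 2.50 / 1_000_000    # $ per input token (<= 272K context)
BASE_OUTPUT = 15.00 / 1_000_000  # $ per output token (<= 272K context)
LONG_INPUT = 5.00 / 1_000_000    # $ per input token (> 272K context)
LONG_OUTPUT = 22.50 / 1_000_000  # $ per output token (> 272K context)
THRESHOLD = 272_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one request under the tiered pricing."""
    if input_tokens > THRESHOLD:
        return input_tokens * LONG_INPUT + output_tokens * LONG_OUTPUT
    return input_tokens * BASE_INPUT + output_tokens * BASE_OUTPUT

# A 100K-token prompt with a 4K-token reply:
print(round(request_cost(100_000, 4_000), 4))   # 0.31

# The same reply with a 300K-token prompt lands in the long-context tier:
print(round(request_cost(300_000, 4_000), 4))   # 1.59
```

At these rates, crossing the threshold roughly doubles per-token cost, which is why the article frames long-context usage as pushing users toward subscriptions.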

Official benchmarks show gains across tasks, positioning it as a step up from GPT-5.3. Real-world testing, however, reveals an imbalance: it excels at niche hard puzzles (though the decryption code it produced was inefficient and leaked memory) while faltering on more routine, balanced outputs.

Coding Strengths Offset by Agentic Flaws

In hands-on evals, GPT-5.4 handles Svelte, Nuxt, and Go terminal-calculator apps decently, ranking 10th overall among coding models. Creative tasks like Three.js Pokeball or chessboard generation work well, and general benchmarks (the king-bench floor plans, the panda burger) are solid but not Opus-level: floor plans lack doors and logically placed rooms, and hands render awkwardly.

Agentic workflows expose weaknesses: a Movie Tracker app fails with a broken TMDB API integration and poor UI, and during routine changes the model overrides system designs it wasn't asked to touch, citing 'efficiency' or making unrequested UI tweaks, producing messy code. This suits debugging entrenched issues but makes daily UI work painful, unlike more balanced models.

Claude Opus Delivers Production Reliability

The author sticks with Claude Code (Opus 4.6) over GPT-5.4 because of its consistent behavior: upgrading from 4.5 to 4.6 required no prompt tweaks, unlike GPT's version-to-version shifts that disrupt professional workflows. Anthropic's ecosystem also shines, with a strong community, meaningful updates (beyond models, to environments and tooling), and Cursor integration that outperforms the Codex CLI.

For cost-sensitive workflows, pair Opus with the cheap GLM-5 (for background tasks) or the Kilo GLM coding plan. GPT-5.4 adds no compelling edge at price parity, favoring gimmicky complex demos over everyday reliability.

Video description
In this video, I'll be sharing my thoughts on OpenAI's new GPT-5.4 model, including its pricing, benchmarks, coding performance, and why I still prefer Claude Code and Anthropic's ecosystem for real-world work.

Key Takeaways:

🚀 OpenAI has launched GPT-5.4 and GPT-5.4 Pro, with GPT-5.4 now available on the API as well.

🧠 GPT-5.4 now supports a 1 million token context window, which is great for longer tasks and bigger workflows.

💸 The pricing has gone up, with GPT-5.4 costing $2.50 per million input tokens and $15 per million output tokens, with even higher costs for longer context usage.

📈 OpenAI's own benchmarks show clear improvements, but in my personal testing, the model still has some noticeable weaknesses.

🛠️ GPT-5.4 performs well in some coding tasks like Svelte, Nuxt, and Go, but struggles badly in other real-world agentic tasks.

⚠️ The model feels unbalanced at times, solving surprisingly hard problems while also making weird decisions, writing messy code, and changing things I didn't ask for.

🤝 I still prefer Claude Code because of its reliability, stronger ecosystem, better community, and more meaningful updates over time.

💡 For my workflow, tools like Claude Code, Opus, and GLM-5 still make more sense than switching fully to GPT-5.4 right now.

Summarized by x-ai/grok-4.1-fast via openrouter

4999 input / 1155 output tokens in 11386ms

© 2026 Edge