Claude Opus Tops GPT-5.4 for Reliable Coding

GPT-5.4 boosts context to 1M tokens and matches Sonnet pricing at $2.50/M input/$15/M output, but trails Opus 4.6 in agentic tasks, writes messy code, and lacks Claude's consistent behavior—stick with Anthropic for production.

GPT-5.4 Gains in Specs but Pricing Climbs

OpenAI's GPT-5.4 unifies the prior Codex and general models into one API-available option, replacing GPT-5.2, with a Pro variant. It supports a 1 million token context window for extended workflows, but beyond 272K tokens costs rise to $5/M input and $22.50/M output due to attention processing demands, mirroring Claude and Gemini norms. Base pricing is $2.50 per million input tokens and $15 per million output tokens, equal to Sonnet but part of an upward pricing trend that pushes devs toward subscriptions.
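The tiered pricing above can be sketched as a small cost estimator. The rates ($2.50/$15 base, $5/$22.50 beyond the 272K-token threshold) come from the article; the exact billing granularity (whole-request vs. per-token-band) is an assumption for illustration.

```python
# Illustrative GPT-5.4 cost estimate using the article's figures.
# Assumption: the higher rate applies to the whole request once the
# prompt exceeds the long-context threshold.

BASE_INPUT = 2.50 / 1_000_000    # $ per input token (<= 272K context)
BASE_OUTPUT = 15.00 / 1_000_000  # $ per output token (<= 272K context)
LONG_INPUT = 5.00 / 1_000_000    # $ per input token (> 272K context)
LONG_OUTPUT = 22.50 / 1_000_000  # $ per output token (> 272K context)
THRESHOLD = 272_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one request under the tiered pricing."""
    if input_tokens > THRESHOLD:
        return input_tokens * LONG_INPUT + output_tokens * LONG_OUTPUT
    return input_tokens * BASE_INPUT + output_tokens * BASE_OUTPUT

# A 100K-token prompt with a 4K-token reply:
print(round(request_cost(100_000, 4_000), 4))   # 0.31

# The same reply with a 300K-token prompt lands in the long-context tier:
print(round(request_cost(300_000, 4_000), 4))   # 1.59
```

At these rates, crossing the threshold roughly doubles per-token cost, which is why the article frames long-context usage as pushing users toward subscriptions.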

Official benchmarks show gains across tasks, positioning it as a step up from GPT-5.3. Real-world testing, however, reveals an imbalance: it excels at niche hard puzzles (though the decryption code it produced was inefficient and leaked memory) while faltering on more routine, balanced outputs.

Coding Strengths Offset by Agentic Flaws

In hands-on evals, GPT-5.4 handles Svelte, Nuxt, and Go terminal-calculator apps decently, ranking 10th overall among coding models. Creative tasks like Three.js Pokeball or chessboard generation work well, and general benchmarks (the king-bench floor plans, the panda burger) are solid but not Opus-level: floor plans lack doors and logically placed rooms, and hands render awkwardly.

Agentic workflows expose weaknesses: a Movie Tracker app fails with a broken TMDB API integration and poor UI, and during routine changes the model overrides system designs it wasn't asked to touch, citing 'efficiency' or making unrequested UI tweaks, producing messy code. This suits debugging entrenched issues but makes daily UI work painful, unlike more balanced models.

Claude Opus Delivers Production Reliability

The author sticks with Claude Code (Opus 4.6) over GPT-5.4 because of its consistent behavior: upgrading from 4.5 to 4.6 required no prompt tweaks, unlike GPT's version-to-version shifts that disrupt professional workflows. Anthropic's ecosystem also shines, with a strong community, meaningful updates (beyond models, to environments and tooling), and Cursor integration that outperforms the Codex CLI.

For cost-sensitive workflows, pair Opus with the cheap GLM-5 (for background tasks) or the Kilo GLM coding plan. GPT-5.4 adds no compelling edge at price parity, favoring gimmicky complex demos over everyday reliability.

Video description
In this video, I'll be sharing my thoughts on OpenAI's new GPT-5.4 model, including its pricing, benchmarks, coding performance, and why I still prefer Claude Code and Anthropic's ecosystem for real-world work.

Key Takeaways:

🚀 OpenAI has launched GPT-5.4 and GPT-5.4 Pro, with GPT-5.4 now available on the API as well.

🧠 GPT-5.4 now supports a 1 million token context window, which is great for longer tasks and bigger workflows.

💸 The pricing has gone up, with GPT-5.4 costing $2.50 per million input tokens and $15 per million output tokens, with even higher costs for longer context usage.

📈 OpenAI's own benchmarks show clear improvements, but in my personal testing, the model still has some noticeable weaknesses.

🛠️ GPT-5.4 performs well in some coding tasks like Svelte, Nuxt, and Go, but struggles badly in other real-world agentic tasks.

⚠️ The model feels unbalanced at times, solving surprisingly hard problems while also making weird decisions, writing messy code, and changing things I didn't ask for.

🤝 I still prefer Claude Code because of its reliability, stronger ecosystem, better community, and more meaningful updates over time.

💡 For my workflow, tools like Claude Code, Opus, and GLM-5 still make more sense than switching fully to GPT-5.4 right now.

Summarized by x-ai/grok-4.1-fast via openrouter

4999 input / 1155 output tokens in 11386ms

© 2026 Edge