GPT 5.5 Tops Opus 4.7 and DeepSeek V4 in Coding Benchmarks
GPT 5.5 delivers superior quality and speed for building interactive 3D web apps like flight sims and GPU shader landing pages, outperforming the pricier Opus 4.7 and the cheaper but flawed DeepSeek V4.
Cost Trade-offs Favor DeepSeek, But Performance Doesn't
DeepSeek V4, a 1.6T parameter open-weight model, undercuts competitors by roughly 8x on API costs: $3.48 per million output tokens vs. $30 for GPT 5.5 and $25 for Opus 4.7, with input at $1.70 vs. $5. Although GPT 5.5 doubles 5.4's per-token price, OpenAI claims only about a 20% effective cost increase because it needs fewer tokens per task. Opus lags in long-context retrieval (500k-1M tokens), a regression from 4.6.
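To see what the "20% effective increase" claim implies: doubling the per-token price while ending up only 20% more expensive means GPT 5.5 would need about 40% fewer tokens for the same work. A back-of-envelope sketch (the $15 GPT 5.4 price is inferred from "double the price"; token counts are hypothetical):

```ts
// Back-of-envelope check of "double the price, only ~20% more expensive".
// The $15/M figure for GPT 5.4 is inferred from "double the price";
// the token counts are hypothetical, chosen to make the claim balance.
const price54 = 15; // $ per 1M output tokens (inferred for GPT 5.4)
const price55 = 30; // $ per 1M output tokens (quoted for GPT 5.5)

const tokens54 = 1_000_000;      // hypothetical usage for a task on 5.4
const tokens55 = tokens54 * 0.6; // the claim implies ~40% fewer tokens on 5.5

const cost54 = (tokens54 / 1e6) * price54; // $15.00
const cost55 = (tokens55 / 1e6) * price55; // $18.00
console.log(`effective increase: ${((cost55 / cost54 - 1) * 100).toFixed(0)}%`); // "20%"
```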
Benchmarks show tight races: Opus leads SWE-bench Verified (86%) and SWE-bench Pro, but GPT 5.5 crushes TerminalBench 2.0 at 87.2% (beating Anthropic's internal Mythos). DeepSeek V4 trails (e.g., 85% on SWE-bench Verified) but stays within 1-5 points of the leaders at a fraction of the cost. "V4 Pro is always third place... five points isn't nothing... but again, eight times cheaper."
Real-world viability is a separate question from benchmarks: context rot hits all models beyond 500k tokens, and for cost-sensitive users the gaps shrink further.
GPT 5.5 Excels in Iterative 3D Flight Simulator Builds
Task: a browser-based Three.js flight sim with realistic physics, island/ocean terrain, toggleable cameras, and strong visuals. All models use identical skills and harnesses (Codex for GPT, Claude Code for Opus, OpenCode for DeepSeek) and are evaluated on time, tokens, quality, and "vibes."
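For context on what the task demands, here is a minimal, hypothetical sketch of its core pieces in Three.js (a toy physics step and a chase/cockpit camera toggle). This is not any model's actual output, and the physics is deliberately arcade-simple; it assumes `npm i three` and a bundler.

```ts
// Minimal flight-sim skeleton: a plane mesh, a crude physics step,
// and a chase/cockpit camera toggle on the "c" key.
import * as THREE from 'three';

const scene = new THREE.Scene();
const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

// Stand-in for the aircraft model.
const plane = new THREE.Mesh(
  new THREE.BoxGeometry(2, 0.5, 3),
  new THREE.MeshNormalMaterial()
);
scene.add(plane);

const aspect = window.innerWidth / window.innerHeight;
const chaseCam = new THREE.PerspectiveCamera(60, aspect, 0.1, 5000);
const cockpitCam = new THREE.PerspectiveCamera(75, aspect, 0.1, 5000);
let activeCam = chaseCam;
window.addEventListener('keydown', (e) => {
  if (e.key === 'c') activeCam = activeCam === chaseCam ? cockpitCam : chaseCam;
});

const velocity = new THREE.Vector3(0, 0, -50); // m/s, straight ahead
const GRAVITY = new THREE.Vector3(0, -9.81, 0);

function step(dt: number) {
  // Toy physics: lift grows with forward speed and opposes gravity.
  const speed = velocity.length();
  const lift = new THREE.Vector3(0, Math.min(9.81, speed * 0.2), 0);
  velocity.addScaledVector(GRAVITY.clone().add(lift), dt);
  plane.position.addScaledVector(velocity, dt);

  // Chase camera trails the plane; cockpit camera rides inside it.
  chaseCam.position.copy(plane.position).add(new THREE.Vector3(0, 4, 12));
  chaseCam.lookAt(plane.position);
  cockpitCam.position.copy(plane.position);
  cockpitCam.quaternion.copy(plane.quaternion);
}

let last = performance.now();
renderer.setAnimationLoop((now) => {
  step((now - last) / 1000);
  last = now;
  renderer.render(scene, activeCam);
});
```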
GPT 5.5 (Codex): First pass in 7min/63k tokens yields a playable sim with an angle-of-attack/speed/altitude HUD, clouds, and a grass runway. Iteration 1 ("easier to fly, better graphics") improves visuals; iteration 2 fixes brakes/flaps for a successful takeoff and adds rings to fly through and accurate instruments (knots, heading, vertical speed). Total: 15min/66k tokens (roughly a quarter of Opus's cost). Controls are janky but functional; kamikaze climbs hit 18k ft/min.
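The instrument accuracy mostly comes down to standard unit conversions. A hypothetical HUD helper (names are illustrative, conversion factors are standard):

```ts
// Hypothetical helper for the HUD readouts the review mentions
// (knots, heading, vertical speed, altitude).
const MS_TO_KNOTS = 1.94384; // meters/second -> knots
const MS_TO_FPM = 196.85;    // meters/second -> feet/minute
const M_TO_FT = 3.28084;     // meters -> feet

function hudReadout(velocity: { x: number; y: number; z: number }, altitudeM: number) {
  const groundSpeed = Math.hypot(velocity.x, velocity.z); // horizontal speed, m/s
  return {
    speedKt: groundSpeed * MS_TO_KNOTS,
    // atan2 over the horizontal plane gives heading in degrees, 0 = north (-z).
    headingDeg: ((Math.atan2(velocity.x, -velocity.z) * 180) / Math.PI + 360) % 360,
    verticalSpeedFpm: velocity.y * MS_TO_FPM, // 18k ft/min ~ a 91 m/s vertical climb
    altitudeFt: altitudeM * M_TO_FT,
  };
}
```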
DeepSeek V4 (OpenCode): The 10min/63k-token first pass is an "utter disaster": buggy graphics and an unrecognizable plane and cockpit. Iteration yields a chaotic mess; progress would require hyper-specific prompts and restarts. Total: a longer run than GPT's at 130k tokens/$0.44, with zero usability.
Opus 4.7 (Claude Code): A detailed 5-minute plan (stalls, controls, tricycle gear) plus a 13-minute/150k-token build; the first pass slingshots the plane into a stall in the clouds. Iterations add arcade controls and a runway spawn, but fog, tree issues, and instant dives persist, and the instruments stay too subtle. Total: 20min/200k+ tokens. "Has the actual things we needed vs DeepSeek... but struggled."
GPT wins decisively: vague prompts yield a flyable result fast and cheap. Opus comes second (thorough but slow, with overkill planning); DeepSeek is unusable.
WebGPU Shader Landing Pages Test Creative Limits
Task: an Awwwards-style landing page (e.g., Igloo) with Three.js/WebGPU shaders, mouse-reactive GPU compute, and a modern hero section. A shared shader skill was provided to all models.
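For a rough sense of the interaction the brief asks for, here is a minimal pointer-reactive particle sketch in Three.js. The actual task called for WebGPU compute shaders, which is substantially more involved; this CPU-side version (all names illustrative, assuming `npm i three` and a bundler) only shows the mouse-reactivity idea:

```ts
// Minimal pointer-reactive particle field: particles drift toward the cursor.
import * as THREE from 'three';

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(60, window.innerWidth / window.innerHeight, 0.1, 100);
camera.position.z = 30;
const renderer = new THREE.WebGLRenderer();
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

// Scatter particles in a 40-unit cube around the origin.
const COUNT = 20_000;
const positions = new Float32Array(COUNT * 3);
for (let i = 0; i < COUNT * 3; i++) positions[i] = (Math.random() - 0.5) * 40;
const geometry = new THREE.BufferGeometry();
geometry.setAttribute('position', new THREE.BufferAttribute(positions, 3));
scene.add(new THREE.Points(geometry, new THREE.PointsMaterial({ size: 0.08, color: 0x88ccff })));

// Map the pointer to approximate world coordinates on the z=0 plane.
const pointer = new THREE.Vector2();
window.addEventListener('pointermove', (e) => {
  pointer.set(
    (e.clientX / window.innerWidth - 0.5) * 40,
    -(e.clientY / window.innerHeight - 0.5) * 40
  );
});

renderer.setAnimationLoop(() => {
  const pos = geometry.getAttribute('position') as THREE.BufferAttribute;
  for (let i = 0; i < COUNT; i++) {
    // Each particle eases gently toward the pointer.
    pos.setX(i, pos.getX(i) + (pointer.x - pos.getX(i)) * 0.002);
    pos.setY(i, pos.getY(i) + (pointer.y - pos.getY(i)) * 0.002);
  }
  pos.needsUpdate = true;
  renderer.render(scene, camera);
});
```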
GPT 5.5: In 6min/107k tokens it builds a dense, full-bleed, pointer-reactive particle field with bloom and chromatic aberration. Initially too bright, overpowering the text; an iteration tones it down and shifts it right for readability. Slightly blurry, but the animation and color shifts are effective.
Opus 4.7: ~175k tokens builds an understated WebGL background (250k particles, film grain and blur, an FPS tracker) with a subtle top-to-bottom gradient; iteration adds only minor flashiness. "Cool... just not super flashy."
DeepSeek V4: The longest build, at 130k tokens/$1.43, produces a strobing, seizure-risk particle field, color-shifting text, and weak mouse-follow. Iteration adds parallax, a UFO-like blob, and a blue background: bland, and still a seizure risk.
GPT edges ahead on balance; Opus opts for tasteful subtlety; DeepSeek delivers a gimmicky failure. Despite the open-ended brief, all three plans converge on particle fields.
Practical Model Selection: Power vs. Price
GPT 5.5 proves robust across metrics: it beats Opus on speed, quality, and cost efficiency, and laps DeepSeek. It handles iterations intuitively without hand-holding. Opus shines in planning depth but bloats tokens and time for marginal gains. DeepSeek tempts on budget yet demands restarts and is unfit for complex visuals or physics.
"GPT 5.5 easily the winner... quarter the cost and... a bit faster." For production coding (e.g., 3D web apps), prioritize GPT unless pure cost rules out quality. Benchmarks hint at viability, but hands-on reveals gaps: realistic sims favor arcade tweaks over hardcore physics.
Notable Quotes:
- "While it's double the price of 5.4, they say... it ends up only being like 20% more expensive when it's all said and done."
- "Opus wins, but... V4 is always third place... isn't the huge gap you would expect. I mean, five points isn't nothing... eight times cheaper."
- "This is brutal... I feel like even giving it another prompt... I would need to start getting very, very specific."
- "For 66,000 tokens, about 10 minutes... I don't think that's bad at all."
- "GPT 5.5 did much much better... right off the rip, with pretty vague prompts."
Key Takeaways
- Default to GPT 5.5 for coding tasks that need quality and speed; its token efficiency offsets its higher per-token price.
- Use DeepSeek V4 only for simple, cost-capped prototypes; expect bugs and graphics failures on anything with complex visuals or physics.
- Opus 4.7 suits detailed planning, but cut iterations to curb its roughly 3x token bloat versus GPT.
- Prompt for arcade-style handling to get a flyable sim; realistic physics demands user-friendly overrides.
- Benchmarks overstate the gaps; test real tasks. A 1-5 point deficit reads differently when it comes with 8x cost savings.
- Equip agents with shared skills (e.g., shaders) for fair comparisons; plan mode elicits similar structures.
- Track time, tokens, and vibes: GPT hit 15min/66k tokens for a flyable sim; scale expectations accordingly.
- Avoid relying on long context (>500k tokens); the regression there hits Opus especially hard.