GPT 5.5 Tops Opus 4.7 and DeepSeek V4 in Coding Benchmarks
GPT 5.5 delivers superior quality and speed for building interactive 3D web apps like flight sims and GPU shader landing pages, outperforming the pricier Opus 4.7 and the cheaper but flawed DeepSeek V4.
Cost Trade-offs Favor DeepSeek, But Performance Doesn't
DeepSeek V4, a 1.6T parameter open-weight model, undercuts competitors by roughly 8x on API costs: $3.48 per million output tokens vs. $30 for GPT 5.5 and $25 for Opus 4.7, with input at $1.70 vs. $5. Although GPT 5.5 doubles 5.4's per-token price, OpenAI claims only about a 20% effective cost increase because it needs fewer tokens per task. Opus lags in long-context retrieval (500k-1M tokens), a regression from 4.6.
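To see what the "20% effective increase" claim implies: doubling the per-token price while ending up only 20% more expensive means GPT 5.5 would need about 40% fewer tokens for the same work. A back-of-envelope sketch (the $15 GPT 5.4 price is inferred from "double the price"; token counts are hypothetical):

```ts
// Back-of-envelope check of "double the price, only ~20% more expensive".
// The $15/M figure for GPT 5.4 is inferred from "double the price";
// the token counts are hypothetical, chosen to make the claim balance.
const price54 = 15; // $ per 1M output tokens (inferred for GPT 5.4)
const price55 = 30; // $ per 1M output tokens (quoted for GPT 5.5)

const tokens54 = 1_000_000;      // hypothetical usage for a task on 5.4
const tokens55 = tokens54 * 0.6; // the claim implies ~40% fewer tokens on 5.5

const cost54 = (tokens54 / 1e6) * price54; // $15.00
const cost55 = (tokens55 / 1e6) * price55; // $18.00
console.log(`effective increase: ${((cost55 / cost54 - 1) * 100).toFixed(0)}%`); // "20%"
```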
Benchmarks show tight races: Opus leads SWE-bench Verified (86%) and SWE-bench Pro, but GPT 5.5 crushes TerminalBench 2.0 at 87.2% (beating Anthropic's internal Mythos). DeepSeek V4 trails (e.g., 85% on SWE-bench Verified) but stays within 1-5 points of the leaders at a fraction of the cost. "V4 Pro is always third place... five points isn't nothing... but again, eight times cheaper."
Real-world viability is a separate question from benchmarks: context rot hits all models beyond 500k tokens, and for cost-sensitive users the gaps shrink further.
GPT 5.5 Excels in Iterative 3D Flight Simulator Builds
Task: a browser-based Three.js flight sim with realistic physics, island/ocean terrain, toggleable cameras, and strong visuals. All models use identical skills and harnesses (Codex for GPT, Claude Code for Opus, OpenCode for DeepSeek) and are evaluated on time, tokens, quality, and "vibes."
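For context on what the task demands, here is a minimal, hypothetical sketch of its core pieces in Three.js (a toy physics step and a chase/cockpit camera toggle). This is not any model's actual output, and the physics is deliberately arcade-simple; it assumes `npm i three` and a bundler.

```ts
// Minimal flight-sim skeleton: a plane mesh, a crude physics step,
// and a chase/cockpit camera toggle on the "c" key.
import * as THREE from 'three';

const scene = new THREE.Scene();
const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

// Stand-in for the aircraft model.
const plane = new THREE.Mesh(
  new THREE.BoxGeometry(2, 0.5, 3),
  new THREE.MeshNormalMaterial()
);
scene.add(plane);

const aspect = window.innerWidth / window.innerHeight;
const chaseCam = new THREE.PerspectiveCamera(60, aspect, 0.1, 5000);
const cockpitCam = new THREE.PerspectiveCamera(75, aspect, 0.1, 5000);
let activeCam = chaseCam;
window.addEventListener('keydown', (e) => {
  if (e.key === 'c') activeCam = activeCam === chaseCam ? cockpitCam : chaseCam;
});

const velocity = new THREE.Vector3(0, 0, -50); // m/s, straight ahead
const GRAVITY = new THREE.Vector3(0, -9.81, 0);

function step(dt: number) {
  // Toy physics: lift grows with forward speed and opposes gravity.
  const speed = velocity.length();
  const lift = new THREE.Vector3(0, Math.min(9.81, speed * 0.2), 0);
  velocity.addScaledVector(GRAVITY.clone().add(lift), dt);
  plane.position.addScaledVector(velocity, dt);

  // Chase camera trails the plane; cockpit camera rides inside it.
  chaseCam.position.copy(plane.position).add(new THREE.Vector3(0, 4, 12));
  chaseCam.lookAt(plane.position);
  cockpitCam.position.copy(plane.position);
  cockpitCam.quaternion.copy(plane.quaternion);
}

let last = performance.now();
renderer.setAnimationLoop((now) => {
  step((now - last) / 1000);
  last = now;
  renderer.render(scene, activeCam);
});
```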
GPT 5.5 (Codex): First pass in 7min/63k tokens yields a playable sim with an angle-of-attack/speed/altitude HUD, clouds, and a grass runway. Iteration 1 ("easier to fly, better graphics") improves visuals; iteration 2 fixes brakes/flaps for a successful takeoff and adds rings to fly through and accurate instruments (knots, heading, vertical speed). Total: 15min/66k tokens (roughly a quarter of Opus's cost). Controls are janky but functional; kamikaze climbs hit 18k ft/min.
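The instrument accuracy mostly comes down to standard unit conversions. A hypothetical HUD helper (names are illustrative, conversion factors are standard):

```ts
// Hypothetical helper for the HUD readouts the review mentions
// (knots, heading, vertical speed, altitude).
const MS_TO_KNOTS = 1.94384; // meters/second -> knots
const MS_TO_FPM = 196.85;    // meters/second -> feet/minute
const M_TO_FT = 3.28084;     // meters -> feet

function hudReadout(velocity: { x: number; y: number; z: number }, altitudeM: number) {
  const groundSpeed = Math.hypot(velocity.x, velocity.z); // horizontal speed, m/s
  return {
    speedKt: groundSpeed * MS_TO_KNOTS,
    // atan2 over the horizontal plane gives heading in degrees, 0 = north (-z).
    headingDeg: ((Math.atan2(velocity.x, -velocity.z) * 180) / Math.PI + 360) % 360,
    verticalSpeedFpm: velocity.y * MS_TO_FPM, // 18k ft/min ~ a 91 m/s vertical climb
    altitudeFt: altitudeM * M_TO_FT,
  };
}
```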
DeepSeek V4 (OpenCode): The 10min/63k-token first pass is an "utter disaster": buggy graphics and an unrecognizable plane and cockpit. Iteration yields a chaotic mess; progress would require hyper-specific prompts and restarts. Total: a longer run than GPT's at 130k tokens/$0.44, with zero usability.
Opus 4.7 (Claude Code): A detailed 5-minute plan (stalls, controls, tricycle gear) plus a 13-minute/150k-token build; the first pass slingshots the plane into a stall in the clouds. Iterations add arcade controls and a runway spawn, but fog, tree issues, and instant dives persist, and the instruments stay too subtle. Total: 20min/200k+ tokens. "Has the actual things we needed vs DeepSeek... but struggled."
GPT wins decisively: vague prompts yield a flyable result fast and cheap. Opus comes second (thorough but slow, with overkill planning); DeepSeek is unusable.
WebGPU Shader Landing Pages Test Creative Limits
Task: an Awwwards-style landing page (e.g., Igloo) with Three.js/WebGPU shaders, mouse-reactive GPU compute, and a modern hero section. A shared shader skill was provided to all models.
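For a rough sense of the interaction the brief asks for, here is a minimal pointer-reactive particle sketch in Three.js. The actual task called for WebGPU compute shaders, which is substantially more involved; this CPU-side version (all names illustrative, assuming `npm i three` and a bundler) only shows the mouse-reactivity idea:

```ts
// Minimal pointer-reactive particle field: particles drift toward the cursor.
import * as THREE from 'three';

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(60, window.innerWidth / window.innerHeight, 0.1, 100);
camera.position.z = 30;
const renderer = new THREE.WebGLRenderer();
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

// Scatter particles in a 40-unit cube around the origin.
const COUNT = 20_000;
const positions = new Float32Array(COUNT * 3);
for (let i = 0; i < COUNT * 3; i++) positions[i] = (Math.random() - 0.5) * 40;
const geometry = new THREE.BufferGeometry();
geometry.setAttribute('position', new THREE.BufferAttribute(positions, 3));
scene.add(new THREE.Points(geometry, new THREE.PointsMaterial({ size: 0.08, color: 0x88ccff })));

// Map the pointer to approximate world coordinates on the z=0 plane.
const pointer = new THREE.Vector2();
window.addEventListener('pointermove', (e) => {
  pointer.set(
    (e.clientX / window.innerWidth - 0.5) * 40,
    -(e.clientY / window.innerHeight - 0.5) * 40
  );
});

renderer.setAnimationLoop(() => {
  const pos = geometry.getAttribute('position') as THREE.BufferAttribute;
  for (let i = 0; i < COUNT; i++) {
    // Each particle eases gently toward the pointer.
    pos.setX(i, pos.getX(i) + (pointer.x - pos.getX(i)) * 0.002);
    pos.setY(i, pos.getY(i) + (pointer.y - pos.getY(i)) * 0.002);
  }
  pos.needsUpdate = true;
  renderer.render(scene, camera);
});
```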
GPT 5.5: In 6min/107k tokens it builds a dense, full-bleed, pointer-reactive particle field with bloom and chromatic aberration. Initially too bright, overpowering the text; an iteration tones it down and shifts it right for readability. Slightly blurry, but the animation and color shifts are effective.
Opus 4.7: ~175k tokens builds an understated WebGL background (250k particles, film grain and blur, an FPS tracker) with a subtle top-to-bottom gradient; iteration adds only minor flashiness. "Cool... just not super flashy."
DeepSeek V4: The longest build, at 130k tokens/$1.43, produces a strobing, seizure-risk particle field, color-shifting text, and weak mouse-follow. Iteration adds parallax, a UFO-like blob, and a blue background: bland, and still a seizure risk.
GPT edges ahead on balance; Opus opts for tasteful subtlety; DeepSeek delivers a gimmicky failure. Despite the open-ended brief, all three plans converge on particle fields.
Practical Model Selection: Power vs. Price
GPT 5.5 proves robust across metrics: it beats Opus on speed, quality, and cost efficiency, and laps DeepSeek. It handles iterations intuitively without hand-holding. Opus shines in planning depth but bloats tokens and time for marginal gains. DeepSeek tempts on budget yet demands restarts and is unfit for complex visuals or physics.
"GPT 5.5 easily the winner... quarter the cost and... a bit faster." For production coding (e.g., 3D web apps), prioritize GPT unless pure cost rules out quality. Benchmarks hint at viability, but hands-on reveals gaps: realistic sims favor arcade tweaks over hardcore physics.
Notable Quotes:
- "While it's double the price of 5.4, they say... it ends up only being like 20% more expensive when it's all said and done."
- "Opus wins, but... V4 is always third place... isn't the huge gap you would expect. I mean, five points isn't nothing... eight times cheaper."
- "This is brutal... I feel like even giving it another prompt... I would need to start getting very, very specific."
- "For 66,000 tokens, about 10 minutes... I don't think that's bad at all."
- "GPT 5.5 did much much better... right off the rip, with pretty vague prompts."
Key Takeaways
- Default to GPT 5.5 for coding tasks that need quality and speed; its token efficiency offsets its higher per-token price.
- Use DeepSeek V4 only for simple, cost-capped prototypes; expect bugs and graphics failures on anything with complex visuals or physics.
- Opus 4.7 suits detailed planning, but cut iterations to curb its roughly 3x token bloat versus GPT.
- Prompt for arcade-style handling to get a flyable sim; realistic physics demands user-friendly overrides.
- Benchmarks overstate the gaps; test real tasks. A 1-5 point deficit reads differently when it comes with 8x cost savings.
- Equip agents with shared skills (e.g., shaders) for fair comparisons; plan mode elicits similar structures.
- Track time, tokens, and vibes: GPT hit 15min/66k tokens for a flyable sim; scale expectations accordingly.
- Avoid relying on long context (>500k tokens); the regression there hits Opus especially hard.