GPT-5.5 Masters Tasks That Broke Prior Models
GPT-5.5 shifts AI from answering simple queries to carrying complex, messy real-world workloads: executive launch packages (an 87% score), data migrations that spot fake records, and interactive 3D visualizations, outperforming rivals on private benchmarks.
Floor Moved: GPT-5.5 Handles 'Carry the Work' Over Easy Answers
Previous model progress relied on inference-time boosts like extra thinking or tool use, but GPT-5.5 advances the base model's intelligence. Public benchmarks confirm this: 82% on TerminalBench (software engineering), 84% on GPQA (knowledge work), and topping Artificial Analysis's high-reasoning index by 3 points while using fewer tokens than 5.4. The key shift? From "can the model answer this?" to "can it carry this?": sustaining long contexts, producing multi-format artifacts, managing legal/ethical risks, and iterating without losing the thread.
Nate Jones argues the best model matters most for "real and ugly" work: underspecified briefs, contradictory data, tool use amid uncertainty. Easy tasks (summaries, emails, basic apps) saturate across frontier models, masking the differences. GPT-5.5, launched alongside Codex enhancements, file/browser access, and Images 2.0, forms a superior overall system. Compared to Anthropic's Opus 4.7 (strong in planning and UI taste, but a 'bridge' release), 5.5 raises what users should ask of a model, evidence that scaling laws persist.
"The old question was 'can the model answer this?' The new question is 'can the model carry this?'" (Nate Jones, contrasting benchmark saturation with sustained task endurance—core to why 5.5 feels like a 'big lift' daily.)
Dingo Test: Judgment and Production Discipline in Executive Packages
Dingo simulates a pet-tech startup (the Dingo Box Pro, an automated litter box for dingoes and hybrids in Alaska, with a subsidiary, Northern Canada Imports). The absurd premise tests nuance: commercial viability amid legal/ethical risks (exotic-pet regulations), market sizing restricted to qualified owners, and separating the subsidiary's import risks from the product itself.
A single prompt demands 23 deliverables: docs, a 17-slide deck (with 26 media assets), spreadsheets (formulas and charts), a PDF one-pager, an interactive dashboard (using the logo and hero image), comms, FAQs, personas, an email sequence, a risk assessment, and a GTM plan. Weaker models produce polished text but fake artifacts (e.g., HTML renamed as a PowerPoint file) or ignore the risks (implying ownership is easy).
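One cheap way to catch the fake-artifact failure mode is a file-signature check: real .pptx/.xlsx deliverables are ZIP containers (they begin with "PK") and real PDFs begin with "%PDF", so an HTML page renamed to .pptx is trivially detectable. A minimal sketch, where the helper name and extension map are our own:

```python
from pathlib import Path

# File signatures ("magic bytes") for common deliverable formats:
# .pptx/.xlsx/.docx are ZIP containers ("PK"), PDFs begin with "%PDF".
MAGIC = {".pptx": b"PK", ".xlsx": b"PK", ".docx": b"PK", ".pdf": b"%PDF"}

def is_real_artifact(path: Path) -> bool:
    """Return False if the file's leading bytes contradict its extension."""
    magic = MAGIC.get(path.suffix.lower())
    if magic is None:
        return True  # no known signature for this extension
    with path.open("rb") as f:
        return f.read(len(magic)) == magic
```

A check like this only proves the container is the right type, not that the slides or pages inside are any good, but it separates "real file" from "HTML as PPT" in one pass over a deliverables folder.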
GPT-5.5 scores 87.3% (vs. Opus 4.7 at 67%, Sonnet 4.7 at 65%, Gemini 3.1 Pro at 49.8%). All artifacts are usable: real file types, 34 regulatory URLs, a functional dashboard. It nails the posture: a narrow, qualified release; flagged import risks; distinguishing the curious from real buyers; disclaiming ownership hazards. Defects are minor (an XML escaping bug, NPS rounding, stale pricing): 'final mile' fixes, not structural failures.
Prior models drifted (shaky regulatory citations, underproduced artifacts). 5.5 compresses the path from nothing to a coherent first version (structure, evidence, risks), the costliest phase of executive work.
"The deliverable is assemble the launch packet." (Jones on why impressive writing fails without production-ready files humans edit/send.)
Splash Brothers: Backend Hygiene in Messy Data Migrations
A 465-file folder mimics small-business chaos (a car wash/detailing shop): CSVs and Excel files (3 schemas), JSONs (one corrupted), VCFs, scanned receipt PDFs, notes, and conflicts. The task: inventory the files, design a schema, parse/merge/reject records, write an audit report, and build a review UI. Traps: fake records (Mickey Mouse, 'test customer', ASDF, a $25K payment), 7 duplicates, 13 typos, orphans (Terren Blackwood), service-code conflicts, and enum variances.
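Traps like these can also be screened mechanically before (or after) a model touches the data. A hedged sketch of such heuristics, where the placeholder list, the keyboard-mash pattern, and the outlier multiplier are illustrative assumptions, not the benchmark's actual rules:

```python
import re

# Illustrative fake-record heuristics: placeholder names, keyboard-mash
# strings, and payments wildly above a plausible ticket for the business.
PLACEHOLDER_NAMES = {"mickey mouse", "test customer", "john doe"}
KEYBOARD_MASH = re.compile(r"^(asdf|qwer|zxcv|test)+$", re.IGNORECASE)

def looks_fake(name: str, payment: float, typical_max: float = 500.0) -> bool:
    """Flag a record as suspect; a human still reviews the flagged rows."""
    n = name.strip().lower()
    if n in PLACEHOLDER_NAMES or KEYBOARD_MASH.match(n.replace(" ", "")):
        return True
    # e.g. a $25K payment at a car wash, with a ~$500 plausible ceiling
    return payment > 40 * typical_max
```

Rules this crude would not replace the model's semantic judgment, but they give the audit report a deterministic baseline to reconcile against.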
Prior runs (5.4, Opus 4.7) normalized fakes into real revenue and customers. 5.5 is the first to catch every semantic trap: it rejects the fakes, dupes, and typos; discovers all files; produces a 7,287-line per-file audit report; recovers 186 of 192 customers; and builds a deterministic database.
But there are regressions vs. 5.4: it misses the service-code column and its conflicts, creates Blackwood as a canonical record (it should be flagged for review), leaves 29 raw payment statuses and unnormalized payment methods, shows a UI-to-database count mismatch, and overproduces services. It is stronger on human-intuitive errors, weaker on 'boring' hygiene (enums, orphans, reconciliation).
The practical takeaway: use 5.5 for the first pass (inventory, schema, extraction, audit, review UI), but validate its output (row counts, enum checks, human-reviewed merges). It is not production-canonical on its own; build trust into the surrounding system.
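A validation pass of the kind suggested here might look like the following minimal sketch, assuming a SQLite target; the `customers` and `payments` table names, the `status` column, and the allowed-status set are hypothetical:

```python
import sqlite3

# Hypothetical enum whitelist for normalized payment statuses.
ALLOWED_STATUSES = {"paid", "pending", "refunded", "failed"}

def validate_migration(db_path: str, expected_customers: int) -> list[str]:
    """Return a list of human-readable issues; empty means checks passed."""
    issues = []
    con = sqlite3.connect(db_path)
    # Row-count reconciliation against the audited source total.
    (count,) = con.execute("SELECT COUNT(*) FROM customers").fetchone()
    if count != expected_customers:
        issues.append(f"customer count {count} != expected {expected_customers}")
    # Enum hygiene: every stored status must be in the whitelist.
    for (status,) in con.execute("SELECT DISTINCT status FROM payments"):
        if status not in ALLOWED_STATUSES:
            issues.append(f"unnormalized payment status: {status!r}")
    con.close()
    return issues
```

Checks like these are exactly the 'boring' hygiene 5.5 regressed on, which is why they belong in the harness rather than the prompt.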
"No frontier model should be safe to trust with a one-shot business data migration. 5.5 narrows that claim, but doesn't eliminate it." (Jones on compressing the middle of the work while still needing safeguards.)
Artemis II: Research, Interactivity, and Visual Taste
The brief: build an interactive 3D visualization of NASA's Artemis II lunar flyby. Research the mission, model the SLS, animate launch through flyby and return, and add an environment, controls, timeline scrubbing, clickable elements, and educational content. No facts or tech stack are provided.
Both 5.5 and Opus 4.7 get the mission right (a flyby, not a landing or orbit). 5.5 is information-dense (bubbles, panels, labels), learnable but cartoonish. Opus edges it on visual composition and taste. The test reveals OpenAI's visual lag (pre-Images 2.0) and the need for routing (Opus for taste).
Tradeoffs, Routing, and Workflow Shifts
No model is perfect: 5.5 has regressions (Splash Brothers hygiene) and needs validation. The private benchmarks expose generalization gaps that are fixable via prompts and harnesses. Route accordingly: 5.5 for complex backend work and intuitive polish; Claude Opus for planning and UI taste. Codex beats ChatGPT for file, code, and browser work.
Current routing: 5.5 as the default for messy handoffs and migrations, with validation on production paths. Ambitions should rise: ask it to 'carry' work for longer.
"Leaders evaluating models on easy tasks will conclude the differences are small—and they'll be right, but only about the wrong category of work." (Jones debunking 'frontiers interchangeable' myth for real/ugly tasks.)
Key Takeaways
- Test models on private, evolving benchmarks designed to make them fail; saturated public ones no longer reveal generalization.
- Prioritize 'carry' capacity: long-context sustainment, artifact production, risk posture over quick answers.
- For executive packages like Dingo, default to GPT-5.5: it nails structure and evidence fast; tweak the final details.
- Data migrations: 5.5 first-passes messy files (catches fakes/dupes), but enforce schema validators/human review.
- Route by strength: 5.5 backend/complex; Opus taste/visuals; integrate systems (Codex/Images).
- Build around models: prompts, tools, validation compress expensive phases without blind trust.
- Track floor shifts—5.5 enables bolder asks as scaling compounds.
- Scores as a guide: Dingo at 87% with all artifacts usable; Splash Brothers a near-target database with hygiene gaps.
"5.5 feels like a bigger pre-train showing up in everyday use." (Jones on intuitive 'smarter/efficient' feel beyond benchmarks.)