GPT-5.5 Raises Floor for Messy Real Work

GPT-5.5 outperforms Claude Opus 4.7 and Gemini 3.1 Pro on hard private benchmarks such as executive packages (87.3/100) and messy data migrations, shifting the question from whether a model can 'answer' to whether it can 'carry' complex tasks, though backend hygiene and visual taste still lag.

Floor Shift: From Answering to Carrying Complex Loads

GPT-5.5 doesn't just edge out GPT-5.4 on benchmarks; it relocates the baseline for what AI can reliably handle in production. Public metrics like 82% on Terminal-Bench (software engineering) and 84% on GDPval (knowledge work) confirm the gains, but Nate Jones emphasizes the qualitative leaps: the model grasps a task's shape faster, needs less guidance, and sustains intent over long contexts. This enables "carrying" workloads: the messy, multi-step jobs with contradictory data, legal risk, and artifact production where prior models faltered.

"The old question was, can the model answer this? The new question is, can the model carry this?" — Nate Jones, explaining why GPT-5.5 redefines usable AI scope beyond simple prompts.

Jones rejects the notion that frontier models are now interchangeable: easy tasks (summaries, basic code) are saturated across all of them, masking the differences. Real value emerges in "ugly" work: underspecified briefs, file chaos, ethical tightropes. Here GPT-5.5 feels "bigger and smarter," compressing human-heavy phases like structuring first drafts. The tradeoff: inference-time scaling (tools, compute) amplified the gains, but raw pre-training intelligence drove the floor raise. No lab is signaling a plateau; scaling still compounds, and ambitions should scale with it.

Private Benchmarks Expose True Gaps: Dingo Dominance

Jones' Dingo test simulates an executive handoff for a fictional Alaska pet-tech startup (the Dingo Box Pro litter box, with a Northern Canine Imports subsidiary). The absurd premise tests judgment: market sizing restricted to qualified owners only, legal and ethical risks around exotic pets, operational separation of the subsidiary. A single prompt demands 23 artifacts: documents, a 17-slide deck with 26 media elements, formula-driven spreadsheets and charts, an interactive dashboard (using the logo and hero image), a PDF one-pager, FAQs, personas, an email sequence, a risk assessment, and a GTM plan.

GPT-5.5 scored 87.3/100, crushing Claude Opus 4.7 (67.0), Sonnet 4.7 (65.0), and Gemini 3.1 Pro (49.8). It produced genuinely editable files (no fake HTML posing as PPTs), nailed the strategic posture (a narrow release to qualified owners, risk-flagged imports, no implication that exotic ownership is legal), and sourced 34 URLs for regulations. Its defects were polish-only: an unescaped ampersand in XML, minor NPS rounding, stale pricing.

Weaker models drifted: Opus produced shaky numbers, Sonnet delivered strategy without artifacts, and Gemini generated fake files unusable for a board. The insight: GPT-5.5 pairs intent alignment with production discipline, slashing the time from nothing to coherent draft, the core expense in executive work.
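
These artifact-level defects suggest the kind of mechanical checks worth running on any model-produced package before it reaches a board. A minimal sketch, assuming Python and a local deliverables directory; this is a hypothetical harness, not Jones' actual grader:

```python
# Minimal sketch of a production-discipline check over a model-produced
# artifact package. Hypothetical harness, not Jones' actual grader; the
# "deliverables" directory and the two rules are assumptions.
import re
import zipfile
from pathlib import Path

OOXML = {".pptx", ".xlsx", ".docx"}  # editable Office formats are ZIP containers

def audit_artifacts(root: str) -> list[str]:
    """Flag fake deliverables and the polish defects the Dingo run surfaced."""
    issues = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in OOXML and not zipfile.is_zipfile(path):
            # e.g. an HTML page renamed to .pptx: unusable for a board deck
            issues.append(f"{path}: claims {suffix} but is not a real Office file")
        elif suffix == ".xml":
            text = path.read_text(errors="ignore")
            # a bare '&' that doesn't start an entity (&amp;, &#38;) breaks parsers
            if re.search(r"&(?!#?\w+;)", text):
                issues.append(f"{path}: unescaped ampersand in XML")
    return issues

if __name__ == "__main__":
    for issue in audit_artifacts("deliverables"):
        print(issue)
```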

"Leaders evaluating models on easy tasks will conclude the differences are small—and they'll be right, but only about the wrong category of work." — Nate Jones, critiquing benchmark blindness to real workloads.

Data Migration Reality Check: Semantic Wins, Hygiene Lags

Splash Brothers mimics small-business data chaos: 465 files (CSVs, Excel workbooks in three schemas, JSONs including corrupted ones, VCFs, receipt PDFs, junk). The task: inventory everything, design a schema, parse/merge/reject, normalize services and prices, audit provenance, and ship a review UI. Planted traps: fake records (Mickey Mouse, a $25,000 payment, test/ASDF entries), 7 duplicates, 13 typos, orphaned records (Terence Blackwood), and conflicting service codes (a seeding sketch follows below).
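
A minimal sketch of how such traps can be planted reproducibly: the fake names and trap counts come from the article, while the row format and the plant_traps helper are assumptions for illustration:

```python
# Minimal sketch of seeding Splash-Brothers-style traps into a test corpus.
# The fake names and trap counts come from the article; the row format and
# the plant_traps helper are assumptions for illustration.
import copy
import random

FAKES = [
    {"name": "Mickey Mouse", "payment": 25_000.00},  # planted fake payment
    {"name": "ASDF", "payment": 0.0},                # junk test entry
]

def plant_traps(rows: list[dict], n_dupes: int = 7, n_typos: int = 13) -> list[dict]:
    """Return a copy of `rows` with fakes, duplicates, and typos injected."""
    rng = random.Random(42)  # fixed seed so the answer key stays stable
    out = [copy.deepcopy(r) for r in rows] + copy.deepcopy(FAKES)
    for r in rng.sample(out, k=min(n_dupes, len(out))):
        out.append(copy.deepcopy(r))  # exact duplicate records
    for r in rng.sample(out, k=min(n_typos, len(out))):
        name = r["name"]
        if name:
            i = rng.randrange(len(name))
            r["name"] = name[:i] + name[i] + name[i:]  # doubled-letter typo
    return out
```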

GPT-5.5 was the first model to catch the semantics: it rejected the fakes, duplicates, and typos, discovered all the files, produced a 7,287-line audit, recovered 186 of 192 customers, and built a deterministic database. GPT-5.4 and Opus 4.7, by contrast, normalized the fakes as real revenue.

But regressions surfaced on the private bench: it missed service codes (no schema column for them), canonicalized the orphaned Blackwood record, left 29 raw payment statuses and unnormalized payment methods, and shipped a UI-to-database count mismatch. 5.4 edged it on backend hygiene; 5.5 prioritized the intuitive catches. The practical takeaway: use it for the first pass (inventory, schema, extraction, UI), but validate enums, merges, and row counts by hand. No model earns solo production trust yet; build a system harness around it, as sketched below.
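
A minimal sketch of such a harness, assuming a SQLite output database; the table names, columns, and canonical status enum are assumptions, not the actual Splash Brothers schema:

```python
# Minimal sketch of the system harness the article recommends around a model's
# first-pass migration. Table names, column names, and the canonical enum are
# assumptions, not the actual Splash Brothers schema.
import sqlite3

KNOWN_FAKES = {"mickey mouse", "asdf"}                # records a human review flagged
CANONICAL_STATUSES = {"paid", "pending", "refunded"}  # target payment enum

def validate(db_path: str, ui_customer_count: int) -> list[str]:
    """Re-check the exact failure modes the 5.5 run exhibited."""
    problems = []
    con = sqlite3.connect(db_path)
    # 1. Fakes the model should have rejected outright
    for (name,) in con.execute("SELECT name FROM customers"):
        if name.lower() in KNOWN_FAKES:
            problems.append(f"fake customer survived: {name}")
    # 2. Payment statuses left outside the canonical enum (5.5 left 29 raw)
    for (status,) in con.execute("SELECT DISTINCT status FROM payments"):
        if status not in CANONICAL_STATUSES:
            problems.append(f"unnormalized payment status: {status}")
    # 3. Review-UI count must reconcile with the database
    (db_count,) = con.execute("SELECT COUNT(*) FROM customers").fetchone()
    if db_count != ui_customer_count:
        problems.append(f"UI shows {ui_customer_count} customers, DB has {db_count}")
    con.close()
    return problems
```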

"5.5 is the first model to catch the mistakes I planted in the data on purpose. It rejected Mickey Mouse... the fake $25,000 payment." — Nate Jones, highlighting semantic progress narrowing 'no trust' gap.

Visual/Research Builds: Routing Beats Single-Model Reliance

Artemis II demands a zero-shot 3D NASA mission visualization: research Artemis 2 (a lunar flyby), build the SLS rocket and its environment, animate launch, flyby, and return, add a timeline scrubber and clickable components, and keep it educational. It tests research, interactivity, and taste together.

GPT-5.5 and Opus 4.7 both nailed the facts (a flyby, not an orbit or landing). GPT-5.5 was information-dense (bubbles, panels) and learnable, but cartoonish in its scales and proportions. Opus showed superior visual composition and taste. OpenAI's surrounding stack bolsters 5.5: Codex for file, code, and browser operations, Images 2.0 for visuals; Claude keeps the planning edge.

Routing rules: GPT-5.5 for reasoning, executive, and data work; Claude Opus 4.7 for taste, planning, and front-end; validate visuals and data regardless (a routing sketch follows below). Prefer Codex over ChatGPT for serious work, since it operates on artifacts directly. Post-5.5 workflow: 5.5's fast modes for sharp starts, thinking modes for depth, and evolve your private tests as models advance.
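
A minimal sketch of those routing rules as code; the task taxonomy and model identifiers are illustrative assumptions, not a real API:

```python
# Minimal routing sketch following the article's rules. The task taxonomy and
# model identifiers are illustrative assumptions, not a real routing API.
ROUTES = {
    "reasoning": "gpt-5.5",          # carrying messy, multi-step loads
    "executive": "gpt-5.5",          # artifact-heavy exec packages
    "data":      "gpt-5.5",          # migrations, extraction, audits
    "design":    "claude-opus-4.7",  # visual taste and composition
    "planning":  "claude-opus-4.7",
    "frontend":  "claude-opus-4.7",
}

# Visuals and data get human validation regardless of which model produced them.
NEEDS_HUMAN_VALIDATION = {"design", "data"}

def route(task_kind: str) -> tuple[str, bool]:
    """Return (model, requires_human_validation) for a task category."""
    model = ROUTES.get(task_kind, "gpt-5.5")  # default to the higher floor
    return model, task_kind in NEEDS_HUMAN_VALIDATION
```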

"The floor moved, not just the ceiling... 5.5 feels like a bigger pre-train showing up in everyday use." — Nate Jones, capturing intuitive capability jump.

Key Takeaways

  • Design private benches for generalization: hard, evolving tests (exec packets, dirty migrations, viz builds) over saturated public ones.
  • Route by strength: GPT-5.5 for carrying messy reasoning/data; Claude for visual taste/planning; Codex for file-heavy production.
  • Trust progression: Use 5.5 for ~80% compression on complex first drafts and migrations, then human-validate hygiene and risks.
  • Ignore easy-task parity: Differences explode on real/ugly work—underspecified, contradictory, multi-artifact.
  • System > weights: Pair models with tools/files/compute/images for workflow wins.
  • Expect regressions: Frontier models tested off-distribution show quirks (e.g., 5.5's backend slip vs. 5.4); fix them with prompts and harnesses.
  • Raise ambitions: Floor shifts enable bolder asks; scale still works, curve unbroken.
