gpt-image-2 Masters Hidden Details in Waldo Tests

High-Res Boosts Detail Fidelity

Use gpt-image-2 with --quality high --size 3840x2160 for complex scenes that need fine detail and integrated text. The result is a 17MB PNG (compressible to a 5MB WEBP) in which a raccoon holding a ham radio is clearly visible at the bottom left of a crowded fairground; the raccoon is absent from standard-resolution outputs. Lower-resolution output and the older gpt-image-1 fail entirely: no raccoon is discernible, even when Claude Opus 4.7 is asked to analyze the image. Google's Nano Banana 2 places the raccoon conspicuously in a central booth, but the result lacks the subtlety and quality of gpt-image-2.

Prompt precisely: "Do a where's Waldo style image but it's where is the raccoon holding a ham radio." Where's Waldo tests expose a model's limits in handling occlusion, text rendering (e.g., booth signs), and dense compositions, which makes them well suited to evaluating production-ready illustration capabilities.

Script for Reliable Generation

Wrap OpenAI's Python client in a script such as openai_image.py to use gpt-image-2 before the SDK officially supports it:

OPENAI_API_KEY="$(llm keys get openai)" \
  uv run https://tools.simonwillison.net/python/openai_image.py \
  -m gpt-image-2 "prompt" --quality high --size 3840x2160
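
For readers who want to see roughly what such a wrapper does, here is a minimal Python sketch. It is not the code of openai_image.py. The helper names build_request, save_b64_png, and generate are hypothetical, and it assumes that the installed openai package passes the gpt-image-2 model name and the 3840x2160 size through unchanged, and that the response carries base64 image data in data[0].b64_json as it does for gpt-image-1.

```python
import base64
import sys
from pathlib import Path


def build_request(prompt, model="gpt-image-2", quality="high", size="3840x2160"):
    """Collect the keyword arguments for client.images.generate().

    The defaults mirror the command-line flags shown above.
    """
    return {"model": model, "prompt": prompt, "quality": quality, "size": size}


def save_b64_png(b64_data, path):
    """Decode a base64 image payload, write it to disk, return the byte count."""
    raw = base64.b64decode(b64_data)
    Path(path).write_bytes(raw)
    return len(raw)


def generate(prompt, out_path="out.png"):
    """Request one image and save it (assumes OPENAI_API_KEY is set)."""
    # Imported lazily so the helpers above work without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    result = client.images.generate(**build_request(prompt))
    # Assumption: the image arrives as base64 in data[0].b64_json.
    return save_b64_png(result.data[0].b64_json, out_path)


if __name__ == "__main__" and len(sys.argv) > 1:
    size_bytes = generate(sys.argv[1])
    print(f"wrote out.png ({size_bytes} bytes)")
```

Keeping the request construction and the file writing in separate functions means both can be checked without making a paid API call.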

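Generating the image used 13,342 output tokens, which at $30 per million tokens comes to about 40 cents. See OpenAI's cookbook for the supported output quality settings and sizes. Nano Banana Pro via AI Studio yields poor results on the same task, which reinforces how much the outcome depends on the model.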
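
The cost figure can be checked with a short calculation. This is a minimal sketch using only the numbers quoted above; the function name output_cost_usd is illustrative, and it ignores any input-token charges.

```python
def output_cost_usd(output_tokens, usd_per_million):
    """Cost of the output tokens alone, in US dollars."""
    return output_tokens * usd_per_million / 1_000_000


cost = output_cost_usd(13_342, 30.0)
print(f"${cost:.4f}")  # prints $0.4003
```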

Hallucination Risks in Self-Analysis

Don't trust generative models to locate elements in their own images: ChatGPT circled a nonexistent raccoon, which shows that it is unreliable for solving these puzzles. The lesson is to separate generation from verification. Use an external model such as Claude for spotting, but expect failures on ambiguous prompts, because instruction cards in the top-left corner can mimic scene elements.