AI Chart Code Generation Scores Halve on Complex Real-Data Benchmark
RealChart2Code benchmark exposes a 'complexity gap': top proprietary LLMs like Claude 4.5 Opus (8.2 score) and Gemini 3 Pro Preview (8.1) lose roughly 50% of their performance versus simpler synthetic benchmarks on 2,800+ real-data chart tasks; open-weight models score under 4.
Benchmark Exposes Complexity Gap in Chart-to-Code Generation
RealChart2Code tests AI models' ability to generate Matplotlib code for complex visualizations from real Kaggle datasets spanning 1,036 sources and 860 million rows, yielding 2,800+ cases across 50 chart types and composite layouts. Unlike prior synthetic-data benchmarks such as Plot2Code and ChartMimic, it reveals a 'complexity gap': models ace simple charts (e.g., Gemini 3 Pro Preview at 96% normalized on ChartMimic) but collapse to ~50% on real-world complexity. Use it to evaluate LLMs for production data-viz pipelines: expect proprietary models to lead but still fail about half the time on multi-panel grids or large datasets.
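To make the task concrete, here is a toy illustration of the kind of output a reproduction-style case expects: runnable Matplotlib code that maps a raw CSV onto a composite layout with correct axes, labels, and series assignments. The file name and columns (`sales_by_region.csv`, `region`, `month`, `revenue`, `units`) are hypothetical stand-ins, not items from the benchmark.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for one Kaggle-style source in a benchmark case.
df = pd.read_csv("sales_by_region.csv")

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: monthly revenue per region as lines.
for region, grp in df.groupby("region"):
    ax_line.plot(grp["month"], grp["revenue"], label=region)
ax_line.set_xlabel("Month")
ax_line.set_ylabel("Revenue")
ax_line.legend()

# Right panel: total units sold per region as bars.
totals = df.groupby("region")["units"].sum()
ax_bar.bar(totals.index, totals.values, color="tab:orange")
ax_bar.set_ylabel("Units sold")

fig.tight_layout()
fig.savefig("reproduction_case.png")
```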
Three tasks measure distinct skills (a minimal task-case sketch follows the list):
- Chart Replication: Image to code only; tests visual parsing (Gemini 3 Pro Preview tops this task at 9.0 across the 8 criteria: chart type, layout, axes, colors, etc.).
- Chart Reproduction: Image plus the raw data CSV; verifies data-to-code alignment, and scores drop further as dataset scale grows.
- Chart Refinement: Iterative dialogue to fix broken code; simulates developer workflows but triggers 'regressive editing', where a fix breaks previously correct sections.
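If you want to run these task variants in your own harness, a minimal sketch of how a case could be represented is below. The field names, scoring-criterion names, and prompt wording are hypothetical, not the official schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ChartTask:
    task_type: str                     # "replication", "reproduction", or "refinement"
    reference_image: str               # path to the ground-truth chart image
    data_csv: Optional[str] = None     # raw data, only for reproduction tasks
    broken_code: Optional[str] = None  # starting script, only for refinement tasks
    criteria: list = field(default_factory=lambda: [
        "chart_type", "layout", "axes", "colors",
        "data_mapping", "legend", "labels", "style",
    ])  # 8 scoring criteria; names here are illustrative

def build_prompt(task: ChartTask) -> str:
    """Assemble the model prompt: reproduction adds the CSV, refinement adds the
    broken script so the model can patch it over successive dialogue turns."""
    parts = [f"Recreate the chart in {task.reference_image} with Matplotlib."]
    if task.task_type == "reproduction":
        parts.append(f"Use the raw data in {task.data_csv}.")
    if task.task_type == "refinement":
        parts.append("Fix only the reported issues in the script below, "
                     "without changing parts that already match the reference.")
        parts.append(task.broken_code or "")
    return "\n".join(parts)
```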
Automated multi-agent scoring (Cohen's kappa of 0.83 against human raters, Fleiss' kappa of 0.82 between agents) flags failures in layout, data mapping, and syntax, making it reliable enough for your own evals.
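If you reuse an automated judge on your own charts, it is worth re-checking agreement against a small human-labeled sample. A minimal sketch using scikit-learn; the verdict arrays below are made-up placeholders.

```python
from sklearn.metrics import cohen_kappa_score

# Per-case verdicts on the same items, e.g. "pass" / "minor_issue" / "fail".
human_labels = ["pass", "fail", "pass", "minor_issue", "fail", "pass"]
agent_labels = ["pass", "fail", "minor_issue", "minor_issue", "fail", "pass"]

kappa = cohen_kappa_score(human_labels, agent_labels)
print(f"Cohen's kappa vs. human raters: {kappa:.2f}")  # the paper reports 0.83
```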
Proprietary Models Outperform but Data Mix-Ups Persist
Among 14 tested models, proprietary leaders shine on basics but falter overall:
| Model | Avg Score (8 criteria) | Chart Replication |
|---|---|---|
| Claude 4.5 Opus | 8.2 | High pass rate |
| Gemini 3 Pro Preview | 8.1 | 9.0 (top score) |
| GPT-5.1 | 5.4 | Mid pass rate |
Open-weight models trail: Qwen3-VL-235B (3.6), Intern-VL-3.5-241B (3.4), and DeepSeek-VL-7B (9.7% pass rate, with more than 90% of its code failing to run). All models plot below the equal-performance diagonal versus prior benchmarks, with open-weight models dropping steepest (Qwen3-VL from 85% to under 25%).
To build robust viz tools, prioritize proprietary models for syntax and layout but verify data assignments: proprietary models rarely hallucinate code, yet they swap series onto the wrong axes or mismatch attributes in 20-30% of cases.
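Since the dominant proprietary failure is mis-binding rather than broken syntax, a cheap post-check is to compare what actually got plotted against the source columns. A sketch assuming the generated code exposes its Axes object; the DataFrame, labels, and the deliberately mis-bound series are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"month": [1, 2, 3],
                   "revenue": [10.0, 12.0, 9.0],
                   "units": [100.0, 90.0, 120.0]})

# Stand-in for model-generated code, with the subtle bug of plotting "units"
# under the "revenue" label.
fig, ax = plt.subplots()
ax.plot(df["month"], df["units"], label="revenue")

def check_series_binding(ax, df, expected):
    """expected maps legend label -> source column; returns which labels match."""
    plotted = {line.get_label(): line.get_ydata() for line in ax.get_lines()}
    results = {}
    for label, column in expected.items():
        ydata = plotted.get(label)
        results[label] = ydata is not None and np.allclose(ydata, df[column].to_numpy())
    return results

print(check_series_binding(ax, df, {"revenue": "revenue"}))  # {'revenue': False}
```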
Failure Modes Block Production Use
Error patterns differ by model type, limiting reliability:
- Open-weight: Hallucinate non-existent libraries/functions (e.g., Qwen3-VL invents Matplotlib params in 20% of cases), cause runtime failures, or botch layouts (overlapping subplots, broken grids); a sandbox sketch for bucketing these errors follows this list.
- Proprietary: Generate runnable code with correct structure but wrong data binding; the visuals look similar, yet axes and colors are misaligned.
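Hallucinated libraries or Matplotlib parameters surface at runtime as ImportError, AttributeError, or TypeError, so a cheap pre-filter is to execute each generated script in a throwaway subprocess and bucket the exception type. A sketch; the error-to-bucket mapping and the `generated_scripts` directory are my own choices, not taken from the benchmark.

```python
import os
import subprocess
import sys
from pathlib import Path

ERROR_BUCKETS = {
    "ModuleNotFoundError": "hallucinated library",
    "ImportError": "hallucinated library",
    "AttributeError": "hallucinated function or attribute",
    "TypeError": "hallucinated or misused parameter",
}

def classify_failure(script: Path, timeout_s: int = 60) -> str:
    """Run the script headless (Agg backend); return 'ok' or an error bucket."""
    env = {**os.environ, "MPLBACKEND": "Agg"}
    try:
        proc = subprocess.run(
            [sys.executable, str(script)],
            env=env, capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    stderr = proc.stderr.strip()
    # The last traceback line starts with the exception class name.
    exc_name = stderr.splitlines()[-1].split(":", 1)[0] if stderr else "unknown"
    return ERROR_BUCKETS.get(exc_name, f"other runtime failure ({exc_name})")

if __name__ == "__main__":
    for path in sorted(Path("generated_scripts").glob("*.py")):  # hypothetical dir
        print(path.name, "->", classify_failure(path))
```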
Refinement worsens these issues: models over-edit, breaking already-correct parts by favoring local consistency over global consistency. Radar charts confirm the leaders dominate all 8 criteria, but the gap widens on data-heavy tasks.
Limitations: Matplotlib-only, and auto-eval misses subtle overlaps and color differences. Download the benchmark from GitHub/Hugging Face to test your workflows, and pair it with projects like PaperBanana (45.8% fidelity via 5 agents plus a Matplotlib fallback, preferred 73% of the time over pure image generation).