AI Chart-Generation Performance Halves on Complex Real-Data Viz

The RealChart2Code benchmark (2,800 cases built from 860M rows of real data) shows top models such as Claude 4.5 Opus scoring 8.2/10 on simple charts but dropping roughly 50% on complex real-data tasks, exposing a 'complexity gap' that synthetic benchmarks miss.

RealChart2Code Exposes 50% Performance Drop on Complex Charts

Use RealChart2Code to test AI models under realistic conditions: it draws from 1,036 Kaggle datasets (860M rows) to build 2,800+ cases spanning 50 chart types and composite layouts, unlike synthetic benchmarks such as Plot2Code or ChartMimic. This exposes the 'complexity gap': models that reach a normalized 96% on ChartMimic plummet to around 50% here, because real data demands precise data assignment, layout handling, and correct library calls. For production use, prioritize models that bridge this gap to avoid rebuilding visualizations from scratch.

Three tasks benchmark end-to-end skills:

  • Chart Replication: generate code from the chart image alone. Gemini 3 Pro Preview leads with a 9.0 score across 8 criteria (type, layout, axes, colors).
  • Chart Reproduction: generate code from the image plus the raw data, testing data-to-visualization fidelity.
  • Chart Refinement: fix broken code through dialog, mimicking developer workflows; models suffer from 'regressive editing', breaking previously correct parts while fixing others (a guardrail for this is sketched below).
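
A minimal guardrail against regressive editing on the tooling side (my sketch, not the paper's harness): only accept a model's edit if it keeps every previously passing check green. `run_checks`, `refine`, and `request_fix` are hypothetical names; `request_fix` stands in for any LLM call.

```python
def run_checks(code: str, checks: dict) -> dict:
    """Execute generated chart code; return pass/fail per named check."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # sandboxing omitted for brevity
    except Exception:
        return {name: False for name in checks}
    return {name: check(namespace) for name, check in checks.items()}

def refine(code: str, checks: dict, request_fix, max_rounds: int = 5) -> str:
    passed = run_checks(code, checks)
    for _ in range(max_rounds):
        failing = [name for name, ok in passed.items() if not ok]
        if not failing:
            break
        candidate = request_fix(code, failing)  # hypothetical LLM call
        new_passed = run_checks(candidate, checks)
        # Reject the edit if it breaks anything that previously passed.
        if all(new_passed[name] for name, ok in passed.items() if ok):
            code, passed = candidate, new_passed
    return code
```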

Automated scoring via a multi-agent system aligns with human judgments (Cohen's Kappa 0.83), evaluating structure, text, and visuals on rendered Matplotlib output.
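
For reference, Cohen's Kappa corrects raw agreement for chance (kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement), and 0.83 falls in the near-perfect band. A minimal sketch of checking judge-human alignment with scikit-learn, using made-up paired grades:

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative paired ratings (hypothetical data): the automated judge's
# grade and a human grade for the same ten charts, on a 0-2 scale.
judge_grades = [2, 2, 1, 0, 2, 1, 1, 2, 0, 2]
human_grades = [2, 2, 1, 0, 2, 1, 2, 2, 0, 2]

# Kappa = 1.0 means perfect agreement, 0.0 means chance level;
# the benchmark reports 0.83 against human raters.
print(f"Cohen's Kappa: {cohen_kappa_score(judge_grades, human_grades):.2f}")
```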

Proprietary Models Outpace Open-Weight by 2x, But All Fail Layout and Data

Claude 4.5 Opus tops the leaderboard with an 8.2 average score, with Gemini 3 Pro Preview close behind at 8.1; GPT-5.1 trails at 5.4. Open-weight models like Qwen3-VL-235B (3.6) and Intern-VL-3.5-241B (3.4) collapse hardest: DeepSeek-VL-7B passes just 9.7% of replication cases, largely due to hallucinated libraries (e.g., fake Matplotlib parameters in 20% of cases) and invalid function calls.
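
Hallucinated calls like these are cheap to catch statically before execution. A minimal sketch (my own, not the benchmark's harness) that flags `plt.<name>` calls not actually present in pyplot; note it misses bad keyword arguments, which Matplotlib mostly forwards via `**kwargs` and only rejects at runtime:

```python
import ast
import matplotlib.pyplot as plt

def hallucinated_pyplot_calls(code: str) -> list[str]:
    """Flag plt.<name>(...) calls whose name does not exist in pyplot."""
    bad = []
    for node in ast.walk(ast.parse(code)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "plt"
                and not hasattr(plt, node.func.attr)):
            bad.append(node.func.attr)
    return bad

# plt.lineplot is a seaborn function, not a pyplot one.
print(hallucinated_pyplot_calls("plt.lineplot(x, y)"))  # ['lineplot']
```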

Proprietary models' errors shift to semantics: syntactically correct code that puts the wrong data series on axes or mismatches attributes. Layout handling fails universally, with overlapping subplots and broken grids dropping pass rates well below those seen on simple benchmarks. Open-weight models fail to execute in over 90% of cases, and refinement rounds further degrade code consistency.

To build reliable AI visualization tools, pair a proprietary model for chart structure with rule-based checks on data assignment and layout, since pure generation fidelity tops out at 45-50%.
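
As one example of such a rule-based check, subplot overlap can be detected straight from rendered figure geometry. A minimal sketch using Matplotlib's axes bounding boxes:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def overlapping_axes(fig) -> list[tuple[int, int]]:
    """Return index pairs of axes whose figure-fraction boxes overlap."""
    boxes = [ax.get_position() for ax in fig.get_axes()]
    return [(i, j)
            for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
            if boxes[i].overlaps(boxes[j])]

# Example: two axes deliberately placed on top of each other.
fig = plt.figure()
fig.add_axes([0.1, 0.1, 0.5, 0.5])  # [left, bottom, width, height]
fig.add_axes([0.3, 0.3, 0.5, 0.5])
print(overlapping_axes(fig))  # [(0, 1)]
```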

Limitations Highlight Path to Robust Viz AI

The benchmark sticks to Matplotlib, and its scoring misses nuances like color shades or minor overlaps; multi-library support is not yet available. Still, it compares favorably with related work like PaperBanana, which reaches only 45.8% fidelity yet wins 73% human preference over source images using 5 agents plus a Matplotlib fallback.

Access the benchmark on GitHub and Hugging Face to fine-tune models; focus training on real-data composites and iterative fixes to close the gap for agentic workflows.
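
A hedged sketch of pulling cases for fine-tuning, assuming the benchmark ships as a Hugging Face `datasets` release; the dataset ID and the `layout` field below are hypothetical placeholders, so check the actual release for the real schema:

```python
from datasets import load_dataset

# "realchart2code/benchmark" and the "layout" field are hypothetical;
# substitute the IDs from the actual GitHub/Hugging Face release.
cases = load_dataset("realchart2code/benchmark", split="test")

# Focus training on the slice where models fail hardest: composite layouts.
composites = cases.filter(lambda ex: ex["layout"] == "composite")
print(f"{len(composites)} composite-layout cases")
```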
