AI Agents Speed Up GPU Kernels 1.81x with Scaffolding
METR's KernelAgent, using o3-mini and other models, achieves a 1.81x average speedup on filtered KernelBench tasks via parallel tree search and heavy test-time compute, at roughly $20 per task, far below the cost of a human engineer for small ML projects.
Benchmark Refinements Unlock Realistic Kernel Optimization Measurement
METR addressed KernelBench's limitations (naive PyTorch baselines, outdated tasks from mid-2010s architectures, and noisy measurements) by filtering out 45 low-quality tasks and adding a new Level 5 with 14 frontier workloads (DeepSeek-V3, Llama 3, Hunyuan Video, and state space models such as Mamba2). The filters targeted issues like low signal-to-noise outputs (values in the -0.01 to 0.01 range), uniform tensors, constant functions (e.g., mean(softmax(x))), and seed-independent computations. They modified tasks to randomize weights or inputs, preventing caching cheats, and switched to triton.testing.do_bench for reliable timings. Level 4 was excluded because its HuggingFace dependencies require parsing whole codebases. The result: 225 tasks across the remaining levels (1-3 plus the new Level 5), all single-H100 inference, with the higher levels emphasizing fusions such as attention and convolutions.
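A minimal sketch of the kind of filter and timing harness described above; this is illustrative rather than METR's code, and looks_seed_independent plus its threshold are assumptions:

```python
# Illustrative sketch, not METR's harness: time a task with
# triton.testing.do_bench and flag tasks whose output barely depends on
# randomized inputs (e.g., constant functions an agent could "solve" by caching).
import torch
from triton.testing import do_bench

def time_ms(fn, *args):
    # Runtime in milliseconds; exact return format varies slightly by Triton version.
    return do_bench(lambda: fn(*args))

def looks_seed_independent(fn, make_inputs, atol=1e-2, trials=4):
    # Heuristic filter: if outputs for freshly randomized inputs are all nearly
    # identical, the task carries almost no real optimization signal.
    outs = [fn(*make_inputs()) for _ in range(trials)]
    return all(torch.allclose(outs[0], o, atol=atol) for o in outs[1:])

# mean(softmax(x)) is always 1/N regardless of x, so this task gets filtered.
def constant_task(x):
    return x.softmax(dim=-1).mean(dim=-1)

make_inputs = lambda: (torch.randn(64, 4096, device="cuda"),)
print(looks_seed_independent(constant_task, make_inputs))  # True -> filter out
print(time_ms(constant_task, *make_inputs()))              # latency in ms
```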
This setup better mirrors the needs of small research teams, where naive PyTorch is common, but it leaves gaps: no multi-GPU communication, fixed tensor shapes, and no training backward passes. "KernelBench initial implementations are naive PyTorch, rather than highly optimized CUDA. This makes KernelBench much easier, more representative of small research groups, and not representative of large scale applications."
KernelAgent's Parallel Search Drives Cost-Effective Speedups
KernelAgent runs a parallel tree search: 8 initial attempts (PyTorch, Triton, or CUDA), then resampling from the best solutions so far, weighted by speedup, up to 300 total attempts per problem (40 for the costlier o1). The agent is given Triton/CUDA documentation and verifies candidates against ground-truth outputs. Taking the best of k attempts across GPT-4o, Claude 3.5 Sonnet, and o1 yields a 1.81x geometric-mean speedup, versus 1.05x under the original KernelBench harness, roughly 15x more speedup over baseline, attributed to scaffolding, prompt tuning, and ~$20/task of compute (versus under $1 originally). o3-mini-high reaches 1.81x on its own; the best across all models is 2.01x, with the gain roughly doubling over six months of model releases.
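The search loop itself is simple to sketch. The following is an illustrative outline rather than METR's released agent; generate_candidate and measure_speedup are hypothetical stand-ins for the LLM call and the verification-plus-timing harness:

```python
# Illustrative outline of speedup-weighted tree search (attempts are shown
# sequentially here; the real agent runs them in parallel).
import random

def generate_candidate(parent_solution=None):
    # Placeholder for an LLM call that writes a PyTorch/Triton/CUDA kernel,
    # optionally conditioned on a previous solution to refine.
    return f"kernel refined from {parent_solution!r}"

def measure_speedup(solution):
    # Placeholder for ground-truth verification plus timing; returns the
    # measured speedup, or 0.0 if the kernel is incorrect.
    return random.uniform(0.0, 3.0)

def kernel_agent(total_attempts=300, n_initial=8):
    # Phase 1: independent initial attempts.
    pool = [(s, measure_speedup(s))
            for s in (generate_candidate() for _ in range(n_initial))]

    # Phase 2: resample a parent from the best-so-far solutions, weighted by
    # measured speedup, and ask the model to improve it.
    for _ in range(total_attempts - n_initial):
        solutions, weights = zip(*[(s, max(spd, 1e-6)) for s, spd in pool])
        parent = random.choices(solutions, weights=weights, k=1)[0]
        child = generate_candidate(parent_solution=parent)
        pool.append((child, measure_speedup(child)))

    return max(pool, key=lambda item: item[1])  # best verified solution

best_solution, best_speedup = kernel_agent()
```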
On a per-cost and per-attempt basis, o3-mini is strongest, with gains following a rough power law out to 300 attempts; the agent cost breaks even once a workload runs about 27 H100-hours. About 80% of winning kernels are Triton or CUDA, which outperform PyTorch rewrites, and Level 5 solutions average around 500 lines of code. Fine-tuning GPT-4o on 1,200 agent solutions nearly matches o3-mini on Level 5 (except the DeepSeek MoE task). The best solutions are released at https://github.com/METR/KernelBenchFiltered.
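A back-of-the-envelope check on that breakeven figure, using an assumed H100 rental price that is not from the source:

```python
# Rough breakeven estimate under assumed prices (illustrative only).
agent_cost_usd = 20.0      # approximate agent spend per task
speedup = 1.81             # geometric-mean speedup
h100_usd_per_hour = 1.65   # assumed H100 rental price, not from the source

saved_fraction = 1.0 - 1.0 / speedup                    # ~45% of GPU time saved
breakeven_hours = agent_cost_usd / (h100_usd_per_hour * saved_fraction)
print(f"breakeven at about {breakeven_hours:.0f} H100-hours")  # ~27 H100-hours
```

Under these assumptions, the GPU-hours saved by the speedup offset the ~$20 agent spend once the original workload would have run roughly 27 H100-hours.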
"Using our KernelAgent, o3-mini-high sped up code by 1.81x, and taking the best of all models per problem using KernelAgent achieved 2.01x speedup, representing a large increase from models released only 3 months earlier."
Tradeoffs: Strong on Niche Tasks, Far from Frontier Expertise
Agents achieve a >2% speedup on 69-95% of tasks depending on the model, and >5% on 93% of tasks when taking the best of all models, with peak speedups near 30x. Against frontier-lab experts the gap remains wide: roughly 4 engineer-weeks of agent effort versus about 5 engineer-years per frontier model, and agents cannot adapt kernels like FlashAttention-2 or match expert novelty (constrained by token limits and an inability to factor in broader context). Torch.compile wrappers lag, with the best-of-k comparison flagging speedups below 1.81x. Economically, model-written kernels beat humans only in niches worth under about $500, projects using hundreds of dollars of compute, not billion-dollar-scale workloads.
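Since torch.compile is a much stronger, nearly free baseline than eager PyTorch, it is worth measuring before crediting an agent kernel with a speedup. A minimal, illustrative comparison (the toy reference function and shapes are assumptions):

```python
# Illustrative baseline check: time an eager PyTorch reference against the
# same function wrapped in torch.compile before evaluating agent kernels.
import torch
from triton.testing import do_bench

def reference(x, w):
    # Toy eager-mode reference in the spirit of a KernelBench task.
    return torch.relu(x @ w).softmax(dim=-1)

compiled = torch.compile(reference)  # cheap-to-obtain stronger baseline

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
compiled(x, w)  # warm up so compilation happens outside the timed region

eager_ms = do_bench(lambda: reference(x, w))
compiled_ms = do_bench(lambda: compiled(x, w))
print(f"torch.compile baseline speedup: {eager_ms / compiled_ms:.2f}x")
```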
Limitations persist: no human baselines, inference-only workloads, a 0.01 numerical tolerance (rather than verified end-to-end training stability), and some flawed tasks remain. The original KernelBench harness elicited only a fraction of model capabilities without scaffolding. "Our results do not imply that current LM agents can automate kernel engineering... kernels produced by our agents are never as novel and sophisticated as the best open source kernels produced by top experts."
Broader Implications for AI R&D Automation Risks
The 1.81-2.01x speedups show agents can automate kernel bottlenecks (savings potentially in the hundreds of millions of dollars per year), accelerating non-frontier ML such as open-source research. METR withheld the agent code over proliferation concerns; what it does share emphasizes evaluation rigor for safety policy. Capabilities are advancing fast, making proper elicitation critical. "We believe our results highlight some challenges of doing evaluations, especially the importance of proper capability elicitation, commensurate to the economic value that models can actually provide."
"Model written kernels could fill the underserved niche of accelerating machine learning projects that use only hundreds of dollars of compute."
Key Takeaways
- Filter benchmarks ruthlessly: Remove low-SNR, cheat-prone tasks (e.g., constants, uniform outputs) to isolate real optimization signal.
- Invest in scaffolding: Parallel tree search + high compute ($20/task) unlocks 15x better elicitation than zero-shot.
- Scale attempts power-law: o3-mini keeps improving out to 300+ attempts; cost breaks even once a naive-PyTorch workload runs ~27 H100-hours.
- Prioritize Triton/CUDA: Agents favor them (80%), outperforming PyTorch for complex fusions.
- Benchmark small-team realities: Naive PyTorch + single GPU reveals niche where AI > humans economically.
- Add frontier tasks: Level 5 (DeepSeek-V3 etc.) stresses agents; fine-tuning closes gaps fast.
- Context matters: 1.81x vs. 1.05x shows eval pitfalls; always compare to baselines like torch.compile.
- Safety via sharing: Measure R&D automation transparently for policy, despite capability risks.
- Humans still win at the frontier: agents trail expert kernels by orders of magnitude in effort and quality.