AI Progress Accelerates: Metrics for Self-Improving R&D

AI software engineering horizons hit 12 hours already, far ahead of 2026 forecasts; 14 metrics track AI R&D automation toward recursive self-improvement.

Capabilities Surge Past Forecasts, Signaling Economic Boom

AI agents now reliably handle 12-hour software tasks per METR benchmarks on Opus 4.6, already halfway to Ajeya Cotra's January forecast of 24-hour horizons by end-2026 and far ahead of that pace. At the current rate, horizons could exceed 100 hours by year-end, potentially dissolving the 'time horizon' concept for week-long work. Cotra notes her timelines were too conservative: agents are unlikely to still struggle with 24-hour tasks after ten more months of progress. This aligns with broader signals of rapid AI advancement spreading through economic activity via a software explosion.
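As a sanity check on that extrapolation, moving from 12-hour to 100-hour horizons by year-end implies roughly three doublings (12 → 24 → 48 → 96 h); a quick calculation using only the figures above:

```python
import math

current_h, target_h = 12, 100
doublings_needed = math.log2(target_h / current_h)
print(f"{doublings_needed:.2f} doublings")  # ~3.06 doublings within the year
```

At METR-style horizon growth, three doublings in under a year would be an unusually fast doubling time of a few months.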

14 Metrics Track AI R&D Automation and Oversight Risks

To detect AI building AI, i.e., AI R&D automation (AIRDA), a prerequisite for recursive self-improvement, measure these 14 indicators:

  1. AI performance on AI R&D tasks.
  2. AI vs. human/human-AI teams on AI R&D.
  3. Oversight red teaming effectiveness.
  4. Misalignment in AIRDA systems.
  5. Efficiency gains on AI R&D tasks.
  6. Staff surveys on AI productivity impact.
  7. AI use in high-stakes decisions.
  8. AI researchers' time allocation.
  9. Oversight meta-effectiveness (e.g., bugs reaching production).
  10. AI goal subversions.
  11. AI researcher headcount and performance.
  12. Compute distribution in AI R&D.
  13. Compute as share of AI R&D spend.
  14. AI system permissions over time.
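A lab tracking these indicators would need a small time-series schema for readings; a minimal sketch, with all class and field names hypothetical (not from any cited framework):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class IndicatorReading:
    indicator_id: int   # 1-14, matching the list above
    measured_on: date
    value: float        # e.g., pass rate on AI R&D benchmark tasks
    source: str         # e.g., "internal eval", "staff survey"

@dataclass
class IndicatorSeries:
    readings: list = field(default_factory=list)

    def add(self, reading: IndicatorReading) -> None:
        self.readings.append(reading)

    def latest(self, indicator_id: int):
        """Most recent reading for one indicator, or None if untracked."""
        matches = [r for r in self.readings if r.indicator_id == indicator_id]
        return max(matches, key=lambda r: r.measured_on) if matches else None

series = IndicatorSeries()
series.add(IndicatorReading(1, date(2026, 1, 1), 0.62, "internal eval"))
series.add(IndicatorReading(1, date(2026, 6, 1), 0.78, "internal eval"))
print(series.latest(1).value)  # 0.78
```

The point of the schema is trend detection: a single reading on indicator 1 means little, but a rising series alongside falling oversight-effectiveness readings (indicators 3 and 9) is the pattern the list is designed to surface.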

Companies should track safety versus capabilities progress, AI's effects on oversight, and the actual extent of AIRDA via proxies such as kernel-writing or model-training tests and staff studies. Governments need confidential aggregate reporting; third parties can estimate from public data (e.g., Epoch and SemiAnalysis compute tracking) and build tools and surveys. Strong oversight requires understanding processes and controlling outputs, to avert rushed, destructive capabilities such as WMDs or mass unemployment.

Edge AI Enables Scalable Real-World Sensing

Indian researchers prototyped city-scale traffic analytics with 1,000+ cameras using NVIDIA Jetson edge GPUs co-located with the cameras for low-latency processing: SAM3 segments frames, and YOLOv8 detects and labels vehicles with BoT-SORT tracking. Edge nodes send compact insights to a central server for traffic hotspot maps, predictions, and federated learning; newly observed vehicle classes trigger fine-tuning on the Jetsons. Simulated on a Raspberry Pi cluster, the design avoids bandwidth bottlenecks for sustainable urban sensing.
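The edge-to-center pattern is the key bandwidth saver: nodes ship per-camera summaries, not video. A minimal sketch of the central hotspot aggregation, with report format and function names hypothetical:

```python
from collections import Counter

# Each edge node sends a compact summary instead of raw frames:
# (camera_id, vehicle_count) per analysis window.
def aggregate_hotspots(edge_reports, top_k=3):
    """Roll per-camera vehicle counts into a city-wide hotspot ranking."""
    totals = Counter()
    for camera_id, vehicle_count in edge_reports:
        totals[camera_id] += vehicle_count
    return totals.most_common(top_k)

reports = [("cam_012", 40), ("cam_007", 85), ("cam_012", 55), ("cam_099", 10)]
print(aggregate_hotspots(reports, top_k=2))  # [('cam_012', 95), ('cam_007', 85)]
```

A few bytes per window per camera replaces a continuous video stream, which is what makes 1,000+ cameras feasible without saturating uplinks.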

For Arctic monitoring, TinyIceNet, a tiny U-Net on a Xilinx ZCU102 FPGA, estimates sea-ice thickness from SAR data at 7 fps and 113.6 mJ/scene (vs. the RTX 4090's 764.8 fps at 228.7 mJ/scene and the Jetson AGX's 47.9 fps at 1218.5 mJ/scene). It was trained on AI4Arctic (~533 files) with PyTorch on an RTX 4090; HLS/DeepEdgeSoC optimization targets satellite deployment, enabling on-device inference without downlinking raw data.
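The throughput and per-scene energy figures together imply average power draw (watts = frames/s × joules/frame), which is the constraint that matters on a satellite. A quick check using only the numbers above:

```python
# Implied average power (W) = fps * (mJ/scene / 1000),
# from the throughput and energy figures reported above.
platforms = {
    "ZCU102 FPGA (TinyIceNet)": (7.0, 113.6),
    "RTX 4090": (764.8, 228.7),
    "Jetson AGX": (47.9, 1218.5),
}
watts = {name: fps * mj / 1000.0 for name, (fps, mj) in platforms.items()}
for name, w in watts.items():
    print(f"{name}: ~{w:.1f} W")
```

The FPGA runs at under a watt, versus roughly 175 W for the RTX 4090 and 58 W for the Jetson AGX: the GPUs are faster per frame-second, but only the FPGA's power envelope fits a small satellite bus.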

Specialized Agents Speed AI Infrastructure

ByteDance and Tsinghua's CUDA Agent, Seed1.6 (23B active / 230B total parameters) fine-tuned on 6K operator samples using 128 H20 GPUs, excels at GPU code via an OpenHands agentic loop: profile the PyTorch implementation, rewrite CUDA kernels, then compile and evaluate in a sandbox until achieving at least a 5% speedup over torch.compile. It handles 128K context and up to 200 turns, and hits 100%/100%/92% on KernelBench levels (beating Claude 4.5 and Gemini 3 Pro by ~40% on Level 3), up from the base model's 74%. This signals compounding: AI optimizes training infrastructure for its successors.
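The profile-rewrite-evaluate loop can be sketched abstractly; this is a hypothetical harness shape, not the paper's actual implementation, and the two inner functions are mocked stand-ins for the sandboxed benchmark and the LLM rewrite step:

```python
import random

def evaluate_candidate(kernel_source):
    """Stand-in for compiling and benchmarking a CUDA kernel in a sandbox.
    Returns runtime in ms; mocked here with random timings."""
    return random.uniform(0.8, 1.2)

def propose_rewrite(kernel_source, feedback):
    """Stand-in for the LLM agent proposing a new kernel from profiler feedback."""
    return kernel_source + f"\n// rewrite attempt, prior timing: {feedback}"

def optimize(baseline_ms, kernel_source, target_speedup=1.05, max_turns=200):
    """Iterate until a candidate beats the torch.compile baseline by >= 5%,
    or the turn budget (200 in the reported setup) runs out."""
    for turn in range(max_turns):
        candidate_ms = evaluate_candidate(kernel_source)
        if baseline_ms / candidate_ms >= target_speedup:
            return kernel_source, turn + 1
        kernel_source = propose_rewrite(kernel_source, f"{candidate_ms:.3f} ms")
    return None, max_turns

kernel, turns = optimize(baseline_ms=1.0, kernel_source="// initial CUDA kernel")
```

The acceptance test (measured speedup over torch.compile, not model self-report) is what makes the loop trustworthy: the agent cannot claim success without a kernel that actually compiles and runs faster.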

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge