AI Reimplements 16K LoC Toolkit in Autonomous Weeks-Long Task

In the MirrorCode benchmark, Claude Opus 4.6 fully reimplemented gotree, a 16,905-line Go bioinformatics toolkit, using only a black-box oracle and end-to-end tests. The task is estimated at 2-17 human weeks, and the results suggest inference scaling unlocks progressively larger projects.

MirrorCode Benchmark: Testing Long-Horizon Autonomous Coding via Reverse Engineering

MirrorCode, co-developed with METR, evaluates AI on reimplementing real CLI programs without access to their source code, using execute-only binaries as black-box oracles alongside detailed specs and documentation. Agents probe program behavior with arbitrary inputs, then build matching functionality in C, Python, or Rust. Evaluation uses hundreds to thousands of end-to-end tests drawn from the original suites, real data, and LLM-generated cases, all requiring exact output matches. To prevent hardcoding, every visible test has a held-out 'dual' variant (e.g., leap-year checks across different years). Targets were manually selected for evaluability, test coverage, and human feasibility: 24 programs spanning Unix utilities, serialization, bioinformatics, and more. The focus here is on choose (931 LoC Rust), cal (984 LoC C), gotree (16,905 LoC Go), and Pkl (61,461 LoC Java/Kotlin). Runs are sandboxed in Docker with no internet or third-party dependencies, using the Inspect ReAct agent scaffold with a text_editor tool and context compaction to sustain 1B-token runs (~$550 each). Human baselines score below 100% because the programs are underspecified without the tests; the AI must likewise generalize rather than hardcode.
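
A minimal sketch of what this oracle-plus-dual setup could look like in Python. The binary paths, the cal-style month/year arguments, and the helper names are illustrative assumptions, not MirrorCode's actual harness:

    import subprocess

    def run(binary: str, args: list[str], stdin: bytes = b"") -> tuple[int, bytes]:
        """Run a CLI program and capture its exit code and exact stdout bytes."""
        result = subprocess.run([binary, *args], input=stdin,
                                capture_output=True, timeout=60)
        return result.returncode, result.stdout

    def oracle_test(args: list[str], stdin: bytes = b"") -> None:
        """A test passes only if the reimplementation byte-matches the oracle."""
        expected = run("./oracle_binary", args, stdin)   # execute-only original
        actual = run("./reimplementation", args, stdin)  # agent's program
        assert actual == expected, f"mismatch on {args!r}"

    # A visible test and its held-out 'dual': same behavior, different surface
    # values, so hardcoding the visible case still fails the dual.
    oracle_test(["2", "2024"])  # visible: February of a leap year
    oracle_test(["2", "2028"])  # held-out dual: a different leap year

Exact byte matching is what makes the oracle a precise specification: any formatting drift, not just logic errors, fails the test.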

The decision to reimplement entire programs forces genuine architectural choices, unlike the short diffs of SWE-bench (mostly under 100 LoC). The tradeoffs: precise specs make success checkable but are rare in greenfield development, and the lack of web access tests pure reasoning while excluding real-world tooling. Targets the models had memorized were excluded via verbatim-reproduction checks, sketched below.
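
One simple form such a verbatim-reproduction check could take (the threshold and helper name are illustrative, not the authors' stated method):

    def verbatim_lines(generated: str, original: str, min_len: int = 20) -> set[str]:
        """Non-trivial lines of generated code that also appear verbatim in the
        original source: a rough signal that the model memorized the target."""
        orig = {ln.strip() for ln in original.splitlines()
                if len(ln.strip()) >= min_len}
        return {ln.strip() for ln in generated.splitlines()
                if len(ln.strip()) >= min_len and ln.strip() in orig}

A target would then be excluded if a model, prompted without source access, reproduces a substantial fraction of the original lines verbatim.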

"We find that AI models can autonomously reimplement complex existing software without access to the original program’s source code, provided there is a detailed, checkable specification."

This quote from the introduction highlights why MirrorCode isolates long-horizon planning and architecture from mere translation, addressing gaps left by short-task benchmarks and by hard-to-evaluate demos like Anthropic's C compiler or Cursor's browser.

Breakthrough Results: Full Reimplementations of Real Software

Claude Opus 4.6 solved choose (648 LoC of Rust, 127 tests), cal (1,157 LoC of Rust, 1,365 tests), and gotree (7,644 LoC of Rust, 2,001 tests), passing 100% of the tests (averaged across Rust, Python, and C runs). Gotree, with 40+ commands for working with phylogenetic trees (format parsing, statistics), is the largest target solved: the AI parsed arguments modularly but duplicated parsing code across commands and overloaded tree fields for metadata, as illustrated in the sketch below. Estimated human time: 2-17 weeks without AI. Pkl went unsolved at 1B tokens (770 tests, ~2-3M output tokens produced), but its score kept scaling with budget; the model initially ignored lazy evaluation, then patched it in.
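
A minimal Python sketch of the two patterns described, modular command dispatch plus a catch-all metadata slot; the command name, fields, and registry are illustrative, not gotree's actual design:

    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class Node:
        name: str = ""
        branch_length: float | None = None
        children: list["Node"] = field(default_factory=list)
        # The smell noted in the report: one catch-all slot reused for bootstrap
        # values, comments, and per-command scratch data instead of typed fields.
        meta: dict[str, object] = field(default_factory=dict)

    # Modular subcommand registry: roughly how a 40+-command CLI stays tractable.
    COMMANDS: dict[str, Callable] = {}

    def command(name: str):
        def register(fn):
            COMMANDS[name] = fn
            return fn
        return register

    @command("nbleaves")
    def count_leaves(tree: Node) -> int:
        """One small, self-contained command: count leaf nodes."""
        if not tree.children:
            return 1
        return sum(count_leaves(c) for c in tree.children)

    tree = Node(children=[Node(name="A"), Node(name="B")])
    print(COMMANDS["nbleaves"](tree))  # -> 2

The duplication the report flags would show up as each command re-rolling its own flag parsing instead of sharing a layer like this registry.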

Performance is negatively correlated with the original codebase's size: older models (Opus 4.0/4.1) solved the small targets (choose, cal), while only Opus 4.6 handled gotree. Other models, in early tests, were comparable or weaker. In the gotree Python runs, newer models progressed faster and persevered past failures, where older ones submitted prematurely.

"We guess this same task would take a human engineer without AI assistance 2–17 weeks."

This estimate, tied to gotree's complexity, quantifies the 'weeks-long' capability claim and is based on early human baselines run under similar constraints.

Opus 4.6's gotree engineering surpassed earlier models: it modularized commands and used maps and pointers for tree structures, though code quality was mixed (duplication, hacks) and likely improvable via prompting. The Pkl struggles show current limits: the model got stuck on eager versus lazy evaluation despite the docs (see the sketch below), but the scaling trend suggests a rewrite could eventually succeed.
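
A toy illustration of this semantic gap in Python; Pkl's real evaluation model is far richer, so this only shows why an eager evaluator breaks late binding:

    class LazyObject:
        """Config object whose properties are thunks, forced on first access
        and cached, rather than computed eagerly at construction."""
        def __init__(self, **thunks):
            self._thunks = thunks
            self._cache = {}

        def __getattr__(self, name):
            if name not in self._thunks:
                raise AttributeError(name)
            if name not in self._cache:
                self._cache[name] = self._thunks[name](self)  # force lazily
            return self._cache[name]

    # `greeting` resolves `name` at force time, not definition time; an eager
    # evaluator would compute every property up front and lose this behavior.
    obj = LazyObject(
        name=lambda self: "world",
        greeting=lambda self: f"hello, {self.name}",
    )
    print(obj.greeting)  # -> hello, world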

Inference Scaling and Model Improvements Unlock Larger Projects

Success scales with tokens: Pkl progressed steadily until the end of its run, and gotree required Opus 4.6's full budget. Newer models moved faster on gotree (Figure 3) and submitted prematurely less often. Perseverance proved key: Opus 4.6 iterated through repeated failures. LoC in the AI's Rust solutions serves as a complexity proxy that normalizes across languages.

"AI performance appears negatively correlated with the size of the original codebase. Smaller codebases, such as cal and choose, were solved by older models, whereas larger codebases, such as gotree, were only solved with recent models."

This observation explains why Opus 4.6 is near-perfect on the sub-gotree-sized programs among the 20+ targets, and it implies token scaling could eventually crack Pkl and extremes like compilers.

Tradeoffs: 1B-token runs are viable but costly, and context compaction is what makes them possible (a sketch follows). The no-dependencies rule forces self-contained code, inflating the AI's LoC relative to the originals.
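
A minimal sketch of transcript compaction (not Inspect's actual API; summarize and count_tokens stand in for a model call and a tokenizer):

    def compact(messages: list[dict], summarize, count_tokens,
                max_tokens: int, keep_recent: int = 20) -> list[dict]:
        """When the transcript outgrows the context budget, replace all but
        the most recent turns with a model-written summary of earlier work."""
        if count_tokens(messages) <= max_tokens or len(messages) <= keep_recent:
            return messages
        head, tail = messages[:-keep_recent], messages[-keep_recent:]
        summary = {"role": "user",
                   "content": "Summary of earlier progress:\n" + summarize(head)}
        return [summary, *tail]

Run before each model call, this lets cumulative usage reach ~1B tokens while every individual request stays inside the context window.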

Plans: a full open-source release (keeping a private test set), more models and programs, and scaled-up runs.

Limitations: Specs, Contamination, and Scope

A precise oracle and test suite are atypical: real development lacks this feedback, shortening effective horizons relative to research-engineering or CUDA benchmarks. Contamination is a risk: memorized targets were excluded (e.g., weak evidence of cal memorization in older models), but subtle influences remain possible; prior benchmarks have generalized despite memorization. The domains are narrow (no networking, databases, or graphics), and real projects run far larger (browsers, operating systems). Whether labs already train on reimplementation-like tasks is unknown; ablations are needed.

"Our evaluation relies on a very particular setup: an existing program that produces the canonical output for a given input, and hence acts as a highly detailed, precise specification... our results do not show that AI could perform any software implementation task."

This caveat clarifies how far the results translate to open-ended software engineering, emphasizing the role of tight feedback.

"It is unclear whether larger projects than gotree could be solved, given enough tokens. Experiments such as Anthropic’s C compiler suggest this may be feasible."

This acknowledges uncertainty about megaprojects while noting supportive evidence.

Key Takeaways

  • Use black-box oracles + comprehensive tests (with duals) to benchmark autonomous reimplementation, forcing full architecture design.
  • Scale inference to 1B+ tokens for complex tasks; newer LLMs (e.g., Opus 4.6) persevere better, progressing faster on large codebases.
  • Target selection matters: Prioritize evaluable CLI programs with good coverage; estimate human time via constrained baselines.
  • Mitigate memorization by verbatim checks and held-out sets; expect generalization per prior benchmarks.
  • For production, combine with prompting to fix AI code smells like duplication; specs accelerate long-horizon tasks.
  • Inference scaling likely solves 60K+ LoC projects like Pkl; test extremes (compilers) next.
  • Real-world: Feedback loops (tests/oracles) extend AI horizons beyond short diffs.
  • Benchmark fully autonomous sandboxes: No web/deps, ReAct scaffolds like Inspect.
