OpenMythos: 770M RDT Matches 1.3B Transformer
OpenMythos reconstructs Claude Mythos as a Recurrent-Depth Transformer (RDT) in PyTorch, reusing looped weights to build reasoning depth. The result matches 1.3B-parameter transformer performance at 770M parameters, half the size, by iterating at inference time.
Recurrent-Depth Transformers Scale Reasoning with Loops, Not Layers
Standard transformers like GPT or Llama stack unique layers with independent weights, so capability is tied directly to parameter count. Recurrent-Depth Transformers (RDTs), also called Looped Transformers, instead reuse a fixed set of weights iteratively over a configurable number of loop steps (T = 16 here) in a single forward pass. This decouples reasoning depth from stored parameters: run more loops at inference for harder problems, exit early for simple ones.
The structure follows Prelude → Recurrent Block → Coda. Prelude and Coda are one-time standard transformer layers. The Recurrent Block updates the hidden state as h_{t+1} = A·h_t + B·e + Transformer(h_t, e), reinjecting the encoded input e at each step to prevent drift. Reasoning stays in continuous latent space, with no mid-loop token emissions, equivalent to chain-of-thought over vectors, per Saunshi et al. (2025). This supports multi-step reasoning natively: a model trained on 5-hop chains handles 10-hop chains at inference by doubling the loop count, unlike fixed-depth transformers.
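A minimal sketch of that structure, assuming toy hyperparameters and using PyTorch's stock nn.TransformerEncoderLayer as a stand-in for the Prelude, Recurrent Block, and Coda. The class and argument names (RecurrentDepthModel, num_loops) are illustrative rather than the OpenMythos API, and the cross-conditioning Transformer(h_t, e) is simplified to additive injection of e:

```python
from typing import Optional
import torch
import torch.nn as nn

class RecurrentDepthModel(nn.Module):
    """Minimal Prelude -> Recurrent Block -> Coda sketch with a shared looped block."""

    def __init__(self, d_model: int = 512, num_loops: int = 16):
        super().__init__()
        self.num_loops = num_loops
        # One-time layers (stand-ins for the full Prelude/Coda stacks).
        self.prelude = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.coda = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Single block whose weights are reused at every loop step.
        self.recurrent = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Linear maps implementing h_{t+1} = A*h_t + B*e + Block(...).
        self.A = nn.Linear(d_model, d_model, bias=False)
        self.B = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, num_loops: Optional[int] = None) -> torch.Tensor:
        T = num_loops or self.num_loops      # reasoning depth, chosen at inference time
        e = self.prelude(x)                  # encode the input once
        h = torch.zeros_like(e)              # initial latent state
        for _ in range(T):
            # Reinject e every step to prevent drift; additive injection is a simplification.
            h = self.A(h) + self.B(e) + self.recurrent(h + e)
        return self.coda(h)                  # decode once after the loop
```

Calling forward with a larger num_loops at inference is what buys extra reasoning depth without adding parameters.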
The FFN uses Mixture-of-Experts (MoE) in the style of DeepSeekMoE: sparse top-K routed experts per token plus always-on shared experts, with the router selecting distinct subsets per loop for varied computation. Attention employs Multi-head Latent Attention (MLA) from DeepSeek-V2, compressing the KV cache into latents for 10–20× memory savings.
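A rough sketch of such an FFN, assuming hypothetical sizes (n_shared, n_routed, top_k) and a dense dispatch that evaluates every expert for clarity; production MoE layers route tokens sparsely, and nothing here should be read as the DeepSeekMoE or OpenMythos implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Sketch of a DeepSeekMoE-style FFN: shared experts plus top-K routed experts."""

    def __init__(self, d_model: int, d_ff: int, n_shared: int = 2,
                 n_routed: int = 8, top_k: int = 2):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); shared experts process every token.
        out = sum(e(x) for e in self.shared)
        # Router scores each token against the routed experts; keep the top-K.
        gates = F.softmax(self.router(x), dim=-1)        # (batch, seq, n_routed)
        weights, idx = gates.topk(self.top_k, dim=-1)    # top-K weights and expert ids
        for k in range(self.top_k):
            for j, expert in enumerate(self.routed):
                mask = (idx[..., k] == j).unsqueeze(-1)  # tokens routed to expert j
                out = out + mask * weights[..., k:k+1] * expert(x)
        return out
```

Per-loop variation comes from the router seeing a different hidden state on each iteration, so the selected expert subset can change from loop to loop without any extra machinery.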
Stability and Adaptive Depth Prevent Explosion or Overthinking
Looped models risk residual explosion (unbounded h_t growth) or overthinking (drifting past a correct solution). OpenMythos enforces Linear Time-Invariant (LTI) constraints from Parcae: the spectral radius ρ(A) < 1 by construction, ensuring stability independent of the learning rate.
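Parcae's exact parameterization isn't spelled out here, so the following is one hedged way to get ρ(A) < 1 by construction: restrict A to a diagonal matrix whose entries are squashed into (-1, 1) with tanh, so no optimizer step can push an eigenvalue past 1. The class name and the diagonal restriction are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ContractiveDiagonal(nn.Module):
    """State-transition matrix A with spectral radius < 1 by construction.

    A is restricted to a diagonal whose entries are squashed into (-1, 1)
    with tanh, so every eigenvalue has magnitude strictly below 1 no matter
    what values the raw parameters take during training.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(d_model))  # unconstrained raw parameters

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.raw)    # diagonal of A, each entry in (-1, 1)
        return h * a                # equivalent to h @ diag(a)
```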
Adaptive Computation Time (ACT) halting uses a learned scalar per position to stop loops dynamically, so harder tokens get more compute. Depth-wise LoRA adapters add low-rank matrices per iteration, differentiating per-loop behavior without fully untying the weights and keeping the parameter count lean.
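A minimal ACT-style sketch of the halting side, assuming a sigmoid halting head and a fixed threshold; the class name and threshold value are illustrative, and the depth-wise LoRA adapters are omitted for brevity:

```python
import torch
import torch.nn as nn

class ACTHaltingLoop(nn.Module):
    """Sketch of ACT-style per-position halting around a shared recurrent step."""

    def __init__(self, d_model: int = 512, max_loops: int = 16, threshold: float = 0.99):
        super().__init__()
        self.step = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.halt = nn.Linear(d_model, 1)   # learned halting scalar per position
        self.max_loops = max_loops
        self.threshold = threshold

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: encoded input from the prelude, shape (batch, seq, d_model)
        h = torch.zeros_like(e)
        out = torch.zeros_like(e)
        cum_halt = torch.zeros(e.shape[:2], device=e.device)  # accumulated halting prob
        remainder = torch.ones_like(cum_halt)                  # probability mass left
        for _ in range(self.max_loops):
            still_running = (cum_halt < self.threshold).float()
            h = self.step(h + e)                                # shared recurrent step
            p = torch.sigmoid(self.halt(h)).squeeze(-1) * still_running
            p = torch.minimum(p, remainder)    # don't overshoot the halting budget
            out = out + p.unsqueeze(-1) * h    # weighted mixture of per-step states
            cum_halt = cum_halt + p
            remainder = remainder - p
            if still_running.sum() == 0:       # every position has halted
                break
        return out + remainder.unsqueeze(-1) * h  # leftover mass goes to the final state
```

Positions whose cumulative halting probability crosses the threshold stop contributing, so easy tokens effectively exit after a few loops while hard ones keep iterating up to max_loops.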
Half the Params for Equivalent Performance Reshapes Scaling
Parcae (Prairie et al., 2026) shows a 770M RDT matching a 1.3B dense transformer trained on identical data, roughly half the parameters for equal performance. Optimal recurrence and token count follow power laws, yielding predictable scaling for looped training. Inference compute via loop depth, not training-time parameter count, becomes the key scaling axis, challenging bigger-is-better assumptions.
OpenMythos delivers PyTorch code for the RDT with MoE, LTI injection, depth-wise LoRA, and dense baselines: a falsifiable hypothesis for testing Claude Mythos and for advancing looped architectures beyond parameter races.