Recurrent-Depth Transformers Scale Reasoning via Inference Loops

Recurrent-Depth Transformers (RDTs), or Looped Transformers, differ from standard transformers by reusing a fixed set of weights iteratively across T loop steps (up to 16 in OpenMythos) in a single forward pass. This decouples reasoning depth from parameter count: deeper reasoning comes from more loops at inference, not more layers or params. The structure follows Prelude → Recurrent Block → Coda, where Prelude and Coda are one-time standard transformer layers.
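A minimal PyTorch sketch of this three-stage layout follows; the module choices, dimensions, zero-initialized hidden state, and argument names are illustrative assumptions, not the OpenMythos implementation.

```python
import torch
import torch.nn as nn

class RecurrentDepthTransformer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, max_loops: int = 16):
        super().__init__()
        # Prelude and Coda run once; the Recurrent Block is reused T times.
        self.prelude = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.recurrent_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.coda = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.max_loops = max_loops

    def forward(self, x: torch.Tensor, num_loops=None) -> torch.Tensor:
        T = num_loops or self.max_loops        # reasoning depth chosen at inference time
        e = self.prelude(x)                    # one-time encoding of the input
        h = torch.zeros_like(e)                # initial hidden state
        for _ in range(T):                     # same tied weights applied T times
            h = self.recurrent_block(h + e)    # simplified update; see the rule below
        return self.coda(h)
```

At inference, a harder input can simply be given a larger `num_loops`; no new layers or parameters are needed.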

In the Recurrent Block, the hidden state is updated as h_{t+1} = A·h_t + B·e + Transformer(h_t, e), with the encoded input e re-injected at every step to prevent drift. This mimics iterative draft refinement: the model reasons continuously in latent space without emitting tokens mid-loop, equivalent to chain-of-thought over vectors, per Saunshi et al. (2025). Unlike standard transformers, which fail on depths not seen in training (e.g., a model trained on 5-hop reasoning fails on 10-hop problems), RDTs extend depth at inference without retraining: harder problems are simply given more loops.
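A minimal sketch of that update rule, assuming A and B are learned linear maps and approximating Transformer(h_t, e) as a transformer layer applied to h_t + e:

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """One looped step: h_{t+1} = A*h_t + B*e + Transformer(h_t, e)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.A = nn.Linear(d_model, d_model, bias=False)   # state transition
        self.B = nn.Linear(d_model, d_model, bias=False)    # input re-injection
        self.core = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def step(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # Re-injecting e every iteration anchors the loop to the input and limits drift.
        return self.A(h) + self.B(e) + self.core(h + e)

    def forward(self, e: torch.Tensor, num_loops: int) -> torch.Tensor:
        h = torch.zeros_like(e)
        for _ in range(num_loops):
            h = self.step(h, e)
        return h
```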

Replace the standard FFN with a Mixture-of-Experts (MoE) layer in the DeepSeekMoE style: sparse top-K routed experts per token plus always-active shared experts, and because routing depends on the loop-dependent hidden state, each iteration performs distinct computation despite tied weights. Use Multi-head Latent Attention (MLA) from DeepSeek-V2, which caches compressed low-rank KV latents for 10–20× KV memory savings.
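A hedged sketch of that routing pattern; the expert count, K, and the dense-for-clarity dispatch are assumptions, and MLA is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Sparse FFN: top-K routed experts plus always-on shared experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8,
                 n_shared: int = 1, top_k: int = 2):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList([make_expert() for _ in range(n_experts)])
        self.shared = nn.ModuleList([make_expert() for _ in range(n_shared)])
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); each token is routed independently, so the same
        # tied experts do different work at each loop as the hidden state changes.
        scores = F.softmax(self.router(x), dim=-1)            # (B, S, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)        # (B, S, K)
        out = sum(s(x) for s in self.shared)                  # shared experts see every token
        # Dense dispatch for readability; a real implementation sends only selected tokens.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out
```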

Stability and Adaptive Depth Prevent Explosion or Overthinking

Looping risks residual explosion (unbounded growth of h_t) or overthinking (drifting past a correct intermediate solution). Enforce the Linear Time-Invariant (LTI) constraint from Parcae: the spectral radius ρ(A) < 1 by construction, ensuring stability independent of the learning rate. Add Adaptive Computation Time (ACT) halting: a learned scalar per position dynamically stops the loop once that position has converged, so harder tokens receive more compute.
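A sketch of both mechanisms, assuming a diagonal parameterization of A (Parcae's exact construction is not given here) and a sigmoid halting head in the ACT style:

```python
import torch
import torch.nn as nn

class StableStateTransition(nn.Module):
    """Diagonal A with entries in (0, 1), so the spectral radius is < 1 by construction."""
    def __init__(self, d_model: int):
        super().__init__()
        self.a_logits = nn.Parameter(torch.zeros(d_model))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.a_logits)   # every eigenvalue of A lies in (0, 1)
        return a * h                       # a contraction: h cannot blow up across loops

class ACTHalting(nn.Module):
    """Per-position halting probability; stop looping once cumulative mass nears 1."""
    def __init__(self, d_model: int, eps: float = 0.01):
        super().__init__()
        self.halt = nn.Linear(d_model, 1)
        self.eps = eps

    def forward(self, h: torch.Tensor, cum_halt: torch.Tensor):
        p = torch.sigmoid(self.halt(h)).squeeze(-1)    # (batch, seq)
        cum_halt = cum_halt + p
        still_running = cum_halt < 1.0 - self.eps      # harder tokens keep looping
        return still_running, cum_halt
```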

Depth-Wise LoRA adapters apply small rank-r matrices at each iteration, differentiating per-step behavior without bloating the parameter count, a middle ground between pure weight tying and fully unique layers.
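A sketch of a per-iteration adapter on a shared linear layer; the rank, initialization, and placement are assumptions.

```python
import torch
import torch.nn as nn

class DepthWiseLoRALinear(nn.Module):
    """Shared weight plus a small rank-r correction B_t @ A_t selected by loop index t."""
    def __init__(self, d_model: int, max_loops: int, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(d_model, d_model)
        self.lora_A = nn.Parameter(torch.randn(max_loops, rank, d_model) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(max_loops, d_model, rank))  # zero-init delta

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        # Same tied weights at every loop; only the tiny per-step adapter differs.
        delta = x @ self.lora_A[t].T @ self.lora_B[t].T
        return self.shared(x) + delta
```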

Half the Params, Equivalent Performance via Predictable Scaling

At 770M parameters, the OpenMythos RDT matches a 1.3B-parameter standard transformer on identical data, consistent with the Parcae (Prairie et al., 2026) scaling laws, in which optimal recurrence depth and token count follow power laws. This shifts the scaling focus from parameters at training time to loops at inference time, challenging the bigger-is-better assumption.

OpenMythos delivers PyTorch code for the RDT with MoE, LTI-constrained training, depth-wise LoRA adapters, and baselines: a falsifiable hypothesis for Claude Mythos, runnable for experimenting with looped dynamics.