The Recurrent-Depth Transformer (RDT) Premise

OpenMythos is a framework for building recurrent-depth transformers (RDTs), which scale compute dynamically at inference time. In a standard feed-forward transformer, depth is fixed when the model is trained; an RDT can instead perform additional computation by looping through the same layers repeatedly. A single fixed-parameter model can therefore trade inference latency for reasoning depth, spending more iterations on harder problems without any additional training.
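To make the looping idea concrete, here is a toy sketch in plain Python. It is not OpenMythos code: the `block` function, its two scalar weights, and the choice to re-inject the input embedding `e` on every iteration are illustrative assumptions. The only point it shows is that the parameters stay fixed while the loop count is chosen at call time.

```python
import math

# Hypothetical shared parameters: fixed once, reused on every loop iteration.
W_H = 0.5  # weight on the previous hidden state (assumed, for illustration)
W_E = 0.5  # weight on the re-injected input (assumed, for illustration)

def block(h, e):
    """One pass through the shared 'layer': mix hidden state h with input e."""
    return [math.tanh(W_H * hi + W_E * ei) for hi, ei in zip(h, e)]

def recurrent_depth_forward(e, n_loops):
    """Apply the same block n_loops times; more loops means more compute."""
    h = [0.0] * len(e)          # initial hidden state
    for _ in range(n_loops):
        h = block(h, e)         # same weights every time, no new parameters
    return h

# The same model can be run shallow or deep without retraining.
shallow = recurrent_depth_forward([1.0, -2.0], n_loops=1)
deep = recurrent_depth_forward([1.0, -2.0], n_loops=8)
```

In a real RDT the block would be a stack of transformer layers rather than a scalar `tanh` update, but the control flow is the same: `n_loops` is an inference-time argument, not a property of the weights.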

Practical Implementation and Architecture

The OpenMythos library supports modern architectural components, including:

  • Attention Variants: Both Multi-head Latent Attention (MLA), the compressed-KV-cache design used in DeepSeek-V2, and Grouped-Query Attention (GQA) are supported.
  • Sparse MoE: Integrates Mixture-of-Experts components with shared experts to maintain parameter efficiency.
  • Stability Monitoring: The framework provides tools to calculate the spectral radius (ρ) of the recurrent injection matrix. Keeping ρ below 1 is critical to the stability of the recurrent loop during both training and inference.
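The ρ < 1 condition can be checked numerically. The sketch below does not use the OpenMythos API (the library's own function names are not given here). It is a standalone estimate based on Gelfand's formula, ρ(A) = lim ‖A^k‖^(1/k), computed by repeated squaring with renormalisation so the powers neither overflow nor underflow. The function names and the small example matrices are assumptions chosen for illustration. The intuition is that for a linear update h ← A·h + B·e, the contribution of earlier states shrinks over the loops exactly when ρ(A) < 1.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def frobenius(a):
    """Frobenius norm of a matrix."""
    return math.sqrt(sum(x * x for row in a for x in row))

def spectral_radius(a, squarings=20):
    """Estimate rho(A) as ||A^k||^(1/k) with k = 2**squarings."""
    m = [row[:] for row in a]
    log_scale = 0.0  # invariant: A^k == exp(log_scale) * m
    for _ in range(squarings):
        norm = frobenius(m)
        if norm == 0.0:          # nilpotent matrix: all eigenvalues are 0
            return 0.0
        m = [[x / norm for x in row] for row in m]
        log_scale += math.log(norm)
        m = matmul(m, m)         # k doubles, so the log-scale doubles too
        log_scale *= 2.0
    norm = frobenius(m)
    if norm == 0.0:
        return 0.0
    return math.exp((log_scale + math.log(norm)) / 2 ** squarings)

def is_stable(a):
    """True when the estimated spectral radius is below 1."""
    return spectral_radius(a) < 1.0

# Hypothetical injection matrices, for illustration only.
contracting = [[0.5, 0.0], [0.0, 0.9]]   # eigenvalues 0.5 and 0.9
expanding = [[1.1, 1.0], [0.0, 0.2]]     # eigenvalues 1.1 and 0.2
print(is_stable(contracting), is_stable(expanding))
```

With NumPy available, `max(abs(numpy.linalg.eigvals(A)))` gives the same quantity more directly; the pure-Python version is used here only to keep the sketch dependency-free.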

Evaluating Loop-Scaled Reasoning

The author demonstrates the RDT's loop-scaling capability on a synthetic compositional reasoning task: predicting the sum of a chain of digits modulo 7.
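A data generator for a task of this shape might look like the following. This is a sketch under assumptions: only the modulus (7) and the 'EQ' token come from the description; the token layout (digits followed by 'EQ'), the chain lengths, and the function names are hypothetical.

```python
import random

MODULUS = 7
EQ = "EQ"  # separator token named in the text; the surrounding layout is assumed

def make_example(chain_len, rng):
    """Return (tokens, answer): a chain of digits, then EQ, and its sum mod 7."""
    digits = [rng.randrange(10) for _ in range(chain_len)]
    tokens = [str(d) for d in digits] + [EQ]
    return tokens, sum(digits) % MODULUS

def make_split(n_examples, chain_len, seed):
    """Build a reproducible list of examples with a fixed chain length."""
    rng = random.Random(seed)
    return [make_example(chain_len, rng) for _ in range(n_examples)]

# Assumed lengths: short chains for training, longer chains as the OOD test set.
train_set = make_split(100, chain_len=4, seed=0)
ood_set = make_split(50, chain_len=8, seed=1)
```

Keeping the chain length as an explicit argument makes it easy to build evaluation sets that are strictly longer than anything seen in training.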

  • Training Strategy: The model is trained with a fixed number of recurrent loops (4) using AdamW and a cosine learning rate schedule. Loss is calculated specifically at the position following the 'EQ' token.
  • Inference Scaling: Once trained, the model's reasoning depth is tested by varying the number of loops (1, 2, 4, 6, 8) at inference time.
  • Results: Increasing the loop count lets the model maintain or improve accuracy on out-of-distribution (OOD) inputs with longer digit chains than those seen in training. The results indicate that the recurrent mechanism lets the model reason more deeply by reusing its existing weights rather than relying on a fixed-depth architecture.
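Two pieces of this protocol are easy to sketch in isolation: a loss computed at a single answer position, and an evaluation sweep over loop counts. The code below is illustrative, not the author's implementation. `answer_pos` is whichever index the caller treats as the answer position, and `predict(tokens, n_loops)` is a hypothetical interface standing in for a trained model. The loop counts are the ones listed above.

```python
import math

def answer_position_loss(logits, answer_pos, target):
    """Cross-entropy at one position; every other position is ignored.

    logits: list of per-position score lists (one score per class).
    """
    row = logits[answer_pos]
    peak = max(row)  # subtract the max for a numerically stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return log_z - row[target]

def accuracy_by_loops(predict, examples, loop_counts=(1, 2, 4, 6, 8)):
    """Evaluate the same model at several inference-time loop counts.

    predict(tokens, n_loops) -> predicted residue (hypothetical interface).
    examples: list of (tokens, answer) pairs.
    """
    results = {}
    for n_loops in loop_counts:
        correct = sum(predict(tokens, n_loops) == answer
                      for tokens, answer in examples)
        results[n_loops] = correct / len(examples)
    return results
```

A usage pattern would be to call `accuracy_by_loops` once on an in-distribution set and once on a longer-chain set, then compare the two dictionaries per loop count.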