Open Mythos RDT Reuses Layers for Deeper Reasoning
Recurrent Depth Transformer (RDT) loops a small set of layers up to 16 times with shared weights, matching 1.3B-parameter transformers with just 770M parameters via hidden latent reasoning.
Recurrent Depth Transformer Enables Efficient Deep Thinking
Recurrent Depth Transformer (RDT) in Open Mythos, Kai Gomez's open-source PyTorch implementation of a hypothesized Anthropic "Claude Mythos" architecture, reuses a compact set of layers instead of stacking billions of parameters. The structure has three parts: a prelude that encodes the input once, a core recurrent block that loops up to 16 times, and a coda that produces the output. Each loop updates the hidden state via h_t = A·h_{t-1} + B·x + C·transformer(h_{t-1}, x), blending the prior state, the original input, and new computation to prevent drift. This yields deeper inference-time thinking without a deeper architecture at training time.
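A minimal PyTorch sketch of that prelude / looped-core / coda structure is below; the module names, the scalar gates A, B, C, and the way the input conditions the core (h + x) are illustrative assumptions, not Open Mythos' actual code.

```python
import torch
import torch.nn as nn

class RecurrentDepthSketch(nn.Module):
    """Minimal sketch of an RDT-style model: encode once, refine a latent state
    in a loop with shared weights, decode once. Shapes and names are illustrative."""

    def __init__(self, d_model=512, n_heads=8, max_loops=16):
        super().__init__()
        self.max_loops = max_loops
        self.prelude = nn.Linear(d_model, d_model)  # encode the input once
        self.core = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.coda = nn.Linear(d_model, d_model)     # read out the final latent state
        # learned scalar gates blending prior state, original input, new computation
        self.A = nn.Parameter(torch.tensor(0.9))
        self.B = nn.Parameter(torch.tensor(0.1))
        self.C = nn.Parameter(torch.tensor(0.1))

    def forward(self, embedded, n_loops=None):
        n_loops = n_loops or self.max_loops
        x = self.prelude(embedded)       # x: encoded input, re-injected on every loop
        h = torch.zeros_like(x)          # hidden latent state
        for _ in range(n_loops):
            # h_t = A*h_{t-1} + B*x + C*transformer(h_{t-1}, x)
            h = self.A * h + self.B * x + self.C * self.core(h + x)
        return self.coda(h)
```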
A Mixture of Experts (MoE) layer with 384 total experts activates only 8 per input, and the active set can vary per loop for fresh insights, countering inefficiency critiques. All reasoning occurs in continuous latent space without intermediate tokens or a visible chain of thought, enabling parallel reasoning paths akin to breadth-first search. A 770M-parameter RDT matches the performance of a 1.3B standard transformer trained on the same data, roughly halving the parameter count and challenging scale-by-size assumptions.
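The expert counts from the text can be sketched as a top-k router that is re-scored on every loop. In this sketch the router, the per-loop embedding, and the single-linear experts are assumptions for illustration (real experts would be gated MLPs), not the Open Mythos implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoopAwareMoE(nn.Module):
    """Minimal sketch: 384 experts, top-8 routing, with a per-loop embedding so
    the active expert set can change on each recurrent iteration."""

    def __init__(self, d_model=512, n_experts=384, k=8, max_loops=16):
        super().__init__()
        self.k = k
        # single linear experts keep the sketch light; real experts are gated MLPs
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.loop_embed = nn.Embedding(max_loops, d_model)  # lets routing vary per loop

    def forward(self, h, loop_idx):
        # condition the router on which loop iteration we are in
        gate_in = h + self.loop_embed(torch.tensor(loop_idx, device=h.device))
        scores = self.router(gate_in)                      # [batch, seq, n_experts]
        weights, idx = torch.topk(scores, self.k, dim=-1)  # keep only 8 of 384 experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(h)
        for slot in range(self.k):                         # dense loop for clarity, not speed
            expert_ids = idx[..., slot]                    # [batch, seq]
            for e in expert_ids.unique().tolist():
                mask = expert_ids == e
                out[mask] += weights[..., slot][mask].unsqueeze(-1) * self.experts[e](h[mask])
        return out
```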
Stability and Adaptive Fixes Unlock Scalable Loops
Recurrent loops risk exploding hidden states or overthinking into noise. Open Mythos stabilizes them with a linear time-invariant injection from the Park K paper, bounding growth for arbitrary loop counts. Adaptive computation time halts loops early per token for simple inputs and dynamically allocates more loops to complex ones. Depth-wise LoRA adapters tweak behavior per iteration despite the shared weights. Multi-latent attention, inspired by DeepSeek, compresses key-value pairs to 1/10 to 1/20 of the usual memory footprint.
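One way to realize the per-token halting described above is an ACT-style head that accumulates a stopping probability each loop. The sigmoid halting head, the 0.99 budget, and the generic core callable below are assumptions following the classic adaptive-computation-time recipe, not Open Mythos' exact mechanism.

```python
import torch
import torch.nn as nn

class AdaptiveHalting(nn.Module):
    """Sketch of per-token adaptive computation time: easy tokens stop looping
    early, hard tokens keep refining up to max_loops."""

    def __init__(self, d_model=512, max_loops=16, threshold=0.99):
        super().__init__()
        self.max_loops = max_loops
        self.threshold = threshold
        self.halt_head = nn.Linear(d_model, 1)  # per-token halting probability

    def forward(self, h, x, core):
        # core is any callable implementing one recurrent refinement step, e.g.
        # lambda h, x: A * h + B * x + C * transformer(h + x)
        halted = torch.zeros(h.shape[:-1], dtype=torch.bool, device=h.device)
        cumulative = torch.zeros(h.shape[:-1], device=h.device)
        for _ in range(self.max_loops):
            new_h = core(h, x)
            p_halt = torch.sigmoid(self.halt_head(new_h)).squeeze(-1)
            cumulative = cumulative + p_halt * (~halted)     # only grow unhalted tokens
            h = torch.where(halted.unsqueeze(-1), h, new_h)  # halted tokens keep old state
            halted = halted | (cumulative >= self.threshold)
            if halted.all():                                 # every token is done: stop early
                break
        return h
```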
These fixes yield dynamic reasoning: a model trained on 20-step chains can extrapolate to 30-step chains by running extra loops, where standard transformers fail. RDT excels at systematic generalization, handling knowledge combinations never seen in training that stump dense models, because the bottleneck shifts from stored knowledge to effective combination.
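Reusing the RecurrentDepthSketch defined earlier, depth extrapolation amounts to requesting more loops at inference time than were used in training; this is a hypothetical usage, with random tensors standing in for already-embedded input.

```python
import torch

# Hypothetical usage of the RecurrentDepthSketch defined above.
model = RecurrentDepthSketch(d_model=512, max_loops=16)
tokens = torch.randn(1, 20, 512)          # placeholder for embedded input

train_depth = model(tokens)               # 16 loops, the training regime
extrapolated = model(tokens, n_loops=24)  # extra loops = extra inference-time "thinking"
```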
Evidence from Benchmarks and Production Parallels
Open Mythos demonstrates depth extrapolation and generalization gains. Production echoes include Moonshot AI's 1T-parameter Kimiko 2.6: 384 MoE experts (with only a subset active), multi-head latent attention, SwiGLU activations, and a 400M-parameter vision encoder for multimodality. It spawns 300 parallel agents for workflows and claw groups for human-AI hybrid tasks, scoring 54 on the full HLE (2,500 doctorate-level questions across 100+ fields) versus Claude Opus 4.6's 53 and GPT 5.4's 52.1.
The trend favors inference-time depth over parameter bloat: efficiency via MoE, modularity, and parallelism. xAI's Grok APIs add voice with STT (25 languages, 5% phone-entity error rate vs. competitors' 12-21%) and TTS (5 voices, 20 languages, $4.20/M characters), production-tested in Tesla and Starlink.