Parcae Stabilizes Loops to Match 2x Transformer Quality
Parcae stabilizes looped transformers by casting the loop as a dynamical system with a negative-diagonal state matrix, outperforming looped baselines and recovering 98.5% of a twice-sized Transformer's quality with roughly half the parameters.
Designing Stable Looped Architectures
Looped transformers route activations through a fixed block of layers T times, scaling compute per token without adding parameters, which makes them well suited to memory-constrained edge deployment. Parcae uses a middle-looped structure: a prelude (P) embeds the input into a latent e; a recurrent block (R) updates the hidden state h_t over T loops, with e re-injected at each iteration; a coda (C) produces the output from the final h_T. Prior looped models such as RDMs fail here: their unconstrained dynamics let the residual state explode, causing loss spikes.
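A minimal sketch of this P/R/C wiring in PyTorch, with a generic nn.TransformerEncoderLayer standing in for each block. The class name MiddleLoopedLM, the sizes, and the additive re-injection of e are illustrative assumptions, not Parcae's exact design; Parcae replaces the naive loop update with the stabilized one derived below.

```python
import torch
import torch.nn as nn

class MiddleLoopedLM(nn.Module):
    """Prelude -> (recurrent block x T) -> coda, sharing R's weights across loops."""

    def __init__(self, vocab=32000, d=512, heads=8, T=8):
        super().__init__()
        self.T = T
        self.embed = nn.Embedding(vocab, d)
        block = lambda: nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.prelude = block()    # P: embeds input to latent e
        self.recurrent = block()  # R: one set of weights, applied T times
        self.coda = block()       # C: maps final h_T to the output
        self.head = nn.Linear(d, vocab)

    def forward(self, tokens, T=None):
        e = self.prelude(self.embed(tokens))  # latent e, injected every iteration
        h = torch.zeros_like(e)               # initial hidden state h_0
        for _ in range(T or self.T):          # same block looped T times:
            h = self.recurrent(h + e)         # compute grows, parameters do not
        return self.head(self.coda(h))
```

Because T is a runtime argument rather than a weight shape, the same checkpoint can trade latency for quality at inference time.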
Model the loop as a nonlinear dynamical system: h_{t+1} = Ā h_t + B̄ e + R̄(h_t, e). Stability requires spectral radius ρ(Ā) < 1. Parcae discretizes a continuous system using zero-order hold for the state and Euler integration for the input, with a learned step Δ: Ā = exp(ΔA), B̄ = ΔB. Constraining A to be diagonal with strictly negative entries, A = -Diag(exp(A_log)) for a learned parameter A_log, puts every entry of Ā = exp(ΔA) in (0, 1), guaranteeing ρ(Ā) < 1 by design with no hyperparameter tuning needed for convergence. This fixes the flaws of prior schemes: addition-based injection is only marginally stable (ρ(Ā) = 1) and concatenation-plus-projection is unstable (ρ(Ā) > 1).
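A minimal sketch of that stabilized update, assuming per-channel parameters A_log and a softplus-positive step Δ (the names StableLoopCell, A_log, and log_dt are ours); a small MLP stands in for the nonlinear residual term R̄.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StableLoopCell(nn.Module):
    """One loop step: h_{t+1} = A_bar * h_t + B_bar(e) + R(h_t, e)."""

    def __init__(self, d=512):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(d))   # A = -exp(A_log): strictly negative diagonal
        self.log_dt = nn.Parameter(torch.zeros(d))  # learned step size Delta
        self.B = nn.Linear(d, d, bias=False)        # input map, Euler-discretized
        self.R = nn.Sequential(                     # stand-in for the residual block R(h, e)
            nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, h, e):
        dt = F.softplus(self.log_dt)       # Delta > 0
        A = -torch.exp(self.A_log)         # diagonal entries of A, all < 0
        A_bar = torch.exp(dt * A)          # zero-order hold: each entry in (0, 1)
        B_bar_e = dt * self.B(e)           # Euler: B_bar = Delta * B
        return A_bar * h + B_bar_e + self.R(torch.cat([h, e], dim=-1))
```

Since dt > 0 and every diagonal entry of A is negative, exp(dt * A) lies elementwise in (0, 1) for any parameter values, so the linear part of the recurrence contracts no matter how training moves the weights.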
Beating Baselines with Parameter Efficiency
On Huginn, the 350M Parcae cuts validation perplexity 6.3% versus RDMs (10.76 to 10.09), cuts WikiText perplexity 9.1%, and gains 1.8 downstream accuracy points. At 100M, the perplexity gain is 4.5% (14.23 to 13.59). On FineWeb-Edu (104B tokens, nanochat setup), the 1.3B Parcae scores 2.99 points higher on Core and 1.18 points higher on Core-Extended than a parameter-matched Transformer. Critically, the 770M Parcae reaches 25.07 Core, nearly matching the 1.3B Transformer's 25.45 and delivering up to 98.5% of the twice-sized Transformer's quality.
Looping also adds an orthogonal scaling axis: isoFLOP tests at 140M and 370M show that looped Parcae at its optimal mean recurrence μ_rec beats the fixed-depth setting (μ_rec = 1) by 1.2 to 2.0 Core points under matched parameters and FLOPs.
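To make "matched parameters and FLOPs" concrete, here is a back-of-the-envelope accounting sketch using the common ≈6 · (active parameters) · tokens estimate of training FLOPs; the budget and the P/R/C parameter split are invented numbers, not the paper's.

```python
def train_flops(p_prelude, p_recur, p_coda, mu_rec, tokens):
    """Looping multiplies only the recurrent block's compute by mu_rec,
    while the parameter count stays fixed."""
    active = p_prelude + mu_rec * p_recur + p_coda
    return 6 * active * tokens

BUDGET = 1e20                  # hypothetical FLOP budget
P, R, C = 20e6, 100e6, 20e6    # hypothetical split of a ~140M model

for mu in (1, 2, 4, 8):
    # tokens the same budget buys at each recurrence depth
    tokens = BUDGET / (6 * (P + mu * R + C))
    print(f"mu_rec={mu}: {tokens / 1e9:.1f}B tokens, identical params and FLOPs")
```

An isoFLOP sweep then varies μ_rec along this trade-off and picks the depth that minimizes loss at the fixed budget.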
First Scaling Laws for Recurrence Depth
The optimal mean recurrence scales as μ_rec ∝ C^{0.40} and the optimal number of training tokens as C^{0.78}, where C is the FLOP budget; both fits hold across scales. Raising the test-time loop count T beyond the training setting saturates as L(T) = L_∞ + Z e^{-zT}, plateauing near the training μ_rec, which sets a ceiling on extrapolation. This parametric law predicts held-out loss with 0.85 to 1.31% error, enabling reliable planning: train with deeper loops for compute-optimal quality without memory bloat.
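A minimal sketch of fitting and using the saturation law with scipy; the loss values below are synthetic stand-ins, not measurements from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturation(T, L_inf, Z, z):
    """Test-time loss law L(T) = L_inf + Z * exp(-z * T)."""
    return L_inf + Z * np.exp(-z * T)

T_obs = np.array([1, 2, 4, 8, 16, 32])                  # loop counts evaluated
L_obs = np.array([3.20, 2.98, 2.77, 2.67, 2.65, 2.65])  # synthetic losses

(L_inf, Z, z), _ = curve_fit(saturation, T_obs, L_obs, p0=(2.6, 1.0, 0.5))
print(f"asymptote L_inf = {L_inf:.3f}")
print(f"extrapolated L(64) = {saturation(64, L_inf, Z, z):.3f}")  # ~= L_inf: a plateau
```

The fitted L_∞ is the extrapolation ceiling: looping longer than the training μ_rec buys exponentially diminishing loss, which is what makes the law usable for compute planning.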