TwELL Format Enables Zero-Overhead Sparsity on GPUs

Feedforward layers account for roughly two thirds of LLM parameters and over 80% of FLOPs, yet with ReLU activations 99%+ of hidden neurons are exactly zero, so most of that work could in principle be skipped. Standard ELLPACK sparsity fails to exploit this in batched GEMM (training and high-throughput inference): on Tensor Core-optimized GPUs, the dense-to-sparse conversion overhead matches or exceeds the savings.

TwELL fixes this with tile-wise packing: partition the gate-activation columns into horizontal tiles whose width matches the matmul kernel's tile size T_n (e.g., the CTA dimensions). Within each tile, non-zeros and their local indices are packed ELL-style directly in the gate projection's epilogue—no extra kernel, global memory read, or synchronization. A compression factor C is chosen so that T_n/C exceeds the maximum non-zeros per tile; values and indices are stored as a single 32-bit matrix for locality.
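The per-tile packing can be illustrated with a minimal NumPy sketch. This is a host-side model of the format, not the actual epilogue kernel; the function name and the `-1` padding convention are assumptions for illustration.

```python
import numpy as np

def pack_tilewise_ell(gate_acts, tile_size, compress):
    """Sketch of tile-wise ELL packing for one row of post-ReLU gate activations.

    gate_acts: 1-D array of length d_ff (mostly exact zeros after ReLU).
    tile_size: T_n, the matmul kernel's tile width.
    compress:  C, so each tile stores at most tile_size // compress non-zeros.
    Returns (values, indices) of shape (n_tiles, tile_size // compress);
    indices are local (0..tile_size-1) within each tile, padded with -1.
    """
    n_tiles = gate_acts.size // tile_size
    slots = tile_size // compress
    values = np.zeros((n_tiles, slots), dtype=gate_acts.dtype)
    indices = np.full((n_tiles, slots), -1, dtype=np.int32)
    for t in range(n_tiles):
        tile = gate_acts[t * tile_size:(t + 1) * tile_size]
        nz = np.flatnonzero(tile)
        # C must be chosen so every tile's non-zeros fit its slots
        assert nz.size <= slots, "compression factor C too small for this tile"
        values[t, :nz.size] = tile[nz]
        indices[t, :nz.size] = nz
    return values, indices
```

In the real kernel this happens per CTA tile in registers during the gate projection's epilogue, so the packed layout is produced for free.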

Inference fuses the up and down projections into one kernel per input row: CTAs iterate over each tile's non-zeros, loading only the corresponding W_u columns and W_d rows for the dot products. The hidden state h_u never leaves registers, slashing DRAM traffic. Training uses a hybrid format: rows with few non-zeros (below a threshold) are routed to a compact ELL buffer, and overflow rows fall back to a dense backup, which handles non-uniform sparsity (max non-zeros per row far exceeds the average).
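The fused inference path above can be sketched in NumPy for a single token: only the W_u columns and W_d rows of non-zero gate lanes are ever touched, mirroring the kernel's DRAM reads. The function name and the tile-wise (values, indices) layout follow the packing sketch conventions and are illustrative, not the source's API.

```python
import numpy as np

def fused_sparse_mlp_row(x, vals, idx, W_u, W_d, tile_size):
    """Fused up/down projection for one token, skipping zero gate lanes (sketch).

    x:    (d_model,) input row.
    vals, idx: tile-wise packed gate activations (local indices, -1 padding).
    W_u:  (d_model, d_ff) up projection; W_d: (d_ff, d_model) down projection.
    """
    out = np.zeros(W_d.shape[1], dtype=x.dtype)
    for t in range(vals.shape[0]):
        for g, local_j in zip(vals[t], idx[t]):
            if local_j < 0:
                break                      # padding: rest of tile is empty
            j = t * tile_size + local_j    # recover global column index
            h = g * (x @ W_u[:, j])        # hidden value stays "in registers"
            out += h * W_d[j]              # accumulate one W_d row
    return out
```

The reference computation is `(gate * (x @ W_u)) @ W_d`; the sparse path reproduces it while reading only the active columns/rows.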

The approach supports gated MLPs (Llama/Qwen) as well as non-gated Transformers (11.2% inference speedup at L1 = 2e-5).

Induce Sparsity with ReLU + L1—No Hyperparam Tweaks

Replace SiLU with ReLU in the gate so negative pre-activations become exact zeros. Add an L1 penalty on the hidden activations (post-up projection, pre-down projection): L1 = 2×10⁻⁵ × mean(|h|) over tokens, dims, and layers, added to the CE loss.
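A minimal sketch of both changes, assuming a standard gated MLP; function names are illustrative:

```python
import numpy as np

def gated_mlp_hidden(x, W_g, W_u):
    """ReLU-gated hidden state: exact zeros wherever the gate is negative."""
    gate = np.maximum(x @ W_g, 0.0)   # ReLU in place of SiLU -> exact zeros
    return gate * (x @ W_u)           # hidden activations h (pre-down proj.)

def l1_activation_loss(h, coeff=2e-5):
    """L1 regularizer on hidden activations, as described in the text.

    h: (n_tokens, d_ff) hidden activations for one layer. Mean absolute
    activation scaled by the L1 coefficient; during training this term is
    computed per layer and added to the cross-entropy loss.
    """
    return coeff * np.abs(h).mean()
```

Because ReLU (unlike SiLU) outputs exact zeros rather than small negative values, downstream kernels can skip lanes outright instead of thresholding.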

Sparsity stabilizes within ~1000 steps (~1B tokens). At L1 = 2e-5, non-zero activations per layer drop from 911 to 29 (99.5% sparse) in a 1.5B model (d_ff = 5632); over 30% of neurons die permanently, yet accuracy holds (46.4% → 46.2% on tasks). Across 8 tested L1 values up to 3e-5, relative CE rises by less than 2% with no drop in task accuracy (ARC, HellaSwag, etc.).

Dead neurons are mitigated by reinitializing their gate weights: +19.1% speedup vs. +17.9% baseline, at the same sparsity and accuracy. Training uses fineweb-edu (10-40B tokens, Chinchilla-optimal), context 2048, batch 1M tokens—no changes to LR, optimizer, or weight decay.
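The reinitialization step might look like the following sketch: redraw the gate-projection columns of neurons that never fired over some window. The firing-count bookkeeping and the init scale `std` are assumptions, not taken from the source.

```python
import numpy as np

def reinit_dead_gates(W_g, fire_counts, std=0.02, seed=0):
    """Reinitialize gate columns of dead neurons (hypothetical sketch).

    W_g:         (d_model, d_ff) gate projection weights.
    fire_counts: per-neuron count of non-zero activations over a window;
                 a count of 0 marks a dead neuron.
    std:         assumed init scale (not specified in the source).
    """
    rng = np.random.default_rng(seed)
    dead = fire_counts == 0
    W_g = W_g.copy()
    # Redraw only the dead columns; live neurons keep their trained weights.
    W_g[:, dead] = rng.normal(0.0, std, size=(W_g.shape[0], int(dead.sum())))
    return W_g
```

Redrawing the gate column gives a dead neuron a fresh chance to fire without disturbing neurons the L1 penalty has already shaped.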

Speedups Grow with Scale; Patterns Favor Early Layers

On 8x H100 PCIe (seq=2048):

Model   Inf Speedup   Train Throughput   Peak Mem Δ   Energy/tok Δ   Accuracy Δ
0.5B    +17.0%        -1.5%              -19.2%       -11.8%         40.4→40.4%
1B      +18.1%        +7.1%              -25.5%       -14.6%         44.6→44.7%
1.5B    +18.8%        +11.6%             -28.1%       -15.0%         46.4→46.2%
2B      +20.5%        +21.9%             +22.3%*      -17.0%         49.1→48.8%

*The 2B run spends its memory savings on a larger micro-batch (46.7 → 57.1 GB peak). Non-zeros per layer fall from 39 to 24 going from 0.5B to 2B, amplifying the skipped work; a Pearson correlation of -0.996 means sparser layers yield bigger gains.

Patterns: layers 1-2 are the least active in the 28-layer 1.5B model; activity peaks in the early-middle layers (reasoning/knowledge). Sequence position 1 fires exponentially more neurons than later positions. Gains are larger on RTX PRO 6000 (188 SMs): sparse execution thrives where dense GEMM underutilizes the hardware.

Kernels are open-source (H100 with TMA and persistent CTAs; verified on RTX), with code for Llama and other architectures. Future work: fine-tuning existing dense models into sparsity.