Nested Weight-Sharing Compresses Multiple Sizes into One Checkpoint

Train one 30B hybrid Mamba-Transformer-MoE parent model on 160B tokens so that smaller 23B and 12B submodels are embedded as contiguous subsets of its highest-importance components. Rank embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN channels by their contribution to accuracy using Router-Weighted Expert Activation Pruning (REAP), which scores experts by routing-gate values and output magnitudes rather than naive activation frequency. A learnable end-to-end router takes a target budget (e.g., 2.8B active params) as a one-hot input, emits differentiable masks via Gumbel-Softmax, and trains jointly with knowledge distillation from the parent, penalizing budget deviations while maximizing accuracy. Use a two-stage curriculum: short context first (8K tokens, uniform budget sampling), then long context (49K tokens, with budgets sampled at p(30B)=0.5, p(23B)=0.3, p(12B)=0.2), which boosts AIME-2025 scores by up to 19.8% on the smaller variants. Width compression (reducing dims, heads, and experts) recovers 98.1% of baseline performance versus 95.2% for depth compression (layer dropping), so prioritize width for reasoning tasks.
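A minimal sketch of the budget-conditioned mask router, assuming per-component keep/drop logits and PyTorch's F.gumbel_softmax; the component counts, budget fractions, and loss weighting below are illustrative, not the released training recipe:

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_BUDGETS = 3        # hypothetical: 30B / 23B / 12B targets
NUM_COMPONENTS = 64    # hypothetical: gated units (heads, experts, FFN channels)

class BudgetRouter(nn.Module):
    def __init__(self):
        super().__init__()
        # One keep/drop logit pair per component, conditioned on the budget one-hot.
        self.proj = nn.Linear(NUM_BUDGETS, NUM_COMPONENTS * 2)

    def forward(self, budget_onehot, tau=1.0):
        logits = self.proj(budget_onehot).view(-1, NUM_COMPONENTS, 2)
        # Gumbel-Softmax: hard 0/1 masks on the forward pass, differentiable backward.
        keep = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 0]
        return keep

router = BudgetRouter()
budget = F.one_hot(torch.tensor([1]), NUM_BUDGETS).float()  # select the mid (23B) budget
mask = router(budget)                                       # (1, NUM_COMPONENTS) 0/1 mask

# Joint objective (sketch): add a penalty that pushes the kept fraction toward
# the target budget, alongside the distillation loss from the parent.
target_keep_frac = 0.77  # hypothetical fraction for the 23B slice of the 30B parent
budget_penalty = (mask.mean() - target_keep_frac).pow(2)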

This approach uses 360x fewer training tokens than pretraining each size separately and 7x fewer than sequential distillation, with every variant zero-shot sliceable from one 58.9 GB BF16 checkpoint, versus 126.1 GB for independently trained models.

Phase-Specific Sizing Optimizes the Reasoning Accuracy-Latency Tradeoff

Instead of capping tokens per phase on one fixed model, split inference by phase: assign the smaller nested model (e.g., 23B) to high-volume reasoning traces and the larger one (30B) to the precise final answer in an ℳS → ℳL configuration. The 23B→30B setup beats the Nemotron Nano v3 defaults by 16% accuracy at 1.9x lower latency, since reasoning tolerates capacity cuts but final answers demand precision. Elastic-23B hits 85.63 on AIME-2025 (vs. 80.00 for Qwen3-30B-A3B), matching or exceeding same-size independently trained models on GPQA, LiveCodeBench v5, MMLU-Pro, IFBench, and Tau Bench.
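A minimal sketch of the ℳS → ℳL phase split, assuming two slices already loaded via Transformers and plain-text <think> delimiters; the function name, token budgets, and prompt layout are illustrative, not the model's prescribed chat format:

def two_phase_answer(question, small, large, tokenizer,
                     think_budget=3072, answer_budget=1024):
    # Phase 1: the smaller nested slice drafts the high-volume reasoning trace.
    prompt = question + "\n<think>\n"
    ids = tokenizer(prompt, return_tensors="pt").to(small.device)
    trace = small.generate(**ids, max_new_tokens=think_budget)
    trace_text = tokenizer.decode(trace[0], skip_special_tokens=True)
    # Phase 2: the full parent writes the precise final answer on top of the trace.
    ids = tokenizer(trace_text + "\n</think>\n", return_tensors="pt").to(large.device)
    answer = large.generate(**ids, max_new_tokens=answer_budget)
    new_tokens = answer[0][ids["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)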

The 12B slice runs at 2.4x the throughput of the 30B on an H100 in BF16; in NVFP4, the 12B hits 7,426 tokens/s (3.4x) on an RTX Pro 6000.

Quantization Preserves Nesting for Edge Deployment

Apply Quantization-Aware Distillation (QAD) on the elastic checkpoint to preserve zero-shot slicing after quantization. FP8 PTQ recovers 98.69% of BF16 accuracy on the 30B; NVFP4 PTQ drops 4.12%, but QAD (~5B tokens at 48K context) recovers 97.79% (a QAD sketch follows the table). A single NVFP4 checkpoint is 18.7 GB for the 30B, so the 12B slice fits in 8 GB on an RTX 5080, where BF16 runs out of memory. Memory footprint by variant:

Variant    30B        23B        12B
BF16       58.9 GB    44.0 GB    23.2 GB
FP8        31.4 GB    23.7 GB    13.0 GB
NVFP4      18.7 GB    14.1 GB     8.0 GB
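A minimal QAD sketch, assuming symmetric per-tensor fake quantization with a straight-through estimator standing in for NVFP4 and KL distillation from the BF16 teacher; this illustrates the idea, not NVIDIA's recipe:

import torch
import torch.nn.functional as F

def fake_quant(w, bits=4):
    # Straight-through estimator: quantized values forward, identity gradient backward.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (q - w).detach()

w = torch.randn(4, 4, requires_grad=True)
w_q = fake_quant(w)  # forward: 4-bit values; backward: gradients pass straight through

def qad_step(student, teacher, batch, optimizer, tau=1.0):
    # Assumes HF-style models whose outputs expose .logits, and that the
    # student's forward pass applies fake_quant to its matmul weights.
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    loss = F.kl_div(
        F.log_softmax(s_logits / tau, dim=-1),
        F.softmax(t_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()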

Load and Serve with Transformers or vLLM

Grab the checkpoints from Hugging Face: nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-{BF16|FP8|NVFP4}. Use trust_remote_code=True for the hybrid architecture.

Transformers example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")

# Budget max_new_tokens for the <think> trace plus the final answer.
inputs = tokenizer("Solve: 2x + 3 = 11. Show your reasoning.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For production, serve with vLLM: vllm serve <model_id> exposes an OpenAI-compatible API; Docker and SGLang deployments also work. Query with max_tokens=4096 and temperature=0.6.
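A minimal client sketch against the vLLM OpenAI-compatible endpoint, assuming the server is running on localhost:8000; the host, port, and prompt are illustrative:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16",
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
    max_tokens=4096,  # room for the <think> trace plus the answer
    temperature=0.6,
)
print(resp.choices[0].message.content)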