Nested Weight-Sharing Compresses Multiple Sizes into One Checkpoint

Train one 30B hybrid Mamba-Transformer-MoE parent model on 160B tokens so that smaller 23B and 12B submodels are embedded as contiguous subsets of its highest-importance components. Rank embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN channels by their contribution to accuracy using Router-Weighted Expert Activation Pruning (REAP), which scores experts by routing-gate values and output magnitudes rather than naive activation frequency. A learnable end-to-end router takes a target budget (e.g., 2.8B active params) as a one-hot input, emits differentiable masks via Gumbel-Softmax, and trains jointly with knowledge distillation from the parent, penalizing budget deviations while maximizing accuracy. Use a two-stage curriculum: short context first (8K tokens, uniform budget sampling), then long context (49K tokens, with budgets sampled at p(30B)=0.5, p(23B)=0.3, p(12B)=0.2), which boosts AIME-2025 scores by up to 19.8% on the smaller variants. Width compression (reducing dims, heads, and experts) recovers 98.1% of baseline performance versus 95.2% for depth compression (layer dropping), so prioritize width for reasoning tasks.
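A minimal sketch of the budget-conditioned mask router, assuming per-component keep/drop logits and PyTorch's F.gumbel_softmax; the component counts, budget fractions, and loss weighting below are illustrative, not the released training recipe:

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_BUDGETS = 3        # hypothetical: 30B / 23B / 12B targets
NUM_COMPONENTS = 64    # hypothetical: gated units (heads, experts, FFN channels)

class BudgetRouter(nn.Module):
    def __init__(self):
        super().__init__()
        # One keep/drop logit pair per component, conditioned on the budget one-hot.
        self.proj = nn.Linear(NUM_BUDGETS, NUM_COMPONENTS * 2)

    def forward(self, budget_onehot, tau=1.0):
        logits = self.proj(budget_onehot).view(-1, NUM_COMPONENTS, 2)
        # Gumbel-Softmax: hard 0/1 masks on the forward pass, differentiable backward.
        keep = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 0]
        return keep

router = BudgetRouter()
budget = F.one_hot(torch.tensor([1]), NUM_BUDGETS).float()  # select the mid (23B) budget
mask = router(budget)                                       # (1, NUM_COMPONENTS) 0/1 mask

# Joint objective (sketch): add a penalty that pushes the kept fraction toward
# the target budget, alongside the distillation loss from the parent.
target_keep_frac = 0.77  # hypothetical fraction for the 23B slice of the 30B parent
budget_penalty = (mask.mean() - target_keep_frac).pow(2)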

This approach uses 360x fewer training tokens than pretraining each size separately and 7x fewer than sequential distillation, with every variant zero-shot sliceable from one 58.9 GB BF16 checkpoint, versus 126.1 GB for independently trained models.

Phase-Specific Sizing Optimizes the Reasoning Accuracy-Latency Tradeoff

Instead of capping tokens per phase on one fixed model, split inference by phase: assign the smaller nested model (e.g., 23B) to high-volume reasoning traces and the larger one (30B) to the precise final answer in an ℳS → ℳL configuration. The 23B→30B setup beats the Nemotron Nano v3 defaults by 16% accuracy at 1.9x lower latency, since reasoning tolerates capacity cuts but final answers demand precision. Elastic-23B hits 85.63 on AIME-2025 (vs. 80.00 for Qwen3-30B-A3B), matching or exceeding same-size independently trained models on GPQA, LiveCodeBench v5, MMLU-Pro, IFBench, and Tau Bench.
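A minimal sketch of the ℳS → ℳL phase split, assuming two slices already loaded via Transformers and plain-text <think> delimiters; the function name, token budgets, and prompt layout are illustrative, not the model's prescribed chat format:

def two_phase_answer(question, small, large, tokenizer,
                     think_budget=3072, answer_budget=1024):
    # Phase 1: the smaller nested slice drafts the high-volume reasoning trace.
    prompt = question + "\n<think>\n"
    ids = tokenizer(prompt, return_tensors="pt").to(small.device)
    trace = small.generate(**ids, max_new_tokens=think_budget)
    trace_text = tokenizer.decode(trace[0], skip_special_tokens=True)
    # Phase 2: the full parent writes the precise final answer on top of the trace.
    ids = tokenizer(trace_text + "\n</think>\n", return_tensors="pt").to(large.device)
    answer = large.generate(**ids, max_new_tokens=answer_budget)
    new_tokens = answer[0][ids["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)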

The 12B slice runs at 2.4x the throughput of the 30B on an H100 in BF16; in NVFP4, the 12B hits 7,426 tokens/s (3.4x) on an RTX Pro 6000.

Quantization Preserves Nesting for Edge Deployment

Apply Quantization-Aware Distillation (QAD) on the elastic checkpoint to preserve zero-shot slicing after quantization. FP8 PTQ recovers 98.69% of BF16 accuracy on the 30B; NVFP4 PTQ drops 4.12%, but QAD (~5B tokens at 48K context) recovers 97.79% (a QAD sketch follows the table). A single NVFP4 checkpoint is 18.7 GB for the 30B, so the 12B slice fits in 8 GB on an RTX 5080, where BF16 runs out of memory. Memory footprint by variant:

Variant    30B        23B        12B
BF16       58.9 GB    44.0 GB    23.2 GB
FP8        31.4 GB    23.7 GB    13.0 GB
NVFP4      18.7 GB    14.1 GB     8.0 GB
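A minimal QAD sketch, assuming symmetric per-tensor fake quantization with a straight-through estimator standing in for NVFP4 and KL distillation from the BF16 teacher; this illustrates the idea, not NVIDIA's recipe:

import torch
import torch.nn.functional as F

def fake_quant(w, bits=4):
    # Straight-through estimator: quantized values forward, identity gradient backward.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (q - w).detach()

w = torch.randn(4, 4, requires_grad=True)
w_q = fake_quant(w)  # forward: 4-bit values; backward: gradients pass straight through

def qad_step(student, teacher, batch, optimizer, tau=1.0):
    # Assumes HF-style models whose outputs expose .logits, and that the
    # student's forward pass applies fake_quant to its matmul weights.
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    loss = F.kl_div(
        F.log_softmax(s_logits / tau, dim=-1),
        F.softmax(t_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()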

Load and Serve with Transformers or vLLM

Grab the checkpoints from Hugging Face: nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-{BF16|FP8|NVFP4}. Use trust_remote_code=True for the hybrid architecture.

Transformers example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")

# Budget max_new_tokens for the <think> trace plus the final answer.
inputs = tokenizer("Solve: 2x + 3 = 11. Show your reasoning.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For production, serve with vLLM: vllm serve <model_id> exposes an OpenAI-compatible API; Docker and SGLang deployments also work. Query with max_tokens=4096 and temperature=0.6.
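A minimal client sketch against the vLLM OpenAI-compatible endpoint, assuming the server is running on localhost:8000; the host, port, and prompt are illustrative:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16",
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
    max_tokens=4096,  # room for the <think> trace plus the answer
    temperature=0.6,
)
print(resp.choices[0].message.content)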