NVIDIA's Nemotron 3 Ultra: A 550B Hybrid Mamba-Transformer for Agents

Architecture for Agentic Efficiency

Nemotron 3 Ultra is a 550B-parameter Mixture-of-Experts (MoE) model that activates only 55B parameters per token. Its core innovation is a hybrid Mamba-Attention architecture. Mamba layers carry a fixed-size recurrent state, so their per-step decode cost stays constant as the sequence grows, and the model as a whole scales sub-quadratically on long sequences. The design targets throughput on decode-heavy agentic tasks, where token counts accumulate over time. The model has 108 layers, a model dimension of 8,192, and 512 experts with top-22 routing.
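To make the routing figures concrete, here is a minimal sketch of top-k expert selection for a single token. Only the 512-expert and top-22 values come from the description above. The function name `route_token`, the random router logits, and the choice to softmax-normalize over the selected experts are illustrative assumptions, not details of NVIDIA's router.

```python
import math
import random

NUM_EXPERTS = 512  # from the article
TOP_K = 22         # from the article

def route_token(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Weights are a softmax over the selected logits only. That normalization
    is an assumption made for this sketch.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for one token.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(logits)
expert_fraction = TOP_K / NUM_EXPERTS

print(f"experts used for this token: {len(selected)}")
print(f"fraction of experts active: {expert_fraction:.1%}")
```

Only the selected experts run for a given token, which is how a 550B-parameter model can execute with a much smaller active set per token.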

Advanced Post-Training and Distillation

NVIDIA used a multi-stage post-training pipeline to refine the model's reasoning and tool-use capabilities. It begins with Supervised Fine-Tuning (SFT), followed by Reinforcement Learning with Verifiable Rewards (RLVR) across 15 environments, including software engineering and math. To counter the dilution of learning signals in multi-environment RL, NVIDIA introduced Multi-teacher On-Policy Distillation (MOPD). In this process, the student model generates rollouts that are scored by more than ten domain-specialized teacher models, which provide dense, token-level guidance. The model also supports inference-time budget control: a "medium-effort" mode lets users trade roughly 7% accuracy for a 2.5x reduction in token usage.
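The idea of dense, token-level guidance on the student's own rollouts can be sketched as follows. The description above does not specify the loss, so this example assumes a per-token reverse KL divergence, a common choice in on-policy distillation. The names `mopd_token_losses` and `teachers`, the two toy domains, and the three-token vocabulary are all hypothetical. A real teacher would condition on the rollout prefix rather than on the step index alone.

```python
import math

def softmax(logits):
    """Convert a logit vector to probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position (assumed loss)."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mopd_token_losses(domain, student_steps, teachers):
    """Score each token of a student rollout against its domain's teacher.

    student_steps: per-token logit vectors from the student's own rollout.
    teachers: maps a domain name to a function returning teacher logits
              for a given step index (a stand-in for a real teacher model).
    """
    teacher = teachers[domain]
    return [reverse_kl(s, teacher(t)) for t, s in enumerate(student_steps)]

# Two toy teachers over a 3-token vocabulary (hypothetical).
teachers = {
    "math": lambda t: [2.0, 0.0, 0.0],
    "code": lambda t: [0.0, 2.0, 0.0],
}

# A two-token student rollout in the "math" domain: the first step agrees
# with the teacher, the second does not.
student_steps = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
losses = mopd_token_losses("math", student_steps, teachers)

for step, loss in enumerate(losses):
    print(f"token {step}: loss {loss:.4f}")
```

Every token position receives its own signal, in contrast to a single scalar reward per rollout, and each rollout is scored only by the teacher for its domain.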

Deployment and Performance

NVIDIA released the model as a single NVFP4 checkpoint at 5.03 bits per element. This quantization lets the model fit on a single 8-GPU H100 node, whereas an FP8 checkpoint would require multi-node scaling. On benchmarks, the model is competitive in agentic tasks, scoring 71.9 on SWE-Bench Verified and 94.7 on RULER at 1 million tokens. While it trails some models in prefill-heavy workloads, it delivers up to 5.9x higher throughput than comparable models such as GLM-5.1 in decode-heavy scenarios when served with TRT-LLM.
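A back-of-the-envelope calculation shows why the checkpoint format matters for single-node deployment. The parameter count and the 5.03 bits-per-element figure come from the text above. The 80 GB per-GPU capacity and the flat 8 bits per element for FP8 (ignoring any scale-factor overhead) are assumptions, and the estimate covers weights only.

```python
PARAMS = 550e9         # from the article
NVFP4_BITS = 5.03      # from the article
FP8_BITS = 8.0         # assumption: no scale-factor overhead
GPU_MEMORY_GB = 80     # assumption: 80 GB H100 variant
GPUS_PER_NODE = 8      # from the article

def weight_gb(params, bits_per_element):
    """Weight storage in decimal gigabytes."""
    return params * bits_per_element / 8 / 1e9

node_gb = GPU_MEMORY_GB * GPUS_PER_NODE
nvfp4_gb = weight_gb(PARAMS, NVFP4_BITS)
fp8_gb = weight_gb(PARAMS, FP8_BITS)

for name, size in (("NVFP4", nvfp4_gb), ("FP8", fp8_gb)):
    headroom = node_gb - size
    print(f"{name}: {size:.1f} GB of weights, "
          f"{headroom:.1f} GB left on a {node_gb} GB node")
```

Under these assumptions the NVFP4 weights occupy a little over half of the node's memory. The FP8 weights alone would consume most of it, leaving little room for the KV cache, Mamba state, and activations that long-context agentic serving needs.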
