Nemotron 3 Nano Omni: Unified Open Model for Multimodal Agents
NVIDIA's 30B-parameter Nemotron 3 Nano Omni fuses text, vision (C-RadIO), and audio (Parakeet) encoders into one MoE model pretrained on 25T tokens, enabling fast local agents for document analysis, video understanding, and tool calling. Detailed training recipes support fine-tuning.
Single-Model Multimodal Backbone Powers Agentic Workflows
Nemotron 3 Nano Omni is a 30B-parameter Mixture-of-Experts (MoE) model with 3B active parameters, pretrained on 25 trillion tokens. It integrates NVIDIA's C-RadIO vision encoder for images and videos alongside the Parakeet audio encoder used in NVIDIA's ASR systems, so text, images, videos, and audio are handled in a single forward pass rather than by a suite of separate models. Target agent tasks include real-world document analysis, multi-image reasoning (e.g., comparing screenshots), automatic speech recognition, long audio/video understanding, and agentic computer use. Unlike proprietary multimodal models, this open release documents the full architecture: the Nano base from the Nemotron 3 family, a vision adapter for video, and joint post-training. It leads PinchBench, a benchmark measuring OpenCLaw agent performance, taking the top open-model rating previously held by Nemotron 3 Super (120B, 1M context). One trade-off: for bulk transcription alone, the standalone Parakeet model avoids the multimodal overhead.
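Because every modality flows through the same model, a single chat request can mix them. A minimal sketch, assuming an OpenAI-compatible endpoint (such as a local vLLM server) and a hypothetical model ID, neither confirmed by the release:

```python
# Minimal sketch: one request mixing text and an image through an
# OpenAI-compatible endpoint. The base_url and model ID are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni",  # hypothetical model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the table on this scanned page."},
            {"type": "image_url", "image_url": {"url": "https://example.com/page.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```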
Detailed Training Recipes Enable Reproducible Fine-Tuning
NVIDIA's tech reports are unusually detailed for open models, breaking down pre-training data (languages, token counts), supervised fine-tuning (SFT) examples by type, vision/audio encoder tuning, joint omni-SFT, and RL for reasoning. The Nemotron 3 Nano report specifies the data mixes and SFT recipes behind each capability; the Omni report extends it with vision SFT, audio fine-tuning, and reasoning RL. The datasets are public on Hugging Face, letting you replicate the recipes for custom fine-tuning, such as improving OCR; a sketch of loading one dataset follows this paragraph. This transparency addresses enterprise needs beyond open weights: predictable responses and reproducible recipes for production agents. No other open multimodal paper details its components, training stages, and data at this granularity.
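As a minimal sketch of that replication path, assuming the `datasets` library and a hypothetical repo ID (the actual dataset names live on the release's Hugging Face pages), you might inspect a published SFT set before training:

```python
from collections import Counter
from datasets import load_dataset

# Hypothetical repo ID -- substitute the actual dataset name from the
# Nemotron release on Hugging Face.
sft_data = load_dataset("nvidia/nemotron-sft-example", split="train")

# Count samples per task type on a small slice before fine-tuning;
# the 'category' field is an assumption and varies by dataset.
sample = sft_data.select(range(min(1000, len(sft_data))))
print(Counter(row.get("category", "unknown") for row in sample))
```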
Fast Local Inference with Reasoning Controls and Tools
Run the model locally (full FP16, or quantized FP8, FP4, and GGUF builds) via vLLM for low-latency inference without tying up your main machine; the demo runs on a DGX Spark over the LAN with a Gradio UI. Controls include a reasoning budget (tokens allotted to chain-of-thought), visible thinking traces (shown in green), and system prompts (e.g., pirate mode). Examples: a coin-flip probability question evaluated step by step, where a higher budget yields better accuracy; image analysis, describing charts and reasoning over them; audio transcription and extraction, turning a podcast clip into key quotes; and video processing. For agent integration, tool calls such as 'capture_observation' on images return structured JSON. The model is free on OpenRouter (text/images, limited audio/video) and fully available via the NVIDIA API or Hugging Face. The Colab setup picks a provider and toggles reasoning: a low budget skips deep reasoning at some cost to quality, while a high budget lays complex reasoning out explicitly. Both the budget toggle and the tool-call flow are sketched below.
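First, the reasoning-budget toggle. A minimal sketch against a local vLLM server; how the budget is actually exposed may differ (a system-prompt switch, a dedicated sampling parameter, or a server flag), so capping `max_tokens` here is only a stand-in:

```python
# Compare a low vs. high reasoning budget on the coin-flip question.
# max_tokens is a crude cap, not the release's documented budget control.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str, budget: int) -> str:
    response = client.chat.completions.create(
        model="nvidia/nemotron-3-nano-omni",  # hypothetical model ID
        messages=[
            {"role": "system", "content": "Reason step by step before answering."},
            {"role": "user", "content": question},
        ],
        max_tokens=budget,  # crude budget cap
    )
    return response.choices[0].message.content

q = "A fair coin is flipped three times. What is the probability of exactly two heads?"
for budget in (64, 1024):
    print(f"--- budget={budget} ---\n{ask(q, budget)}")
```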
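Second, the tool-call flow. This sketch registers 'capture_observation' through the standard OpenAI tool-calling schema; the parameter shape is illustrative, not taken from the release:

```python
# Register a tool and read back the structured JSON the model emits.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "capture_observation",
        "description": "Record a structured observation about the current image.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string"},   # illustrative fields, not
                "detail": {"type": "string"},  # from the release
            },
            "required": ["label", "detail"],
        },
    },
}]

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni",  # hypothetical model ID
    messages=[{"role": "user", "content": "Observe the attached screenshot."}],
    tools=tools,
)

# On a tool call, the model returns the function name plus JSON arguments
# instead of prose.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```

From here an agent loop would execute the tool and feed the result back as a 'tool' role message.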