NVIDIA's Nemotron 3.5 ASR: Efficient Multilingual Streaming Speech

Architecture and Efficiency

Nemotron 3.5 ASR uses a Cache-Aware FastConformer-RNNT architecture designed to eliminate the redundant computation typical of buffered streaming models. Traditional buffered streaming re-processes overlapping audio windows; this model instead caches encoder self-attention and convolution activations and reuses them across chunks. As a result, each audio frame is processed exactly once, which reduces compute requirements and end-to-end latency without sacrificing accuracy.
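The saving from caching can be illustrated with a toy frame counter. This is only a sketch of the general idea, not the model's actual implementation: the function names, the chunk size, and the left-context length are all hypothetical, and a simple deque stands in for the cached attention and convolution activations.

```python
from collections import deque


def buffered_frame_visits(num_frames, chunk, left_context):
    """Count frame computations when every chunk is re-encoded
    together with its overlapping left context (buffered streaming)."""
    visits = 0
    for start in range(0, num_frames, chunk):
        window_start = max(0, start - left_context)
        window_end = min(num_frames, start + chunk)
        visits += window_end - window_start  # old frames are recomputed
    return visits


def cached_frame_visits(num_frames, chunk, left_context):
    """Count frame computations when left-context activations are
    kept in a cache, so only new frames are encoded."""
    cache = deque(maxlen=left_context)  # stand-in for cached activations
    visits = 0
    for start in range(0, num_frames, chunk):
        end = min(num_frames, start + chunk)
        for frame in range(start, end):
            visits += 1          # each new frame is encoded once
            cache.append(frame)  # and kept as context for later chunks
    return visits


if __name__ == "__main__":
    # Hypothetical stream: 100 frames, 10-frame chunks, 70 frames of left context.
    print(buffered_frame_visits(100, 10, 70))  # 520
    print(cached_frame_visits(100, 10, 70))    # 100
```

In this toy setting the buffered approach touches each frame several times, while the cached approach touches each frame once, which is the property the paragraph above describes.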

Configurable Latency and Language Handling

The model exposes an att_context_size parameter that lets developers tune the latency-accuracy trade-off at inference time. Settings range from an 80 ms ultra-low-latency mode (suited to voice agents) to a 1.12 s high-accuracy mode (suited to transcription), all served from the same checkpoint.
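One way to reason about these settings is to map a context configuration to a chunk latency. The sketch below rests on assumptions that the text above does not state: that att_context_size is a [left, right] pair counted in encoder frames, that one encoder frame spans 80 ms, and that latency is (right + 1) frames. The preset names and the left-context value of 70 are hypothetical; only the 80 ms and 1.12 s endpoints come from the text.

```python
FRAME_MS = 80  # ASSUMPTION: duration of one encoder frame in milliseconds


def chunk_latency_ms(att_context_size, frame_ms=FRAME_MS):
    """Estimate chunk latency for a [left, right] context pair.

    ASSUMPTION: latency is governed by the current frame plus the
    right (look-ahead) context; left context is served from cache.
    """
    left, right = att_context_size
    if left < 0 or right < 0:
        raise ValueError("context sizes must be non-negative")
    return (right + 1) * frame_ms


# Hypothetical presets; the left-context value is illustrative only.
PRESETS = {
    "voice_agent": [70, 0],     # lowest latency
    "transcription": [70, 13],  # highest accuracy
}

if __name__ == "__main__":
    for name, ctx in PRESETS.items():
        print(f"{name}: att_context_size={ctx} -> {chunk_latency_ms(ctx)} ms")
```

Under these assumptions the two presets reproduce the 80 ms and 1.12 s endpoints quoted above; the real parameter semantics should be checked against the model's documentation.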

Language support is handled via prompt-based conditioning. A single 600M-parameter model covers 40 language-locales, including English, Spanish, German, French, Arabic, Japanese, Mandarin, and others. The model supports a target_lang=auto mode, which enables the system to detect languages dynamically and emit language tags, facilitating the transcription of mixed-language audio streams without needing separate language-ID components.
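Downstream code would typically need to split a tagged transcript into per-language segments. The sketch below assumes a tag format such as `<en-US>` placed before each language span; the actual tag syntax emitted by the model is not given in the text above, so both the pattern and the function name are hypothetical.

```python
import re

# ASSUMPTION: language tags look like <en-US> or <ja>; the real format may differ.
TAG = re.compile(r"<([a-z]{2}(?:-[A-Z]{2})?)>")


def split_by_language(text):
    """Split a tagged transcript into (language, text) segments.

    Text before the first tag is returned with language None.
    """
    # re.split with one capture group yields:
    # [prefix, lang1, text1, lang2, text2, ...]
    parts = TAG.split(text)
    segments = []
    prefix = parts[0].strip()
    if prefix:
        segments.append((None, prefix))
    for lang, chunk in zip(parts[1::2], parts[2::2]):
        chunk = chunk.strip()
        if chunk:
            segments.append((lang, chunk))
    return segments


if __name__ == "__main__":
    sample = "<en-US> hello there <es-ES> buenos días"
    for lang, chunk in split_by_language(sample):
        print(lang, "|", chunk)
```

Keeping the tags inline with the text, rather than running a separate language-ID pass, mirrors the single-model design described above.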

Fine-Tuning and Performance

Because the model is released with open weights (OpenMDW-1.1), it can be adapted to specific domains, accents, or languages. NVIDIA demonstrated this by fine-tuning the base model on Greek and Bulgarian datasets. Using the same Cache-Aware FastConformer-RNNT recipe, they achieved relative Word Error Rate (WER) improvements of 32% for Greek and 31% for Bulgarian, which suggests that the base model is a solid foundation for specialized speech applications.
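For readers unfamiliar with the metric, a relative WER improvement compares the fine-tuned error rate against the baseline rather than subtracting percentage points. The sketch below computes word-level WER with a standard edit distance and then the relative reduction. The example WER values (25.0 and 17.0) are invented purely to show the arithmetic and are not NVIDIA's reported numbers.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer, tuned_wer):
    """Fraction by which fine-tuning lowered WER relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - tuned_wer) / baseline_wer


if __name__ == "__main__":
    print(wer("the cat sat down", "the cat sat"))  # one deletion over four words
    # Illustrative values only: a baseline of 25.0% WER dropping to 17.0%.
    print(f"{relative_wer_reduction(25.0, 17.0):.0%}")
```

A 32% relative improvement therefore means the fine-tuned model makes roughly a third fewer word errors than the baseline, whatever the absolute WER values are.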
