Architecture and Efficiency
Nemotron 3.5 ASR uses a Cache-Aware FastConformer-RNNT architecture designed to eliminate the redundant computation typical of buffered streaming models. Traditional buffered streaming re-processes overlapping audio windows at every step; this model instead caches encoder self-attention and convolution activations and reuses them for later frames. As a result, each audio frame is processed exactly once, which reduces compute requirements and end-to-end latency without sacrificing accuracy.
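The caching idea can be illustrated with a toy example. The sketch below is not the model's implementation: it uses a single causal 1-D convolution in plain Python, and every name in it (causal_conv, stream_chunk, run_streaming, buffered_frames_processed) is hypothetical. It shows that carrying the last k-1 input frames between chunks reproduces the offline output while touching each frame once, and it counts how many frames a buffered scheme with overlapping windows would re-encode. The real encoder also caches self-attention state, which this sketch leaves out.

```python
from typing import List, Tuple


def causal_conv(frames: List[float], kernel: List[float]) -> List[float]:
    """Offline reference: causal 1-D convolution with zero left-padding."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + frames
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(frames))]


def stream_chunk(chunk: List[float], kernel: List[float],
                 cache: List[float]) -> Tuple[List[float], List[float]]:
    """Process one chunk, reusing the cached last k-1 input frames."""
    k = len(kernel)
    window = cache + chunk
    out = [sum(kernel[j] * window[i + j] for j in range(k))
           for i in range(len(chunk))]
    # Keep only the frames the next chunk needs as left context.
    new_cache = window[len(window) - (k - 1):]
    return out, new_cache


def run_streaming(frames: List[float], kernel: List[float],
                  chunk_size: int) -> Tuple[List[float], int]:
    """Stream in chunks; return outputs and the number of frames processed."""
    cache = [0.0] * (len(kernel) - 1)  # initial cache mirrors the zero padding
    outputs: List[float] = []
    frames_processed = 0
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        out, cache = stream_chunk(chunk, kernel, cache)
        outputs.extend(out)
        frames_processed += len(chunk)  # each new frame is handled once
    return outputs, frames_processed


def buffered_frames_processed(n_frames: int, window: int, hop: int) -> int:
    """Frames encoded by a buffered scheme that re-encodes the last
    `window` frames every time `hop` new frames arrive."""
    total = 0
    for end in range(hop, n_frames + 1, hop):
        total += min(window, end)
    return total


frames = [float(i) for i in range(12)]
kernel = [0.25, 0.5, 0.25]
streamed, n_cached = run_streaming(frames, kernel, chunk_size=4)
print(streamed == causal_conv(frames, kernel))  # cached output matches offline
print(n_cached, buffered_frames_processed(12, window=6, hop=2))
```

In this toy setup the cached path handles 12 frames in total, while the buffered path with a 6-frame window and a 2-frame hop encodes 30, which is the kind of overlap the cache-aware design removes.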
Configurable Latency and Language Handling
The model exposes an att_context_size parameter that lets developers tune the latency-accuracy trade-off at inference time. Settings range from an 80 ms ultra-low-latency mode (for voice agents) to a 1.12 s high-accuracy mode (for transcription), all served from the same checkpoint.
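To make the two latency endpoints concrete, here is a small arithmetic sketch. It rests on assumptions that the text above does not state: that att_context_size is a [left, right] pair measured in encoder frames, that one encoder frame spans 80 ms, and that the chunk latency is the current frame plus the right (look-ahead) frames. The left-context value of 70 and the function name chunk_latency_ms are purely illustrative.

```python
from typing import Sequence

FRAME_MS = 80  # assumed duration of one encoder frame, in milliseconds


def chunk_latency_ms(att_context_size: Sequence[int],
                     frame_ms: int = FRAME_MS) -> int:
    """Latency of one streaming chunk under the assumptions above.

    att_context_size is treated as [left, right] in encoder frames; only
    the right (look-ahead) context adds latency, plus the current frame.
    """
    _left, right = att_context_size
    if right < 0:
        raise ValueError("right context must be non-negative")
    return (right + 1) * frame_ms


# Hypothetical settings; the left context of 70 frames is an assumption.
print(chunk_latency_ms([70, 0]))   # no look-ahead
print(chunk_latency_ms([70, 13]))  # 13 look-ahead frames
```

Under these assumptions, zero look-ahead frames gives 80 ms and 13 look-ahead frames gives 1120 ms, matching the two modes quoted above. Consult the model's own documentation for the actual parameter values.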
Language support is handled via prompt-based conditioning. A single 600M-parameter model covers 40 language-locales, including English, Spanish, German, French, Arabic, Japanese, and Mandarin. The model also supports a target_lang=auto mode, in which it detects the language dynamically and emits language tags, so mixed-language audio streams can be transcribed without a separate language-ID component.
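Downstream code typically needs to split a tagged transcript into per-language segments. The sketch below shows one way to do that. The tag format is an assumption: the text above does not say how language tags are written, so the `<en-US>`-style markers, the regular expression, and the function name split_by_language are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical tag syntax such as <en-US> or <es>; the real format may differ.
TAG = re.compile(r"<([a-z]{2,3}(?:-[A-Z]{2})?)>")


def split_by_language(transcript: str) -> List[Tuple[str, str]]:
    """Split a tagged transcript into (language, text) segments."""
    # re.split with one capture group yields:
    # [text_before_first_tag, tag1, text1, tag2, text2, ...]
    parts = TAG.split(transcript)
    segments: List[Tuple[str, str]] = []
    leading = parts[0].strip()
    if leading:
        segments.append(("unknown", leading))  # text seen before any tag
    for tag, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:
            segments.append((tag, text))
    return segments


example = "<en-US> turn left at the light <es-ES> gire a la izquierda"
for lang, text in split_by_language(example):
    print(lang, "->", text)
```

Keeping the parser separate from the recognizer means only the regular expression needs to change once the actual tag syntax is known.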
Fine-Tuning and Performance
Because the model is released with open weights (OpenMDW-1.1), it can be adapted to specific domains, accents, or languages. NVIDIA demonstrated this by fine-tuning the base model on Greek and Bulgarian datasets. Using the same Cache-Aware FastConformer-RNNT recipe, they achieved relative Word Error Rate (WER) improvements of 32% for Greek and 31% for Bulgarian, indicating that the base model is a solid foundation for specialized speech applications.
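For readers unfamiliar with the metric, the sketch below shows how WER and a relative WER improvement are commonly computed: WER is the word-level edit distance divided by the reference length, and the relative improvement is the drop in WER divided by the baseline WER. This is a generic illustration, not NVIDIA's evaluation code; the function names are hypothetical and the numbers in the usage lines are made up for the example rather than taken from the reported Greek or Bulgarian results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard Levenshtein dynamic programme, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(base_wer: float, tuned_wer: float) -> float:
    """Fraction by which fine-tuning lowers WER relative to the baseline."""
    if base_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (base_wer - tuned_wer) / base_wer


# Made-up numbers for illustration only.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(relative_wer_improvement(base_wer=0.25, tuned_wer=0.17))
```

A relative figure such as 32% therefore describes the change against the baseline WER, not an absolute drop of 32 percentage points.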