The Data-Scale Threshold for Multilingual Initialization
When adapting streaming speech recognition models to new languages, the choice between a multilingual (ML) and an English-only (EN) encoder depends primarily on the volume of available target-language data rather than on streaming latency. Experiments with a 0.6B-parameter FastConformer transducer across eight European languages show that the performance gap between ML and EN initialization decays approximately as a power law in the number of training hours.
At 100 hours of target-language data, the multilingual encoder provides a clear advantage: the EN-ML gap in word error rate (WER) on the FLEURS benchmark is +4.21 percentage points. The advantage shrinks quickly as data increases, and at 2500 hours the gap is only +0.20 percentage points. Put differently, each doubling of target-language data roughly halves the remaining benefit of multilingual initialization.
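The two reported gaps are enough to check whether "roughly halves per doubling" is consistent with a power law. The sketch below fits gap(h) = a * h^(-b) through the 100-hour and 2500-hour points. The function names are illustrative, and the gaps it prints for intermediate data volumes are interpolations under the power-law assumption, not measured results.

```python
import math

def fit_power_law(h1, gap1, h2, gap2):
    """Fit gap(h) = a * h**(-b) exactly through two (hours, gap) points."""
    b = math.log(gap1 / gap2) / math.log(h2 / h1)
    a = gap1 * h1 ** b
    return a, b

def predicted_gap(hours, a, b):
    """EN-ML WER gap (percentage points) implied by the fitted curve."""
    return a * hours ** (-b)

# The only two data points stated in the text.
a, b = fit_power_law(100, 4.21, 2500, 0.20)

# Factor by which the gap shrinks each time the data volume doubles.
per_doubling = 2 ** (-b)

print(f"exponent b = {b:.3f}")
print(f"gap multiplier per doubling = {per_doubling:.3f}")

# Interpolated values only: these are NOT reported measurements.
for hours in (100, 200, 400, 800, 1600, 2500):
    print(f"{hours:>5} h -> {predicted_gap(hours, a, b):.2f} pp")
```

The fitted exponent comes out just under 1, which gives a per-doubling multiplier slightly above one half. That matches the halving rule of thumb.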
Latency and Quantization Independence
Contrary to common intuition, tight streaming latency requirements do not amplify the benefit of multilingual initialization. The study found that, at any given data scale, the EN-ML gap remains stable across streaming tiers, from 160 ms to offline decoding.
Furthermore, the study indicates that quantization decisions can be made independently of the initialization strategy. Applying 4-bit weight-only quantization to the encoder at the 560 ms streaming tier reduces the encoder footprint by approximately 3x while increasing average WER by only 0.5 percentage points. The practical takeaway for builders: prioritize multilingual initialization when target-language data is scarce. At scale, the choice of initialization matters little (a +0.20 percentage point gap at 2500 hours), and latency and quantization optimizations can be handled as separate, independent engineering tasks.
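A 4-bit format rarely delivers the ideal 4x reduction over 16-bit weights, because quantization metadata and unquantized parameters also take space. The following back-of-the-envelope sketch illustrates that arithmetic. The 16-bit baseline, the group-wise scales, the group sizes, and the fraction of unquantized parameters are all assumptions made for illustration; the source does not state the quantization scheme or the baseline precision.

```python
def compression_ratio(group_size, unquantized_fraction=0.0,
                      baseline_bits=16, weight_bits=4, scale_bits=16):
    """Estimated footprint reduction for weight-only quantization.

    Assumes (hypothetically) one scale of `scale_bits` per group of
    `group_size` weights, and that `unquantized_fraction` of the
    parameters stay at `baseline_bits`.
    """
    # Effective bits per quantized weight, including its share of the scale.
    quantized_bits = weight_bits + scale_bits / group_size
    average_bits = (unquantized_fraction * baseline_bits
                    + (1.0 - unquantized_fraction) * quantized_bits)
    return baseline_bits / average_bits

# Illustrative settings only; none of these values come from the study.
for group_size in (16, 32, 64, 128):
    ratio = compression_ratio(group_size)
    print(f"group size {group_size:>3}: {ratio:.2f}x smaller")

print(f"group size 32, 5% kept at 16-bit: "
      f"{compression_ratio(32, unquantized_fraction=0.05):.2f}x smaller")
```

Under these assumptions the estimated reduction falls between 3x and 4x, which is in the same range as the reported figure of approximately 3x.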