Continuous Streaming vs. Turn-Based Interaction
Gemini 3.5 Live Translate (gemini-3.5-live-translate-preview) moves away from traditional turn-based conversational models toward a continuous stream-processing pipeline. Standard AI agents wait for a speaker to finish a sentence before processing it. This model instead translates audio in real time as it streams in. The design prioritizes low latency, keeping the translated output only a few seconds behind the speaker. To meet that latency target, the model omits general agent capabilities: it does not support text input, tool use, or system instructions, and functions strictly as a specialized interpreter.
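To make the contrast concrete, here is a toy Python sketch. It is a conceptual illustration only and does not describe the model's internals. `turn_based`, `streaming`, and `lag_chunks` are hypothetical names, and text strings stand in for audio chunks. The turn-based version emits nothing until the input ends. The streaming version releases output as soon as its buffer exceeds a fixed lag.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List

# Hypothetical translator signature: takes a list of chunks, returns translated text.
Translate = Callable[[List[str]], str]


def turn_based(chunks: Iterable[str], translate: Translate) -> Iterator[str]:
    """Wait for the whole utterance, then translate it in one pass."""
    buffered = list(chunks)  # nothing is emitted until the input ends
    yield translate(buffered)


def streaming(chunks: Iterable[str], translate: Translate, lag_chunks: int = 2) -> Iterator[str]:
    """Emit output continuously, holding back at most `lag_chunks` chunks."""
    window: Deque[str] = deque()
    for chunk in chunks:
        window.append(chunk)
        if len(window) > lag_chunks:
            # Release the oldest chunk once the buffer exceeds the allowed lag.
            yield translate([window.popleft()])
    # Flush the remaining buffered chunks when the stream ends.
    while window:
        yield translate([window.popleft()])
```

With `lag_chunks=2`, the streaming generator yields its first result after reading only three chunks. The turn-based generator has to consume the entire input first. The fixed-size buffer is a rough stand-in for the model's few-second lag.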
Technical Implementation and Integration
Developers can integrate the model via the Gemini Live API by configuring a translationConfig block within the generationConfig. Key parameters include:
targetLanguageCode: Uses BCP-47 language tags (e.g., "es", "pl") to define the output language.
echoTargetLanguage: A boolean toggle that determines whether the model should repeat input that is already in the target language.
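Below is a minimal sketch of the setup payload. It assumes the raw WebSocket JSON form of the Live API. Only `generationConfig`, `translationConfig`, `targetLanguageCode`, `echoTargetLanguage`, and the model name come from the description above. The surrounding `setup`/`model` envelope and the `models/` prefix are assumptions, so check them against the Live API reference. `build_setup_message` is a hypothetical helper. Its tag check is deliberately loose and is not a full BCP-47 validator.

```python
import re

MODEL = "gemini-3.5-live-translate-preview"

# Loose BCP-47 shape check: a 2-3 letter primary language subtag plus optional
# subtags such as "es", "pl", or "pt-BR". A simplification, not full RFC 5646.
_BCP47_LIKE = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*$")


def build_setup_message(target_language_code: str, *, echo_target_language: bool) -> dict:
    """Return a hypothetical Live API setup payload carrying a translationConfig."""
    if not _BCP47_LIKE.match(target_language_code):
        raise ValueError(f"not a BCP-47-like tag: {target_language_code!r}")
    return {
        "setup": {
            # ASSUMPTION: the model is referenced as "models/<name>".
            "model": f"models/{MODEL}",
            "generationConfig": {
                "translationConfig": {
                    "targetLanguageCode": target_language_code,
                    "echoTargetLanguage": echo_target_language,
                }
            },
        }
    }
```

For example, `build_setup_message("pl", echo_target_language=False)` builds a payload for Polish output in which the model does not repeat speech that is already in Polish. Making `echo_target_language` a required keyword avoids guessing at the API's default.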
The system requires specific raw audio formats: 16-bit PCM at 16 kHz (mono, little-endian) for input, and 16 kHz or 24 kHz PCM for output. Input audio is sent in 100 ms chunks. Developers are encouraged to use ephemeral tokens on the v1alpha endpoint so that client-side applications do not expose long-lived API keys. The model is designed to handle noisy, unpredictable environments, which makes it suitable for live meetings, broadcasts, and direct communication apps such as those Grab is currently testing.
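The chunk size follows directly from the input format. At 16,000 samples/s, 0.1 s of audio is 1,600 samples, and at 2 bytes per 16-bit sample that comes to 3,200 bytes per 100 ms chunk. The Python sketch below does three things: it packs samples as little-endian 16-bit PCM, splits the stream into chunks of that size, and computes the duration of returned audio at either output rate. The `realtimeInput`/`audio`/`mimeType` message shape and the `audio/pcm;rate=16000` MIME string are assumptions based on the general Live API and have not been confirmed for this model. All helper names are hypothetical.

```python
import base64
import struct
from typing import Iterator, Sequence

INPUT_RATE_HZ = 16_000  # 16 kHz mono input
BYTES_PER_SAMPLE = 2    # 16-bit PCM
CHUNK_MS = 100          # chunk duration

# 16,000 samples/s * 0.1 s = 1,600 samples; 1,600 samples * 2 bytes = 3,200 bytes.
CHUNK_BYTES = INPUT_RATE_HZ * CHUNK_MS // 1000 * BYTES_PER_SAMPLE


def samples_to_pcm16le(samples: Sequence[int]) -> bytes:
    """Pack signed 16-bit integer samples as little-endian mono PCM."""
    return struct.pack(f"<{len(samples)}h", *samples)


def pcm_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Split a PCM byte string into fixed-size chunks (the last may be shorter)."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def to_realtime_message(chunk: bytes) -> dict:
    """Wrap one chunk in an ASSUMED realtimeInput message; verify against the API reference."""
    return {
        "realtimeInput": {
            "audio": {
                "data": base64.b64encode(chunk).decode("ascii"),
                "mimeType": f"audio/pcm;rate={INPUT_RATE_HZ}",
            }
        }
    }


def pcm_duration_seconds(pcm: bytes, rate_hz: int) -> float:
    """Duration of mono 16-bit PCM audio, e.g. output received at 16 kHz or 24 kHz."""
    return len(pcm) / (BYTES_PER_SAMPLE * rate_hz)
```

One second of input therefore splits into ten 3,200-byte chunks. On the output side, the same byte count corresponds to less playback time at 24 kHz than at 16 kHz, so a client has to know which rate it is receiving before it can schedule playback.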