Building Real-Time Speech Translation with Gemini 3.5 Live

Continuous Streaming vs. Turn-Based Interaction

Gemini 3.5 Live Translate (gemini-3.5-live-translate-preview) replaces the traditional turn-based conversational model with a continuous stream-processing pipeline. Unlike standard AI agents, which wait for a speaker to finish a sentence before processing, this model translates audio in real time as it streams. The design prioritizes low latency, keeping the output just a few seconds behind the speaker. To sustain that performance, the model is stripped of general agent capabilities: it does not support text input, tool use, or system instructions, and functions strictly as a specialized interpreter.

Technical Implementation and Integration

Developers can integrate the model via the Gemini Live API by configuring a translationConfig block within the generationConfig. Key parameters include:

targetLanguageCode: Uses BCP-47 language tags (e.g., "es", "pl") to define the output language.
echoTargetLanguage: A boolean toggle that determines whether the model should repeat input that is already in the target language.
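
As a rough illustration of the two parameters above, the following Python sketch assembles a generationConfig dictionary containing a translationConfig block. The field names translationConfig, targetLanguageCode, and echoTargetLanguage come from the description above; the helper names and the outer "model" envelope are assumptions made for this example, not a confirmed request schema.

```python
import json

# Model name as given in the article.
MODEL_ID = "gemini-3.5-live-translate-preview"


def build_generation_config(target_language_code: str,
                            echo_target_language: bool) -> dict:
    """Return a generationConfig dict that carries a translationConfig block."""
    if not target_language_code:
        raise ValueError("target_language_code must be a non-empty BCP-47 tag")
    return {
        "translationConfig": {
            # BCP-47 tag for the output language, e.g. "es" or "pl".
            "targetLanguageCode": target_language_code,
            # Whether input already in the target language is repeated.
            "echoTargetLanguage": echo_target_language,
        }
    }


def build_setup_payload(target_language_code: str,
                        echo_target_language: bool) -> dict:
    """Wrap the config with the model name.

    ASSUMPTION: this envelope shape is hypothetical and only illustrates
    where generationConfig might sit in a session setup message.
    """
    return {
        "model": MODEL_ID,
        "generationConfig": build_generation_config(
            target_language_code, echo_target_language
        ),
    }


if __name__ == "__main__":
    # Print a Spanish-target configuration that does not echo Spanish input.
    print(json.dumps(build_setup_payload("es", False), indent=2))
```

Keeping the translation settings in one small builder makes it easy to switch target languages per session without touching the rest of the connection code.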

The system accepts raw audio only: input must be 16-bit PCM at 16 kHz (mono, little-endian), and output is delivered as PCM at 16 kHz or 24 kHz. Audio is sent in 100 ms chunks. For client-side applications, developers are encouraged to use ephemeral tokens on the v1alpha endpoint rather than embedding API keys. The model is designed to handle noisy, unpredictable environments, making it suitable for live meetings, broadcasts, and direct communication apps such as those currently being tested by Grab.
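
The input format and the 100 ms chunk size determine how many bytes go into each message: 16,000 samples per second, 2 bytes per sample, one channel, for one tenth of a second. The sketch below works out that size and splits a raw PCM buffer accordingly. The constant and function names are hypothetical, and sending the chunks over the Live API connection is deliberately left out.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # input sample rate stated above
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # mono
CHUNK_MS = 100            # chunk duration stated above

# 16,000 samples/s * 2 bytes * 1 channel * 0.1 s = 3,200 bytes per chunk.
CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * CHUNK_MS // 1000


def iter_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield consecutive slices of a raw PCM buffer.

    The final slice may be shorter than chunk_bytes if the buffer
    length is not an exact multiple of the chunk size.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


if __name__ == "__main__":
    # One second of 16-bit mono silence at 16 kHz.
    one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
    sizes = [len(chunk) for chunk in iter_chunks(one_second)]
    print(len(sizes), "chunks of", sizes[0], "bytes")
```

In a live capture loop the same arithmetic applies to the microphone buffer size: reading 1,600 samples at a time yields one chunk per read.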
