GPT-Realtime-2 Enables Natural Multi-Step Voice Agents
Use GPT-Realtime-2 for voice agents that reason like GPT-5, process a 128K-token context (4x the prior 32K), handle interruptions, and maintain long conversations without stalling. Enable preamble phrases like "let me check that" to fill silence during tool calls or multi-step tasks: users hear narration instead of dead air, which addresses a common production failure mode.
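A minimal sketch of wiring the preamble behavior into a session payload. The field names (`model`, `type`, `instructions`) and the prompt-level approach are illustrative assumptions, not a confirmed API schema:

```python
# Hypothetical session configuration for a GPT-Realtime-2 voice agent.
# Field names and the prompt-based preamble technique are assumptions,
# not confirmed API parameters.

def build_agent_session(instructions: str, enable_preamble: bool = True) -> dict:
    """Assemble a session payload that asks the model to narrate tool calls."""
    session = {
        "model": "gpt-realtime-2",
        "type": "voice-agent",
        "instructions": instructions,
    }
    if enable_preamble:
        # Instruct the model to speak a short filler phrase before
        # long-running tool calls, so the user never hears dead air.
        session["instructions"] += (
            " Before any tool call that may take more than a moment, "
            'say a brief preamble such as "let me check that".'
        )
    return session

config = build_agent_session("You are a booking assistant.")
```

The same effect could also be achieved server-side if the API exposes a dedicated preamble flag; the prompt-level version above is the most portable assumption.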
Tune reasoning across five levels (minimal, low, medium, high, xhigh; the default is low to keep latency down) to balance speed against depth: quick lookups stay fast, while complex bookings get full compute. Adjust tone dynamically: calm for troubleshooting, empathetic when the user is frustrated, upbeat after resolution. The model also handles industry-specific vocabulary, such as healthcare terminology.
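One way to apply the speed-versus-depth tradeoff is to route requests to a reasoning level by task type. The level names come from the text; the routing heuristic and task categories below are assumptions for illustration:

```python
# Sketch: map task complexity to one of the five documented reasoning levels.
# The task categories and thresholds are illustrative assumptions.
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def pick_reasoning(task: str) -> str:
    quick = {"lookup", "faq", "status"}
    complex_tasks = {"booking", "troubleshooting", "multi-step"}
    if task in quick:
        return "minimal"   # fastest replies for simple queries
    if task in complex_tasks:
        return "high"      # full compute for multi-step work
    return "low"           # the documented default, tuned for low latency
```

In practice the routing signal might come from an intent classifier rather than a hand-written set, but the shape of the decision is the same.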
Benchmarks back up the gains: at high reasoning the model scores 96.6% on Big Bench Audio for audio reasoning (vs 81.4% for GPT-Realtime-1.5, +15.2 points); at xhigh it scores 48.5% on Audio MultiChallenge (vs 34.7%), which tests multi-turn dialogue, instruction following, and corrections. Pricing: $32 per 1M input tokens ($0.40 per 1M cached), $64 per 1M output tokens.
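The quoted token prices make per-session costs easy to estimate. A worked example (the token counts are hypothetical; the rates are the ones quoted above):

```python
# Quoted rates: $32 per 1M input tokens, $0.40 per 1M cached input tokens,
# $64 per 1M output tokens.
PRICE_INPUT = 32.00 / 1_000_000
PRICE_CACHED = 0.40 / 1_000_000
PRICE_OUTPUT = 64.00 / 1_000_000

def session_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Dollar cost of one session, splitting fresh vs cached input tokens."""
    return fresh_in * PRICE_INPUT + cached_in * PRICE_CACHED + out * PRICE_OUTPUT

# 50K fresh input + 50K cached input + 20K output:
# 1.60 + 0.02 + 1.28 ~= $2.90
cost = session_cost(50_000, 50_000, 20_000)
```

Note how heavily caching matters: the same 50K input tokens cost $1.60 fresh but only $0.02 cached.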
Dedicated Pipes for Translation and Streaming Transcription
Pipe speech through GPT-Realtime-Translate for live translation from 70+ input languages into 13 output languages at the speaker's pace, ideal for bilingual support or live events. It lacks agent reasoning (use GPT-Realtime-2 for that) and costs $0.034/min.
Stream transcripts in real time with GPT-Realtime-Whisper: tune the latency for faster partial text (lower delay) or higher quality (more delay). It beats batch Whisper for live captions, meeting notes, or continuous agent input, and at $0.017/min it keeps voice apps feeling responsive.
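The latency/quality knob described above can be sketched as a set of presets per use case. The `partial_interval_ms` field is a hypothetical name for the partial-result delay; only the tradeoff itself comes from the text:

```python
# Illustrative latency/quality presets for streaming transcription.
# "partial_interval_ms" is a hypothetical parameter name, not a
# confirmed API field.
PRESETS = {
    "captions": {"partial_interval_ms": 200,  "note": "low delay, rougher partials"},
    "meeting":  {"partial_interval_ms": 1000, "note": "balanced"},
    "archive":  {"partial_interval_ms": 3000, "note": "more delay, higher quality"},
}

def transcription_config(use_case: str) -> dict:
    """Build a transcription-session payload from a named preset."""
    preset = PRESETS[use_case]
    return {"model": "gpt-realtime-whisper", "type": "transcription", **preset}
```

Live captions favor the low-delay end; meeting notes or archival transcripts can afford more delay in exchange for cleaner text.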
Production Setup: Session Types and Controls
Select a voice-agent session (reasoning responses), a translation session (language pipe), or a transcription session (STT only). New voices Cedar and Marin are available. The API is now generally available: test in the Playground, then deploy without beta risk. Full details: OpenAI announcement.
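The three session types map one-to-one onto the three models covered above. A hedged sketch of that dispatch; the payload shape is an assumption, while the model-to-type pairing follows the text:

```python
# Sketch: choose a model by session type. The dict shape is illustrative,
# not a confirmed request schema.
def make_session(kind: str) -> dict:
    table = {
        "voice-agent":   {"model": "gpt-realtime-2"},        # reasoning responses
        "translation":   {"model": "gpt-realtime-translate"},# language pipe
        "transcription": {"model": "gpt-realtime-whisper"},  # STT only
    }
    if kind not in table:
        raise ValueError(f"unknown session type: {kind}")
    return {"type": kind, **table[kind]}
```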