OpenAI's Realtime Voice Models Add Reasoning, Translation, Transcription

GPT-Realtime-2 Enables Voice Agents to Reason, Use Tools, and Recover Gracefully

Build voice-to-action agents that handle complex requests by reasoning with GPT-5-class intelligence while maintaining natural conversation flow. Increase context from 32K to 128K tokens for longer sessions supporting agentic workflows like Zillow's home search ("find homes within BuyAbility, avoid busy streets, schedule tour"). Key upgrades include preambles (e.g., "let me check that") to signal tool use, parallel tool calls with audible transparency ("checking calendar"), and adjustable reasoning levels (minimal to xhigh, default low) to balance latency and depth—low for quick chats, xhigh for puzzles. Stronger recovery says "I’m having trouble" instead of failing silently, better tone control (calm for issues, upbeat for wins), and domain retention for terms like healthcare vocab. On evals, GPT-Realtime-2 (high) scores 15.2% higher than GPT-Realtime-1.5 on Big Bench Audio for reasoning; (xhigh) 13.8% higher on Audio MultiChallenge for instruction following/context. Zillow reports 26-point call success lift (95% vs 69%) and better Fair Housing compliance.

Live Multilingual Translation Preserves Natural Speech Pace

Use GPT-Realtime-Translate for voice-to-voice across languages, supporting 70+ inputs to 13 outputs, ideal for support (Deutsche Telekom), sales, education. It keeps up with speakers' natural pace, regional accents, and domain terms without losing meaning—BolnaAI saw 12.5% lower Word Error Rates on Hindi/Tamil/Telugu vs competitors, plus lower fallbacks/higher completion. Combine with transcripts for global events/media; Vimeo demo translates product videos live.

Low-Latency Transcription Powers Real-Time Workflows

GPT-Realtime-Whisper streams speech-to-text as spoken, enabling instant captions, in-progress meeting notes/summaries, continuous voice agent input, and fast follow-ups in support/healthcare/sales. Reduces perceived latency for responsive apps.

Production Trade-offs: Safety, Pricing, Availability

Active classifiers halt harmful sessions per usage policies; add custom guardrails via Agents SDK. Supports EU data residency/enterprise privacy. Pricing: GPT-Realtime-2 at $32/1M input tokens ($0.40 cached), $64/1M output; Translate $0.034/min; Whisper $0.017/min. Test in Playground; build via Codex prompt for WebRTC agent with sample check_calendar tool.

GPT-Realtime-2 Enables Voice Agents to Reason, Use Tools, and Recover Gracefully

Live Multilingual Translation Preserves Natural Speech Pace

Low-Latency Transcription Powers Real-Time Workflows

Production Trade-offs: Safety, Pricing, Availability

More on Edge

DART: Improving Agent Reliability via Semantic Recoverability

Claude Dreaming: 6x Agent Boost via Memory Cron Jobs

Build Agent Evals: Traces to Experiments

Gemini Enables Agentic Tasks and Prompt-Based Widgets on Android