Build Voice Agents with GPT-5 Reasoning at Low Latency

OpenAI's GPT-Realtime-2 handles complex live voice tasks (tracking context, calling tools, handling interruptions) with reasoning that matches GPT-5. Its context window grows from 32k to 128k tokens, which supports longer dialogues. The model can run tool calls in parallel and cover the wait with audible feedback such as "let me check that" or a short preamble ("one moment"), buying thinking time without silence. Reasoning effort is adjustable across five levels, from minimal to xhigh, with low as the default for speed. The model can also adapt its delivery, using a calm tone for problem-solving or an empathetic one for frustrated users. It handles specialized vocabulary well, including medical terms and proper names. Benchmarks show gains over the prior model: 96.6% accuracy on Big Bench Audio at high reasoning (vs 81.4%) and a 48.5% pass rate on Audio MultiChallenge at xhigh (vs 34.7%). Overall it beats GPT-Realtime-1.5, making it a more reliable base for production agents.
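The preamble-plus-parallel-tools behavior described above can be mirrored on the application side. The following is a minimal sketch, not the Realtime API itself: `lookup_order`, `check_inventory`, `handle_turn`, and the `speak` callback are hypothetical names introduced for illustration, and the sleeps stand in for real backend latency.

```python
import asyncio

# Hypothetical tools; real ones would call your own backends.
async def lookup_order(order_id: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for network latency
    return {"order_id": order_id, "status": "shipped"}

async def check_inventory(sku: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for network latency
    return {"sku": sku, "in_stock": True}

async def handle_turn(speak, tool_calls):
    """Speak a preamble first, then run every tool call concurrently."""
    speak("One moment, let me check that.")
    # gather() runs the coroutines concurrently and keeps results in call order.
    results = await asyncio.gather(*(fn(arg) for fn, arg in tool_calls))
    return list(results)

def run_demo():
    spoken = []  # collects what would be sent to the audio output
    calls = [(lookup_order, "A123"), (check_inventory, "SKU-9")]
    results = asyncio.run(handle_turn(spoken.append, calls))
    return spoken, results

if __name__ == "__main__":
    print(run_demo())
```

Because the two calls run concurrently, the user waits roughly as long as the slowest tool rather than the sum of both, and the preamble fills that gap.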

Three Patterns for Voice-Driven Products

Combine the models into three patterns for real-world apps. Voice-to-Action: the user speaks a request, and the AI reasons, calls tools, and executes it (e.g., a booking). Systems-to-Voice: an app speaks contextual guidance (e.g., a travel app reroutes a traveler after a delay and confirms their luggage). Voice-to-Voice: people converse across languages (Deutsche Telekom is testing this for support). These capabilities roll out soon to ChatGPT audio, positioning voice as a primary interface for customer support, sales, and education.

Add Translation and Transcription for Workflows

GPT-Realtime-Translate covers 70+ input languages and 13 output languages, preserving meaning through accents and mid-conversation language switches, which makes it well suited to global support and events. GPT-Realtime-Whisper streams low-latency captions for meetings and classrooms, and can generate live notes and summaries that speed up follow-ups in healthcare and recruiting. All are live now in the Realtime API (with EU residency supported) and in the Playground; test combinations of them to build hybrid agents.

Per-Token and Per-Minute Pricing for Scalable Deployment

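GPT-Realtime-2 costs $32 per million audio input tokens ($0.40 per million when cached) and $64 per million output tokens. GPT-Realtime-Translate costs $0.034 per minute, and GPT-Realtime-Whisper costs $0.017 per minute. Enterprise privacy terms apply. At these rates, high-volume voice products become practical where teams might otherwise have stayed text-only.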
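To make the rates concrete, here is a small cost estimator built only from the prices quoted above. The function names and the sample token and minute counts are illustrative assumptions, not measured usage.

```python
# Rates quoted above, in USD.
AUDIO_INPUT_PER_M = 32.00    # GPT-Realtime-2, per million audio input tokens
CACHED_INPUT_PER_M = 0.40    # GPT-Realtime-2, per million cached input tokens
OUTPUT_PER_M = 64.00         # GPT-Realtime-2, per million output tokens
TRANSLATE_PER_MIN = 0.034    # GPT-Realtime-Translate, per minute
WHISPER_PER_MIN = 0.017      # GPT-Realtime-Whisper, per minute

def realtime_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Token-based cost for GPT-Realtime-2."""
    total = (
        input_tokens * AUDIO_INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    )
    return total / 1_000_000

def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Minute-based cost for the translation or transcription models."""
    return minutes * rate_per_minute

if __name__ == "__main__":
    # Hypothetical workload: 1M fresh input, 500k cached, 250k output tokens.
    print(f"Realtime-2: ${realtime_cost(1_000_000, 500_000, 250_000):.2f}")
    print(f"Translate, 60 min: ${per_minute_cost(60, TRANSLATE_PER_MIN):.2f}")
    print(f"Whisper, 60 min: ${per_minute_cost(60, WHISPER_PER_MIN):.2f}")
```

For the sample workload, the token cost is 32.00 + 0.20 + 16.00 = $48.20, while an hour of translation costs $2.04 and an hour of transcription costs $1.02.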