Core Capabilities Unlock Responsive Voice Agents

GPT-Realtime-2 shifts voice AI from basic speech wrappers to full-duplex agents with GPT-5-class reasoning. It supports mid-conversation tool use, interruption handling, and recovery phrases such as "I'm having trouble with that." Reasoning effort is adjustable (minimal, low as the default, medium, high, xhigh) to trade depth against latency: time-to-first-audio is 1.12s at minimal versus 2.33s at high. Preambles such as "let me check that" keep the conversation flowing naturally. Parallel tool calls add transparency through audible updates like "checking your calendar," while a 128K context window (up from 32K) and 32K max output tokens sustain long sessions. Retention of domain-specific terminology, proper nouns, and healthcare vocabulary improves, and tone is controllable (calm, empathetic, upbeat). Inputs include text, audio, and images, which suits production agents in support, robotics, or hands-free control.
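The latency trade-off above can be treated as a simple budget check. The following is a minimal sketch, not an official API: `pick_effort` and `TTFA_SECONDS` are hypothetical names, and only the two time-to-first-audio figures quoted in the text are included, since latencies for the other effort levels are not given.

```python
from typing import Optional

# Time-to-first-audio figures quoted in the text. Latencies for the other
# effort levels (low, medium, xhigh) are not stated, so they are left out
# rather than guessed.
TTFA_SECONDS = {"minimal": 1.12, "high": 2.33}


def pick_effort(budget_seconds: float) -> Optional[str]:
    """Return the deepest effort level whose time-to-first-audio fits the budget.

    Hypothetical helper for illustration; returns None when nothing fits.
    """
    fitting = [
        (ttfa, name)
        for name, ttfa in TTFA_SECONDS.items()
        if ttfa <= budget_seconds
    ]
    # The level with the largest latency that still fits is taken as the
    # deepest reasoning the budget allows.
    return max(fitting)[1] if fitting else None
```

With a 1.5s budget this picks `minimal`; with a 3s budget it picks `high`. A real deployment would likely measure its own latencies rather than rely on published figures.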

Benchmark Dominance Validates Production Readiness

Independent evals confirm leadership: GPT-Realtime-2 scores 96.6% on Big Bench Audio speech-to-speech (up 15.2 percentage points from realtime-1.5's 81.4% and nearing saturation), 96.1% on Conversational Dynamics for pause and turn-taking handling, and tops Scale AI's Audio MultiChallenge S2S, where instruction retention jumps from 36.7% to 70.8% APR. Enterprise tests show a 42.9% helpfulness gain (Glean) and a 26% uplift in effective conversation rate (Genspark), with fewer drops. Pricing holds steady at $1.15/hour input and $4.61/hour output; the emphasis is on usability rather than voice quality alone.
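The arithmetic behind these figures is easy to check. The sketch below separates percentage-point gains from relative gains and estimates session cost from the hourly rates above. The function names are hypothetical, and the cost helper assumes billing scales linearly with audio duration, which the text does not confirm.

```python
INPUT_USD_PER_HOUR = 1.15    # rate quoted in the text
OUTPUT_USD_PER_HOUR = 4.61   # rate quoted in the text


def point_gain(old_pct: float, new_pct: float) -> float:
    """Absolute gain in percentage points."""
    return round(new_pct - old_pct, 1)


def relative_gain(old_pct: float, new_pct: float) -> float:
    """Gain relative to the old score, in percent."""
    return round((new_pct - old_pct) / old_pct * 100, 1)


def session_cost(input_minutes: float, output_minutes: float) -> float:
    """Estimated USD cost, ASSUMING linear per-minute billing of audio."""
    cost = (input_minutes / 60) * INPUT_USD_PER_HOUR
    cost += (output_minutes / 60) * OUTPUT_USD_PER_HOUR
    return round(cost, 2)
```

For Big Bench Audio, `point_gain(81.4, 96.6)` gives 15.2 points, while the relative improvement is about 18.7%. Under the linear-billing assumption, a call with 30 minutes of input audio and 10 minutes of output audio would come to roughly $1.34.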

Companion Models and Integrations Accelerate Use Cases

GPT-Realtime-Translate enables live dubbing from 70+ input languages to 13 output languages (e.g., Vimeo's no-prep captions), while GPT-Realtime-Whisper streams low-latency transcription for captions and notes. Demos span Genspark's call agents, voice-controlled dashboards (with intents like "Focus on Apple"), game agents with subagents, and robotics queries. OpenAI's prompting guide stresses state management, entity capture, unclear-audio recovery, and tool UX: design voice apps as stateful systems with latency budgets, not stateless endpoints. ChatGPT voice upgrades are pending, but API availability lets developers build now for translation, meetings, and browser control.
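To make the "stateful system" idea concrete, here is a minimal sketch of per-call state covering entity capture, unclear-audio recovery, and a latency budget. Everything in it is an illustrative assumption: `VoiceSession`, its fields, the 0.5 confidence threshold, and the 1.5s budget are hypothetical and not taken from OpenAI's guide or any SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VoiceSession:
    """Hypothetical per-call state; not part of any OpenAI SDK."""
    latency_budget_s: float = 1.5   # assumed budget before the user hears audio
    min_confidence: float = 0.5     # assumed threshold for "unclear audio"
    entities: Dict[str, str] = field(default_factory=dict)
    unclear_turns: int = 0

    def needs_preamble(self, expected_tool_seconds: float) -> bool:
        """Speak a filler such as "let me check that" if a tool call would
        exceed the latency budget."""
        return expected_tool_seconds > self.latency_budget_s

    def handle_turn(self, confidence: float,
                    entities: Optional[Dict[str, str]] = None) -> str:
        """Update state for one user turn and return what the agent should say."""
        if confidence < self.min_confidence:
            # Unclear audio: ask again, escalating after repeated failures.
            # Captured entities are left untouched.
            self.unclear_turns += 1
            if self.unclear_turns >= 2:
                return "I'm having trouble with that. Could you say it another way?"
            return "Sorry, I didn't catch that. Could you repeat it?"
        self.unclear_turns = 0
        self.entities.update(entities or {})  # entity capture persists across turns
        return "ok"
```

After a turn like "Focus on Apple" is handled with `{"company": "Apple"}`, later turns can resolve "show its revenue" from `session.entities` instead of asking again, which is the practical difference from a stateless endpoint.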