Closed-Loop Audio Drives Contextual Adaptation
Traditional TTS falls short in conversation because it ignores the user's audio signals: the sarcasm in "okay, fine," or the pacing after a joke. TTS-2 addresses this by taking the full prior-turn audio as direct input, carrying tone, emotion, and rhythm across exchanges without developers wiring up prior_audio fields. This automatic context makes responses feel attentive: relieved after good news, somber after bad. Run it in a persistent Realtime session and the flow stays seamless across the whole conversation.
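To make the loop concrete, here is a minimal sketch of a persistent Realtime session in Python. The endpoint URL and the message schema ("user_audio", "agent_text", "audio_chunk", "turn_end") are illustrative assumptions, not Inworld's documented contract; the point is that the client only ever sends user audio and agent text, and the server carries acoustic context between turns on its own.

```python
# Minimal sketch of a persistent Realtime session. The endpoint URL and the
# message schema ("user_audio", "agent_text", "audio_chunk", "turn_end") are
# illustrative assumptions, not Inworld's documented contract.
import base64
import json

import websockets  # pip install websockets

REALTIME_URL = "wss://api.example.com/realtime"  # hypothetical endpoint


async def converse(turns: list[tuple[bytes, str]]) -> None:
    # One long-lived socket per conversation: the server has already heard
    # each user turn, so the client never sends a prior_audio field.
    async with websockets.connect(REALTIME_URL) as ws:
        for user_audio, agent_text in turns:
            # Stream the user's turn; the server keeps it as acoustic context.
            await ws.send(json.dumps({
                "type": "user_audio",
                "audio": base64.b64encode(user_audio).decode(),
            }))
            # Request the agent's reply; tone adapts to what was just heard.
            await ws.send(json.dumps({"type": "agent_text", "text": agent_text}))
            chunks: list[bytes] = []
            async for message in ws:
                event = json.loads(message)
                if event.get("type") == "audio_chunk":
                    chunks.append(base64.b64decode(event["audio"]))
                elif event.get("type") == "turn_end":
                    break
            # Hand `chunks` to your audio sink; the next loop iteration
            # reuses the same socket, so context persists between turns.
```

Kicked off with something like asyncio.run(converse(turns)), the session stays open for the whole exchange, which is what lets each reply react to the turn before it.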
Expressive Controls via Tags and Prompts
Steer output with four integrated capabilities: simple tags like [sad] or [excited]; natural-English prompts such as [speak sadly, as if something bad just happened]; paralinguistic acts including [laugh], [sigh], [breathe], [clear_throat], and [cough]; and disfluencies like uh, um, self-corrections, or mid-phrase pauses that vary by speaker profile (e.g., energetic vs. hesitant fillers). Voice cloning takes two steps: upload 5–15 seconds of clean audio to /voices/v1/voices:clone, take the returned voice ID, then use it like any other voice for a consistent identity across languages.
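The sketch below walks both steps and then synthesizes tagged text with the cloned voice. Only the /voices/v1/voices:clone path comes from the description above; the host, auth header, field names, and the synthesis endpoint are placeholder assumptions, so check the API reference for the real shapes.

```python
# Hedged sketch of the two-step cloning flow plus tag-steered synthesis.
# Only /voices/v1/voices:clone is taken from the text; the base URL, auth
# scheme, form fields, response field, and synthesis endpoint are assumed.
import requests

BASE_URL = "https://api.inworld.ai"              # assumed host
HEADERS = {"Authorization": "Basic <API_KEY>"}   # assumed auth scheme

# Step 1: upload 5-15 seconds of clean reference audio and keep the ID.
with open("reference.wav", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/voices/v1/voices:clone",
        headers=HEADERS,
        files={"audio": f},                       # assumed field name
        data={"display_name": "my-cloned-voice"}, # assumed field name
    )
resp.raise_for_status()
voice_id = resp.json()["voice_id"]               # assumed response field

# Step 2: use the cloned voice like any other, steering delivery with a
# natural-language prompt, a disfluency, and a paralinguistic act inline.
text = "[speak sadly, as if something bad just happened] I, um... [sigh] I read your note."
tts = requests.post(
    f"{BASE_URL}/tts/v1/voice",                  # assumed synthesis endpoint
    headers=HEADERS,
    json={"voiceId": voice_id, "text": text},    # assumed request shape
)
tts.raise_for_status()
with open("reply.wav", "wb") as out:
    out.write(tts.content)
```

Because the tags live inline in the text, the same request shape covers plain narration and heavily acted lines; only the string changes.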
Full Pipeline for Low-Latency Voice Agents
TTS-2 slots into Inworld's stack: Realtime STT profiles the user (age, accent, pitch, emotion, pacing) in a single pass; the Router selects from 200+ models based on context; and everything runs over a single WebSocket with sub-200ms median TTS time-to-first-audio. The prior version, TTS 1.5, tops the Artificial Analysis Speech Arena leaderboard ahead of Google (#2) and ElevenLabs (#3), establishing the quality baseline that TTS-2 extends with these behavioral advances.
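As a rough picture of that single-socket turn, the sketch below shows a client sending one user turn and timing how long the first TTS audio chunk takes to come back. The endpoint and event names ("stt_profile", "route", "tts_chunk") are illustrative assumptions rather than a documented schema; the structure just mirrors the STT-to-Router-to-TTS order described above.

```python
# Sketch of one turn over a single pipeline WebSocket, instrumented to
# measure TTS time-to-first-audio. Endpoint and event names are assumed.
import json
import time

import websockets  # pip install websockets

PIPELINE_URL = "wss://api.example.com/agent"  # hypothetical endpoint


async def handle_turn(user_audio_b64: str) -> float:
    async with websockets.connect(PIPELINE_URL) as ws:
        await ws.send(json.dumps({"type": "user_audio", "audio": user_audio_b64}))
        sent_at = time.monotonic()
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "stt_profile":
                # One-pass profile: transcript plus age/accent/pitch/emotion/pacing.
                print("profile:", event["profile"])
            elif event.get("type") == "route":
                # The Router's pick from the 200+ model pool for this context.
                print("model:", event["model"])
            elif event.get("type") == "tts_chunk":
                # First audio bytes back mark time-to-first-audio.
                return time.monotonic() - sent_at
    return float("inf")
    # Example entry point: asyncio.run(handle_turn(encoded_wav))
```

Measuring from the client side like this includes network round-trip time, so readings will sit above the server-side sub-200ms median the stack targets.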