Autoregressive Transformers Generate Audio Frames for Streaming

Modern TTS architectures mirror LLMs by treating audio as token sequences in an autoregressive decoder-only transformer backbone, emitting one audio frame (80 ms, ~12 frames/sec) per step instead of raw samples. This enables streaming: the first frame can be played while the rest is still being generated, slashing perceived latency in voice agents. Mistral's 4B-parameter open-weight TTS model demonstrates this: 17 ms from text input to the first playable audio packet on a single GPU. Earlier approaches generated audio sample by sample (slow) or the whole waveform at once (high latency); frame-wise autoregression has won out because transformers excel at sequence modeling, though it requires compressing each frame so a sentence does not take thousands of steps.
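
A minimal sketch of that frame-wise streaming loop, using stand-in `StubModel` and `StubCodec` classes; the 37-tokens-per-frame figure comes from the text, while the 24 kHz sample rate and method names are illustrative assumptions, not Mistral's actual interface:

```python
# Frame-wise autoregressive streaming sketch: one decoder step = one 80 ms frame,
# and each frame is yielded for playback as soon as it is decoded.
from typing import Iterator
import numpy as np

FRAME_MS = 80          # one autoregressive step covers 80 ms (~12.5 frames/s)
SAMPLE_RATE = 24_000   # assumed codec sample rate
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000


class StubModel:
    """Placeholder decoder-only transformer: returns random frame tokens."""
    def predict_frame_tokens(self, text: str, history: list[np.ndarray]) -> np.ndarray:
        return np.random.randint(0, 1024, size=37)  # 37 codec tokens per frame


class StubCodec:
    """Placeholder neural codec decoder: expands frame tokens to a waveform chunk."""
    def decode_frame(self, tokens: np.ndarray) -> np.ndarray:
        return np.zeros(SAMPLES_PER_FRAME, dtype=np.float32)  # silence stand-in


def stream_tts(text: str, model: StubModel, codec: StubCodec,
               max_frames: int = 100) -> Iterator[np.ndarray]:
    """Yield playable 80 ms audio chunks as each frame is generated."""
    history: list[np.ndarray] = []
    for _ in range(max_frames):
        tokens = model.predict_frame_tokens(text, history)  # one decoder step
        history.append(tokens)                               # feed back autoregressively
        yield codec.decode_frame(tokens)                      # playback can start here


for i, chunk in enumerate(stream_tts("Hello from a streaming TTS sketch.",
                                     StubModel(), StubCodec(), max_frames=3)):
    print(f"frame {i}: {chunk.shape[0]} samples (~{FRAME_MS} ms) ready for playback")
```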

Demos show the real impact: clone a voice from 3-5 seconds of reference audio (e.g., a clone of Paul's voice reproduces his intonation across languages, preserving the French accent), then stream agent responses such as answers to conference-schedule queries. The agent pipeline is STT → fast LLM → streaming TTS, where partial audio plays before generation finishes, enabling natural conversations even while compute is still running.
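
A toy sketch of that pipeline, with hypothetical `transcribe`, `llm_stream`, and `tts_stream` stubs standing in for real STT, LLM, and TTS services; the point is only that playback begins while the LLM is still emitting text:

```python
# STT -> fast LLM -> streaming TTS, wired as chained generators so stages overlap.
from typing import Iterator


def transcribe(audio_path: str) -> str:
    """Stub STT: return the user's utterance as text."""
    return "What talks are on the schedule this afternoon?"


def llm_stream(prompt: str) -> Iterator[str]:
    """Stub LLM: yield the response a few words at a time."""
    words = "At 2pm there is a session on streaming speech models.".split()
    for i in range(0, len(words), 3):
        yield " ".join(words[i:i + 3]) + " "


def tts_stream(text_chunks: Iterator[str]) -> Iterator[bytes]:
    """Stub streaming TTS: emit an audio packet per incoming text chunk."""
    for chunk in text_chunks:
        yield f"<audio for: {chunk.strip()}>".encode()


def voice_agent(audio_path: str) -> None:
    user_text = transcribe(audio_path)                 # STT
    for packet in tts_stream(llm_stream(user_text)):   # LLM and TTS run interleaved
        print("play:", packet.decode())                # playback starts before the LLM finishes


voice_agent("user_question.wav")
```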

Neural Audio Codecs Compress 200kbps to ~500 Tokens/Second

Raw audio's ~200 kbps bitrate overwhelms transformers (a text token carries only ~10 bits), so neural codecs encode each 80 ms frame into 37 tokens (~500 tokens/sec in total, versus the ~15 bits/sec of semantic information that speech-to-text preserves). Codecs are trained with reconstruction losses, adversarial objectives, and bottlenecks so they discard noise while retaining timbre, prosody, and enough semantics to reconstruct the text. Decoding reverses this: the transformer predicts a frame's tokens, and the codec decoder expands them back into audio.
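
A quick back-of-the-envelope check of those numbers; the 1024-entry codebook (10 bits per token) is an assumption, the other constants come from the text above:

```python
# Arithmetic behind the compression claims: raw audio vs. codec tokens vs. text.
RAW_BITRATE_BPS = 200_000          # raw audio, ~200 kbps
FRAME_MS = 80                      # one codec frame
FRAMES_PER_SEC = 1000 / FRAME_MS   # = 12.5
TOKENS_PER_FRAME = 37
BITS_PER_TOKEN = 10                # assumed 1024-entry codebook
TEXT_BPS = 15                      # semantic content of speech, per the text

tokens_per_sec = FRAMES_PER_SEC * TOKENS_PER_FRAME   # ~462 tokens/s, i.e. "~500"
codec_bps = tokens_per_sec * BITS_PER_TOKEN          # ~4.6 kbps
print(f"{tokens_per_sec:.0f} tokens/s, ~{codec_bps / 1000:.1f} kbps "
      f"({RAW_BITRATE_BPS / codec_bps:.0f}x smaller than raw audio, "
      f"{codec_bps / TEXT_BPS:.0f}x richer than text)")
```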

Variants optimize compute: most labs use per-frame steps in which a small decoder transformer produces the frame's tokens autoregressively; Mistral instead uses flow matching (a diffusion-like method) to generate all 37 tokens of a frame in parallel. For comparison, transcribing speech to text discards ~99.99% of the information, down to a few bits per second; codec tokens retain the acoustics needed for cloning without losing the semantics. The result is high-fidelity output (e.g., cloning the speaker's own voice for a self-debate demo) at a scale that is feasible to model.
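
A schematic contrast of the two per-frame decoders, with stub functions rather than either approach's real implementation: the small autoregressive decoder loops over the 37 tokens, while the flow-matching head refines all of them in parallel over a few steps.

```python
# Two ways to turn one backbone step into a frame's 37 codec tokens (schematic stubs).
import numpy as np

TOKENS_PER_FRAME = 37


def frame_tokens_autoregressive(backbone_state: np.ndarray) -> np.ndarray:
    """Small decoder transformer: predict the frame's tokens one at a time."""
    tokens = []
    for _ in range(TOKENS_PER_FRAME):      # 37 sequential inner steps per frame
        logits = np.random.randn(1024)     # stub for decoder(backbone_state, tokens)
        tokens.append(int(logits.argmax()))
    return np.array(tokens)


def frame_tokens_flow_matching(backbone_state: np.ndarray, steps: int = 8) -> np.ndarray:
    """Flow-matching head: refine all 37 token embeddings in parallel."""
    x = np.random.randn(TOKENS_PER_FRAME, 16)  # start from noise
    for _ in range(steps):                      # a few integration steps
        velocity = -x                           # stub for velocity_net(x, backbone_state)
        x = x + velocity / steps
    return x.argmax(axis=-1)                    # quantize to token ids (schematic)


state = np.zeros(16)
print(frame_tokens_autoregressive(state)[:5], frame_tokens_flow_matching(state)[:5])
```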

Conditioning and Latency Wins in Agents, Plus Open Challenges

For offline TTS, provide the full context upfront (reference audio plus text); the decoder conditions on it via cross-attention or prefixed tokens. Voice cloning embeds a short clip to infer identity, prosody, and even cross-language accents, positioning "vocal identity" as a branding asset in the way visual design already is.
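
A sketch of the prefix-token flavor of conditioning, using hypothetical helpers (`encode_reference_audio`, `encode_text` are stand-ins); a real model might condition via cross-attention instead, as noted above.

```python
# Prefix conditioning: the decoder sees [voice tokens][text tokens] before it
# starts generating audio-frame tokens.
import numpy as np


def encode_reference_audio(wav: np.ndarray, n_tokens: int = 50) -> np.ndarray:
    """Stub speaker/prosody encoder: map a 3-5 s clip to conditioning tokens."""
    return np.random.randint(0, 1024, size=n_tokens)


def encode_text(text: str) -> np.ndarray:
    """Stub text tokenizer."""
    return np.array([hash(w) % 1024 for w in text.split()])


def build_prefix(reference_wav: np.ndarray, text: str) -> np.ndarray:
    """Concatenate voice and text tokens into the decoder's conditioning prefix."""
    return np.concatenate([encode_reference_audio(reference_wav), encode_text(text)])


prefix = build_prefix(np.zeros(24_000 * 4), "Bonjour, welcome to the conference.")
print("conditioning prefix length:", prefix.shape[0])
```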

For agents, stream LLM text tokens directly into the TTS rather than waiting for the full response, either by interleaving text and audio streams or by blending two parallel streams; this avoids the discontinuities that come from stitching separately synthesized chunks. It yields the next latency win: partial LLM output can be voiced immediately, which matters for long responses (e.g., a full page of text). Trade-offs remain: there is no consensus on the best scheme yet, and Mistral plans to explore options (delayed sequence modeling is one candidate). The voice-cloning encoder is proprietary for now, so use the open base voices. Overall, speech interfaces amplify capable LLMs without requiring agents to be rebuilt.
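
A toy illustration of the interleaving idea; the two-frames-per-text-token ratio and the bracketed markers are invented for the example, not a documented format:

```python
# Interleave LLM text tokens with audio-frame slots so synthesis starts
# before the text stream finishes.
from typing import Iterator


def llm_text_tokens() -> Iterator[str]:
    yield from "the next session starts at two pm".split()


def interleave(text_tokens: Iterator[str], frames_per_text_token: int = 2) -> Iterator[str]:
    """Merge streams: after each text token, emit audio-frame slots the TTS fills."""
    frame_id = 0
    for tok in text_tokens:
        yield f"[text:{tok}]"
        for _ in range(frames_per_text_token):
            yield f"[audio-frame:{frame_id}]"   # playable as soon as it is emitted
            frame_id += 1


print(" ".join(interleave(llm_text_tokens())))
```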