KAME: Zero-Latency S2S with Real-Time LLM Oracles

KAME fuses fast direct speech-to-speech (S2S) with LLM smarts via asynchronous oracle injections, hitting 6.4/10 on MT-Bench at Moshi's near-zero latency vs. cascaded 7.7/10 at 2.1s delay.

Bridging S2S Speed and LLM Depth

Direct S2S models like Moshi generate audio tokens every 80ms for near-instant responses but sacrifice factual knowledge to model tone, emotion, and rhythm. Cascaded pipelines—ASR to LLM to TTS—deliver frontier LLM quality but add 2.1s median latency by waiting for full user input, disrupting flow. KAME resolves this by running a Moshi-like front-end S2S in parallel with a streaming STT + LLM back-end, injecting partial LLM text responses (oracles) to guide speech output mid-conversation without retraining the front-end for different LLMs.
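The parallel layout above can be sketched as a loop where the front-end emits one audio-token frame every 80 ms and polls, without blocking, for whatever oracle text the back-end has produced so far. This is a minimal simulation with hypothetical names (the real front-end is a neural model conditioned on oracle tokens, not a string formatter):

```python
# Hypothetical sketch of KAME's two parallel modules: the back-end pushes
# oracle texts into a queue at arbitrary times; the front-end never waits.
from queue import Queue, Empty

def front_end_step(frame_idx, oracle):
    # One 80 ms audio-token step, conditioned on the latest oracle text
    # (or unguided when no oracle has arrived yet).
    guidance = oracle if oracle is not None else "<unguided>"
    return f"frame{frame_idx}|{guidance}"

def run_dialogue(num_frames, oracle_updates):
    """Simulate the async loop; oracle_updates maps frame index -> oracle text
    delivered by the back-end at that frame."""
    oracles = Queue()
    out, oracle = [], None
    for t in range(num_frames):
        if t in oracle_updates:
            oracles.put(oracle_updates[t])   # back-end delivers a revision
        try:
            oracle = oracles.get_nowait()    # non-blocking: zero added latency
        except Empty:
            pass                             # keep speaking with current guidance
        out.append(front_end_step(t, oracle))
    return out

frames = run_dialogue(4, {1: "partial answer", 3: "refined answer"})
# frames[0] is unguided; later frames pick up each oracle revision.
```

The key property is the `get_nowait`: the front-end starts speaking immediately and upgrades its conditioning whenever a fresher oracle lands, which is how KAME keeps Moshi-level latency while borrowing LLM quality.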

Asynchronous Oracle Stream for Progressive Correction

KAME's front-end extends Moshi's three-stream transformer (input audio, inner monologue text, output audio) with a fourth oracle stream. As user speech streams in, back-end STT builds partial transcripts sent periodically to an LLM (e.g., GPT-4.1 or Claude-3-Opus), which generates evolving oracle texts—from rough guesses to refined answers. The front-end conditions its speech on these oracles, correcting mid-sentence like humans do. Both modules run independently, preserving zero-latency starts while upgrading responses in real time. The back-end is plug-and-play: swap GPT-4.1 (stronger on humanities) for Claude-3-Opus (better reasoning) or Gemini-2.5-Flash at inference, with no front-end retraining.
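The back-end side of that loop can be sketched as a generator: every few recognized words, the growing partial transcript is flushed to an LLM, and each reply becomes the next oracle. A stub stands in for the real GPT-4.1/Claude call, and the period and helper names are assumptions for illustration:

```python
# Hypothetical back-end sketch: partial transcripts -> evolving oracle texts.
def stub_llm(transcript):
    # Placeholder for a streaming LLM call (GPT-4.1, Claude-3-Opus, ...).
    return f"answer given: '{transcript}'"

def oracle_stream(words, period=2):
    """Yield (word_index, oracle_text) pairs; each oracle supersedes the last."""
    transcript = []
    for i, word in enumerate(words, start=1):
        transcript.append(word)          # STT appends one recognized word
        if i % period == 0:              # periodic partial-transcript flush
            yield i, stub_llm(" ".join(transcript))

oracles = list(oracle_stream(["what", "is", "the", "capital"], period=2))
# Two oracles: one from "what is", a refined one from the full question.
```

Because each oracle is computed from an incomplete transcript, early ones may be wrong; the front-end's mid-sentence correction is what makes those rough guesses safe to act on.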

Simulated Oracle Training Yields Production Results

Lacking real oracle data, KAME trains with Simulated Oracle Augmentation: a simulator LLM processes 56,582 dialogues from MMLU-Pro, GSM8K, and HSSBench (TTS-converted to audio), generating six hint levels (0: unguided guess; 5: ground truth). On speech-synthesized MT-Bench (reasoning, STEM, humanities), standalone Moshi scores 2.05. KAME + GPT-4.1 hits 6.43; + Claude-3-Opus, 6.23—both at Moshi latency. The top cascaded system, Unmute (GPT-4.1), reaches 7.70 but at 2.1s. Final KAME oracles score 7.79 when evaluated as text, showing the remaining gap comes from speech committed before the final oracle arrives, not from LLM limits. Builders get open weights, inference code, and a back-end-agnostic path to natural voice AI.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge