Achieve Production TTS Without Cloud Dependencies or Preprocessing
Supertonic v3 generates speech from text entirely on-device using public ONNX assets (99M parameters total, 404MB disk footprint), supporting 31 languages plus an 'na' fallback for unknown languages. Expanded from v2's 5 languages (English, Korean, Spanish, Portuguese, French), it now also covers Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese. Install via pip install supertonic; the first run auto-downloads the assets from Hugging Face. Synthesize with a few lines of Python:
from supertonic import TTS
tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1") # M1-M5 male, F1-F5 female
text = "A gentle breeze moved through the open window."
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")
Output is 16-bit WAV, and batched synthesis is supported. Use Supertone's Voice Builder (on GitHub) to train custom voices from your own recordings for edge deployment. v3 reduces v2's repeat/skip failures and improves speaker similarity, with WER/CER competitive with 0.7B-2B models like VoxCPM2 despite being roughly 100x smaller.
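To make the 16-bit WAV output concrete, here is a stdlib-only sketch of what a save step conceptually does: clamp float samples to [-1, 1], scale to signed 16-bit PCM, and write a mono WAV. This is an illustration, not the library's implementation, and the 44.1 kHz sample rate is an assumption; check the model's actual output rate.

```python
import math
import struct
import wave

def write_wav_16bit(path, samples, sample_rate=44100):
    """Write float samples in [-1, 1] as mono 16-bit PCM WAV.
    Illustrative only; sample_rate is an assumed value."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes = 16-bit
        f.setframerate(sample_rate)
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(pcm)

# 0.1 s of a 440 Hz tone as stand-in audio data
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 44100) for n in range(4410)]
write_wav_16bit("tone.wav", tone)
```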
Embed Prosody and Handle Tricky Inputs Natively
Insert expressions inline (<laugh>, <breath>, <sigh>) for natural prosody without extra models or preprocessing steps, e.g. text = "I can't believe it <laugh> that actually worked!". Built-in normalization reads complex forms correctly: "$5.2M" as "five point two million dollars", "(212) 555-0142 ext. 402", "4:45 PM on Wed, Apr 3, 2024", "2.3h" as "two point three hours", and "30kph" as "thirty kilometers per hour". In benchmarks, all four rivals tested (ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, Microsoft) failed these cases; Supertonic v3 handles them out of the box, with no G2P or phonetic markup needed.
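To show how the inline tags interleave with ordinary text, here is a small hypothetical parser. Supertonic consumes the tags natively, so this code only illustrates the input convention, not the library's internals; the function name and chunk format are made up for this sketch.

```python
import re

# Recognized inline expression tags (per the convention above)
TAG_RE = re.compile(r"<(laugh|breath|sigh)>")

def split_expressions(text):
    """Split text into ('text', ...) and ('tag', ...) chunks.
    Purely illustrative; Supertonic does this for you."""
    parts = []
    pos = 0
    for m in TAG_RE.finditer(text):
        if m.start() > pos:
            parts.append(("text", text[pos:m.start()].strip()))
        parts.append(("tag", m.group(1)))
        pos = m.end()
    if pos < len(text):
        parts.append(("text", text[pos:].strip()))
    return parts

print(split_expressions("I can't believe it <laugh> that actually worked!"))
# → [('text', "I can't believe it"), ('tag', 'laugh'), ('text', 'that actually worked!')]
```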
Infer Fast on Edge Hardware, CPU-Only
The flow-matching architecture (2 inference steps) combines a speech autoencoder, a text-to-latent mapper, a duration predictor, LARoPE for alignment, and Self-Purifying Flow Matching for robustness to noisy training data. This enables 0.3x RTF on an Onyx Boox Go 6 e-reader in airplane mode, CPU-only. No GPU is required, and it beats larger models' A100 GPU speed and memory figures. It runs on 11 platforms, including the Python SDK, Flutter (macOS), .NET 9, Go, and the web via onnxruntime-web (WebGPU/WASM). Trade-off: the public assets ship fixed voices; custom voices come via Voice Builder. Ideal for voice UIs, accessibility, and local apps where privacy and low latency matter more than infinite voice variety.
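Real-time factor (RTF) is synthesis wall-clock time divided by the duration of the audio produced, so 0.3x means generating one second of speech takes 0.3 seconds. A quick way to measure it yourself, sketched with an illustrative helper (the stand-in synthesizer below mimics the (wav, duration) return shape of tts.synthesize):

```python
import time

def measure_rtf(synthesize, text):
    """Return (rtf, audio_seconds) for one synthesis call:
    wall-clock time divided by seconds of audio produced."""
    start = time.perf_counter()
    _wav, audio_seconds = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds, audio_seconds

# Stand-in synthesizer: pretends to produce 2.0 s of audio instantly
fake = lambda text: (b"", 2.0)
rtf, secs = measure_rtf(fake, "hello")
print(f"RTF = {rtf:.3f} for {secs:.1f}s of audio")
```

In practice you would pass a wrapper around tts.synthesize and average over several utterances, since short texts make the per-call overhead dominate.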