API Access and Core Capabilities

The model is available through the standard Gemini API under the model ID gemini-3.1-flash-tts-preview. It returns speech audio only, with no text output. Delivery is steered through the prompt itself, which makes it possible to direct high-fidelity audio for complex scenarios such as a radio broadcast.
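As a rough illustration of what an audio-only call might look like, the sketch below builds a request with the Python standard library. Only the model ID comes from the text above. The endpoint, the JSON field names, the placeholder voice "Kore", and the 24 kHz 16-bit mono PCM response format are assumptions carried over from earlier Gemini TTS previews and may not match this model exactly.

```python
import base64
import io
import json
import urllib.request
import wave

MODEL_ID = "gemini-3.1-flash-tts-preview"  # model ID from the text above
# Assumption: the usual generateContent REST endpoint also serves this model.
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL_ID}:generateContent"
)


def build_tts_request(prompt: str, voice_name: str = "Kore") -> dict:
    """Build a JSON body that asks for audio only.

    Assumption: field names follow the request shape of earlier Gemini TTS
    previews; "Kore" is only a placeholder voice name.
    """
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}}
            },
        },
    }


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container.

    Assumption: the API returns 24 kHz, 16-bit, mono PCM; adjust the
    parameters if the response reports a different format.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


def synthesize(prompt: str, api_key: str) -> bytes:
    """Send the prompt and return WAV bytes (performs a network call)."""
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_tts_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumption: audio arrives base64-encoded in the first part's inlineData.
    encoded = payload["candidates"][0]["content"]["parts"][0]["inlineData"]["data"]
    return pcm_to_wav_bytes(base64.b64decode(encoded))
```

Calling synthesize(prompt, api_key) and writing the returned bytes to a .wav file would give a playable result, provided the assumptions above hold for this preview model.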

Structured Prompting for Voice Control

Build prompts with these layered sections for precise audio output:

  • AUDIO PROFILE: Name and tag the voice (e.g., Jaz R., "The Morning Hype").
  • THE SCENE: Set vivid context (e.g., 10:00 PM London studio, ON AIR light, mixing desk chaos) to influence delivery energy.
  • DIRECTOR'S NOTES: Specify:
    • Style: Techniques like "Vocal Smile" for bright, inviting tone via raised soft palate.
    • Dynamics: High projection, punchy consonants, elongated vowels on key words (e.g., "Beauuutiful").
    • Pace: Energetic, bouncing cadence matching fast music, no dead air.
    • Accent: Regional origin (e.g., Brixton Estuary, Newcastle, Exeter in Devon).
  • SAMPLE CONTEXT: Position the voice (e.g., Top 40 radio standard with infectious energy).
  • TRANSCRIPT: The script to be spoken, with delivery tags such as excitedly or shouting marking how each line should be read.

This structure yields consistent, character-driven speech. Changing the accent line alone shifts pronunciation dramatically: moving from Brixton to Newcastle produces thicker Geordie tones, while Exeter gives a softer West Country lilt.
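The layered prompt can be assembled programmatically so that a single field, such as the accent, is varied while everything else stays fixed. The sketch below is one possible way to do that. The VoiceBrief class, the render_prompt function, the plain "LABEL: value" layout, the square-bracket tag syntax, and the sample transcript line are all hypothetical and chosen for illustration; the text above does not specify the exact heading or tag syntax the model expects.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceBrief:
    """One field per prompt section described above (names are hypothetical)."""
    name: str
    tagline: str
    scene: str
    style: str
    dynamics: str
    pace: str
    accent: str
    sample_context: str
    transcript: str


def render_prompt(brief: VoiceBrief) -> str:
    """Lay the sections out as labelled plain text.

    Assumption: the exact heading syntax is not given in the text, so this
    uses simple "LABEL: value" lines separated by blank lines.
    """
    lines = [
        f'AUDIO PROFILE: {brief.name}, "{brief.tagline}"',
        "",
        f"THE SCENE: {brief.scene}",
        "",
        "DIRECTOR'S NOTES:",
        f"Style: {brief.style}",
        f"Dynamics: {brief.dynamics}",
        f"Pace: {brief.pace}",
        f"Accent: {brief.accent}",
        "",
        f"SAMPLE CONTEXT: {brief.sample_context}",
        "",
        "TRANSCRIPT:",
        brief.transcript,
    ]
    return "\n".join(lines)


jaz = VoiceBrief(
    name="Jaz R.",
    tagline="The Morning Hype",
    scene="10:00 PM in a London studio, ON AIR light on, mixing desk chaos.",
    style='"Vocal Smile": bright, inviting tone from a raised soft palate.',
    dynamics="High projection, punchy consonants, elongated vowels on key words.",
    pace="Energetic, bouncing cadence that matches fast music, no dead air.",
    accent="Brixton",
    sample_context="Top 40 radio standard with infectious energy.",
    # Placeholder script; the square-bracket tag syntax is an assumption.
    transcript="[excitedly] Beauuutiful evening, London!",
)

# Vary only the accent; every other section stays identical.
variants = {
    accent: render_prompt(replace(jaz, accent=accent))
    for accent in ("Brixton", "Newcastle", "Exeter, Devon")
}
```

Keeping the brief immutable and deriving variants with replace makes it easy to confirm that two generated clips differ only in the one field being tested.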

Rapid Prototyping with Vibe-Coded Tools

Custom testing UIs can be generated by prompting Gemini 3.1 Pro, as in the shared Gemini conversation (gemini.google.com/share/dd0fba5a83c4), which produced the shareable tool at tools.simonwillison.net/gemini-flash-tts. This speeds up prompt iteration without hand-written UI code, which is useful for experimenting with accents and styles before production integration.