API Access and Core Capabilities
Call the standard Gemini API with the model ID gemini-3.1-flash-tts-preview. The model returns speech audio only and produces no text output. Voice generation is directed through the prompt, which allows high-fidelity audio tailored to complex scenarios such as radio broadcasts.
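As a rough illustration of what such a call might look like, the sketch below builds a request body and wraps the returned audio in a WAV container using only the Python standard library. Several details are assumptions rather than facts from this text: the endpoint URL and JSON field names follow the REST shape documented for earlier Gemini TTS models, the voice name "Kore" is borrowed from those models, and the audio is assumed to come back as base64-encoded 16-bit mono PCM at 24 kHz. The helper names (build_tts_request, pcm_to_wav, synthesize) are hypothetical.

```python
import base64
import io
import json
import urllib.request
import wave

MODEL_ID = "gemini-3.1-flash-tts-preview"  # model ID named in the text
# ASSUMPTION: endpoint shape carried over from the public Gemini REST API.
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL_ID}:generateContent"
)


def build_tts_request(prompt: str, voice_name: str = "Kore") -> dict:
    """Build a JSON body that asks for audio output only.

    ASSUMPTION: field names and the "Kore" voice mirror earlier Gemini TTS models.
    """
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}}
            },
        },
    }


def pcm_to_wav(pcm: bytes, rate: int = 24000) -> bytes:
    """Wrap raw PCM in a WAV header (ASSUMPTION: 16-bit mono at 24 kHz)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buf.getvalue()


def synthesize(prompt: str, api_key: str, voice_name: str = "Kore") -> bytes:
    """Send the request and return WAV bytes. Performs a network call."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_tts_request(prompt, voice_name)).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # ASSUMPTION: audio arrives base64-encoded in the first part's inlineData.
    b64_audio = body["candidates"][0]["content"]["parts"][0]["inlineData"]["data"]
    return pcm_to_wav(base64.b64decode(b64_audio))
```

If these assumptions hold, calling synthesize(prompt, api_key) and writing the result to a .wav file would give a playable clip. Only the two pure helpers can be checked offline.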
Structured Prompting for Voice Control
Build prompts with these layered sections for precise audio output:
- AUDIO PROFILE: Name and tag the voice (e.g., Jaz R., "The Morning Hype").
- THE SCENE: Set vivid context (e.g., 10:00 PM London studio, ON AIR light, mixing desk chaos) to influence delivery energy.
- DIRECTOR'S NOTES: Specify:
  - Style: Techniques like "Vocal Smile" for bright, inviting tone via raised soft palate.
  - Dynamics: High projection, punchy consonants, elongated vowels on key words (e.g., "Beauuutiful").
  - Pace: Energetic, bouncing cadence matching fast music, no dead air.
  - Accent: Regional origin (e.g., Brixton Estuary, Newcastle, Exeter, Devon).
- SAMPLE CONTEXT: Position the voice (e.g., Top 40 radio standard with infectious energy).
- TRANSCRIPT: Write the script to be spoken, marking delivery with inline tags such as [excitedly] or [shouting].
This structure yields consistent, character-driven speech; modifying accent alone shifts phonetics dramatically (Brixton to Newcastle produces thicker Geordie tones, Exeter a softer West Country lilt).
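One way to keep this structure consistent across experiments is to assemble the prompt from named fields and vary a single field at a time. The sketch below is a minimal illustration, not the official prompt format: the VoicePrompt class, the plain "SECTION:" heading layout, the sample transcript line, and the bracketed tag syntax are all assumptions made for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoicePrompt:
    """Hypothetical container for the layered prompt sections."""
    profile: str
    scene: str
    style: str
    dynamics: str
    pace: str
    accent: str
    sample_context: str
    transcript: str

    def render(self) -> str:
        """Join the sections in a fixed order (layout is an assumption)."""
        notes = "\n".join([
            f"Style: {self.style}",
            f"Dynamics: {self.dynamics}",
            f"Pace: {self.pace}",
            f"Accent: {self.accent}",
        ])
        sections = [
            ("AUDIO PROFILE", self.profile),
            ("THE SCENE", self.scene),
            ("DIRECTOR'S NOTES", notes),
            ("SAMPLE CONTEXT", self.sample_context),
            ("TRANSCRIPT", self.transcript),
        ]
        return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


base = VoicePrompt(
    profile='Jaz R., "The Morning Hype"',
    scene="10:00 PM in a London studio, ON AIR light lit, mixing desk chaos.",
    style='"Vocal Smile": bright, inviting tone.',
    dynamics="High projection, punchy consonants, elongated vowels on key words.",
    pace="Energetic, bouncing cadence, no dead air.",
    accent="Brixton",
    sample_context="Top 40 radio standard with infectious energy.",
    # Illustrative line only; the bracketed tag syntax is an assumption.
    transcript="[excitedly] Yes, massive vibes in the studio!",
)

# Change only the accent, leaving every other section identical.
variants = {
    accent: replace(base, accent=accent).render()
    for accent in ("Brixton", "Newcastle", "Exeter, Devon")
}
```

Because each variant differs from the base prompt in exactly one line, any change in the generated audio can be attributed to the accent setting rather than to incidental wording differences.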
Rapid Prototyping with Vibe-Coded Tools
Generate custom testing UIs by prompting Gemini 3.1 Pro, as in the shared notebook (gemini.google.com/share/dd0fba5a83c4). The result can be a shareable tool such as tools.simonwillison.net/gemini-flash-tts. This speeds up prompt iteration without hand-written code, which makes it well suited to experimenting with accents and styles before production integration.