Skip Client-Side History: Leverage Server-Side State in Interactions API
The Gemini Interactions API replaces the older generateContent with a unified interface for models and agents, mirroring OpenAI's chat completions but with built-in server-side state. Start by creating an interaction via `interactions.create`, passing a model (e.g., gemini-2.0-flash), tools, and input. Each response carries an ID you pass as `previousInteractionId` on follow-ups; just send the new user input referencing it. This eliminates appending the full chat history client-side and boosts cache hit rates 2-3x (input tokens are 90% cheaper on cache hits), since the server preserves the exact context without client-side modifications breaking token encodings.
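A minimal sketch of a stateful follow-up, assuming the beta interactions surface on the google-genai client; the field names (`previous_interaction_id`, `interaction.name`) follow the coding-agent example later in this section and may differ in the released API:

```python
import google.genai as genai

client = genai.Client(api_key='...')  # prefer an env var in real code

# First turn: the server stores the full context and returns an interaction ID.
first = client.interactions.create(
    model='gemini-2.0-flash',
    contents=[{'role': 'user', 'text': 'Summarize the design doc.'}],
)

# Follow-up: reference the stored state instead of resending the history.
follow_up = client.interactions.create(
    model='gemini-2.0-flash',
    previous_interaction_id=first.name,  # server-side state, no client history
    contents=[{'role': 'user', 'text': 'Now list the open questions.'}],
)
```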
Core loop for tool-using agents:
- Define tools as JSON schemas (e.g., `read_file(path)`, `write_file(path, content)`, `run_bash(command)`).
- Send the initial `interactions.create` with tools; stream the response via SSE.
- Check `requiresAction`: if true, extract the `function_call` from `output.parts`, execute it locally (e.g., a file read returns the content string), and append the tool response as new input with `previousInteractionId`.
- Repeat until no more actions remain and the model generates the final text.
Trade-off: server state simplifies loops but limits custom context engineering (e.g., no easy token trimming). Fall back to full input arrays if you need that control. It also supports chaining: run a Deep Research agent, then switch to gemini-2.0-flash-exp for image generation on the results, as sketched below.
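A minimal sketch of that chaining pattern, assuming an interaction ID can be handed to a different model as the talk describes; the `deep-research` model name here is a placeholder, and the fields follow the example below:

```python
import google.genai as genai

client = genai.Client(api_key='...')

# Step 1: run the research agent; the server keeps the full research context.
research = client.interactions.create(
    model='deep-research',  # placeholder name for the Deep Research agent
    contents=[{'role': 'user', 'text': 'Research current EV battery chemistries.'}],
)

# Step 2: reuse the stored context with a different model for image generation.
image = client.interactions.create(
    model='gemini-2.0-flash-exp',
    previous_interaction_id=research.name,
    contents=[{'role': 'user', 'text': 'Generate a diagram of the top finding.'}],
)
```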
Hands-on coding agent example (Python), using the google-genai SDK. The constructor initializes `genai.Client(api_key=...)` and `model='gemini-2.0-flash'`; the `run()` method handles the loop:
```python
import google.genai as genai

class CodingAgent:
    def __init__(self, api_key, model='gemini-2.0-flash'):
        self.client = genai.Client(api_key=api_key)
        self.model = model
        self.tools = [  # read_file, write_file, run_bash schemas
            {'function_declarations': [{'name': 'read_file',
                                        'description': 'Read a file and return its contents.',
                                        'parameters': {'type': 'object',
                                                       'properties': {'path': {'type': 'string'}}}}]},
            # ... write_file, run_bash declared the same way
        ]

    def run(self, prompt):
        interaction = self.client.interactions.create(
            model=self.model,
            contents=[{'role': 'user', 'text': prompt}],
            tools=self.tools,
        )
        while interaction.state == 'requires_action':
            for part in interaction.output.parts:
                if part.function_call:
                    # Your impl: open(), subprocess.run(), etc.
                    result = self.execute_tool(part.function_call)
                    interaction = self.client.interactions.create(
                        model=self.model,
                        previous_interaction_id=interaction.name,
                        contents=[{'role': 'model',
                                   'function_response': {'name': part.function_call.name,
                                                         'response': {'content': result}}},
                                  {'role': 'user', 'text': ''}],  # empty user turn to continue
                    )
                    break  # re-check the new interaction's parts, not the stale ones
        return interaction.output.text
```
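A minimal `execute_tool` sketch to pair with the class above, including the parameter validation the quality check below calls for; the `function_call.name`/`function_call.args` field names are assumptions about the beta API:

```python
import subprocess

# Method of CodingAgent above; function_call field names are assumed.
def execute_tool(self, function_call):
    args = dict(function_call.args or {})  # validate params before executing
    name = function_call.name
    if name == 'read_file':
        if not args.get('path'):
            return 'error: missing path'
        with open(args['path']) as f:
            return f.read()
    if name == 'write_file':
        if not args.get('path'):
            return 'error: missing path'
        with open(args['path'], 'w') as f:
            f.write(args.get('content', ''))
        return f"wrote {args['path']}"
    if name == 'run_bash':
        if not args.get('command'):
            return 'error: missing command'
        out = subprocess.run(args['command'], shell=True,
                             capture_output=True, text=True, timeout=60)
        return out.stdout + out.stderr
    return f'error: unknown tool {name}'
```

Usage, loading the key from the environment as advised below:

```python
import os

agent = CodingAgent(api_key=os.environ['GEMINI_API_KEY'])  # never hardcode the key
print(agent.run('Read main.py and add a docstring to every function.'))
```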
Common mistake: leaking API keys (e.g., pushed to GitHub). Treat them as secrets and load them from environment variables. Test with the free tier (no credit card required). Quality check: the agent should read/write files and run bash reliably without hallucinating; validate tool parameters before execution.
Accelerate Development: Install Agent Skills for Auto-Code Gen
Manually coding agents wastes time; use agent skills (MCP standard) to let your IDE agent (Cursor, Aider, Claude Code) build them. Run `npx skills install @google/gemini-interactions-api` (or `npx @skills/sh install @google/gemini-interactions-api`) in the project dir. This pulls the GitHub repo google-gemini/gemini-skills, adding docs-aware context: model lists, agents, and tool combinations (Google Search + custom functions).
Agents auto-fetch the linked Markdown docs via their web tools, staying current without skill updates. Prompt your IDE agent: "Create a CodingAgent class with a constructor (genai client, model), a run method, and tools for file read/write/bash, using the Interactions API." It generates code like the above, aware of the latest features such as tool combination.
Test the installation: ask the agent "What skills do you have?" to confirm the Gemini Interactions skill is loaded. Works with Cursor, Aider, the Gemini CLI, and Claude Code. Trade-off: relies on the agent's web fetch (latency similar to a local file read); skills shine for tasks models are unreliable at, such as exact API syntax.
Before/after: manual, ~30 min debugging protos; with skills, a 2-minute prompt yields a working agent. Prerequisite: an API key from ai.google.dev (free, Gmail signup). This fits early prototyping; scale up to custom skills for your preferences (e.g., always test with Bun).
Real-Time Multimodal Conversations: Gemini Live API WebSockets
For voice/video agents, switch to the Live API: a bidirectional WebSocket at wss://live-aio.google.dev/v1/{session_id}. It supports gemini-2.0-flash-live: <500 ms latency, native audio/video input, and interleaved streaming (audio out alongside tool calls).
Setup workflow:
- Generate a session: `POST /live/sessions` with the model.
- Connect the WebSocket and send the JSON config: `session_update` with instructions/context/tools (see the sketch after this list).
- Stream user audio (Web Audio API → Opus encode); receive `response_audio` chunks.
- Handle tool calls server-side and send results back over the WebSocket.
- Compress context: use `context_window_compression` to summarize history.
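A minimal sketch of such a config message, assuming the field names used in the list above (`session_update`, `context_window_compression`); the exact Live API message schema may differ:

```python
import json

# Hypothetical session_update payload; field names follow the workflow above.
config = {
    'session_update': {
        'instructions': 'You are a voice assistant for song requests.',
        'tools': [{'function_declarations': [
            {'name': 'play_song',
             'parameters': {'type': 'object',
                            'properties': {'title': {'type': 'string'}}}}]}],
        'context_window_compression': {'enabled': True},  # summarize long histories
    }
}
# ws.send(json.dumps(config))  # first message after the WebSocket connects
```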
Demo: Live Jukebox—user speaks song request, agent generates music via tools (e.g., Suno API), streams audio response. Multimodal grounding: Audio input transcribed + analyzed (speaker ID, emotion). Personalization: Load user prefs into session.
Python WebSocket implementation snippet (using the websocket-client package):
```python
import json
import websocket  # pip install websocket-client

def handle_live_msg(msg): ...  # your impl: parse audio chunks / tool calls

ws = websocket.WebSocketApp(
    "wss://live-aio.google.dev/v1/...",
    on_message=lambda ws, raw: handle_live_msg(json.loads(raw)),
)
# Send audio: ws.send(json.dumps({'audio': base64_opus_data}))
ws.run_forever()  # blocks; add reconnect logic for production
```
Use it for customer support (the GetYourGuide example), with async polling/webhooks for long-running tasks. Trade-off: WebSocket connections are fragile; use session management and reconnect logic. Quality: low latency beats turn-based; test for end-to-end latency under 1 s.
Real-world: glasses integration (Vision Claw + Ray-Ban SDK proxies phone audio to the Live API). Avoid holding HTTP requests open for tasks over 10 s; poll or use webhooks instead, as sketched below.
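A minimal polling sketch for such long-running tasks; the status URL and the `state`/`result` fields are hypothetical, illustrating the pattern rather than a documented endpoint:

```python
import time
import requests

def poll_task(status_url, api_key, interval=2.0, timeout=300):
    """Poll a long-running task instead of holding one HTTP request open."""
    # status_url and the 'state'/'result' fields are hypothetical.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(status_url, headers={'x-goog-api-key': api_key})
        body = resp.json()
        if body.get('state') == 'done':
            return body.get('result')
        time.sleep(interval)
    raise TimeoutError('task did not finish in time')
```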
Notable Quotes:
- "Using the serverside state, the server keeps the context. So the chances for your cache hit rate is much higher. And we see like two to three times better cache rates." — Philipp Schmid, on Interactions API benefits.
- "The interactions API is a new API we launched in December and beta which hopefully will succeed generate content soon. It's a unified API to use with models uh with agents." — Philipp Schmid, introducing the API.
- "We have the Gemini interactions API. And here you can either pick the the first command or the second command depending on what you want." — Philipp Schmid, on skill installation.
- "It becomes very helpful when you build agents where you have a loop and always need to append new user input." — Philipp Schmid, on stateful chats.
- "Keeping HTTP requests or connections open for I would say more than like 10 seconds is not a very good practice." — Philipp Schmid, on async execution.
Key Takeaways
- Get a free API key at ai.google.dev; install the Interactions skill via `npx skills install @google/gemini-interactions-api` for instant agent code gen.
- Build tool loops with `interactions.create` + `previousInteractionId`: execute functions locally, repeat until text output.
- Prioritize gemini-2.0-flash for coding/agentic tasks; combine built-in tools (Search) with custom ones.
- For voice: Use Live API WebSockets for <500ms multimodal streaming; compress context for long sessions.
- Cache wins come from server-side state; avoid client-side history munging. Use polling/webhooks for async agents.
- Test E2E: file ops, bash, audio in/out. Common pitfall: skipping tool param validation.
- Prototype fast with IDE agents; productionize with sessions, reconnection.
- Glasses/AR ready: Proxy phone audio to Live API for wearable agents.