Redis Memory Splits for Fast Voice AI Agents
Use the Redis Agent Memory Server's working/long-term memory split, parallel async fetches, bounded retrieval (top 1 of 5 results, truncated to 200 chars), and semantic routing to make voice AI feel personal and responsive within a 2-second latency budget.
Working/Long-Term Memory Split Prevents Noisy Context
Store journal entries as episodic long-term memories tied to user_id, session_id, namespace="voice-journal", and topics=["journal", "voice_entry"] so they survive across sessions: ClientMemoryRecord(text=transcript, memory_type=MemoryTypeEnum.EPISODIC, ...), then await client.create_long_term_memory(memories=[memory], deduplicate=True). Keep conversational back-and-forth in session-scoped working memory only. This split filters out voice filler such as pauses and self-corrections, avoiding noisy retrieval that slows generation or muddles responses.

Retrieval uses semantic search with filters={"namespace": {"eq": "voice-journal"}, "user_id": {"eq": user_id}}, limit=5, and distance_threshold=0.8, then keeps only the top result truncated to 200 chars: text = memories[0].get("text", "")[:200]. The result: focused context for voice replies that need one relevant anchor, not a full history dump.
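A minimal sketch of both halves, assuming the agent-memory-client package exposes the calls quoted above; the search method name (search_long_term_memory), its filters keyword, and the result shape are assumptions, so treat this as illustrative rather than the server's exact API.

```python
# Sketch: episodic storage + bounded retrieval with the Redis Agent Memory
# Server client. Follows the calls quoted in the text; the search call and
# result shape below are assumptions, not confirmed API.
from agent_memory_client import MemoryAPIClient
from agent_memory_client.models import ClientMemoryRecord, MemoryTypeEnum


async def store_journal_entry(client: MemoryAPIClient, transcript: str,
                              user_id: str, session_id: str) -> None:
    # Episodic long-term memory: survives the session, scoped by namespace.
    memory = ClientMemoryRecord(
        text=transcript,
        memory_type=MemoryTypeEnum.EPISODIC,
        user_id=user_id,
        session_id=session_id,
        namespace="voice-journal",
        topics=["journal", "voice_entry"],
    )
    await client.create_long_term_memory(memories=[memory], deduplicate=True)


async def search_memories(client: MemoryAPIClient, query: str,
                          user_id: str) -> str:
    # Bounded retrieval: over-fetch 5 candidates, keep only the best match,
    # truncated to 200 chars, so the voice prompt stays one short anchor.
    results = await client.search_long_term_memory(
        text=query,
        filters={"namespace": {"eq": "voice-journal"},
                 "user_id": {"eq": user_id}},
        limit=5,
        distance_threshold=0.8,
    )
    memories = [m.model_dump() for m in results.memories]  # assumed shape
    return memories[0].get("text", "")[:200] if memories else ""
```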
Parallel Async Fetches and Streaming Slash Perceived Latency
Fetch conversation_context, long-term memories, and calendar_context concurrently via asyncio.gather(fetch_conversation(), search_memories(), fetch_calendar()) so users experience the total delay as one short pause rather than a chain of sequential waits. Use streaming STT/TTS APIs: for TTS, async with self.async_client.text_to_speech_streaming.connect(model="bulbul:v3", ...) as ws: await ws.convert(text); async for message in ws: yield base64.b64decode(message.data.audio). Streaming delivers the first audio byte well before full synthesis completes, which makes the assistant feel alive: voice UX prioritizes time-to-first-sound over total completion time. Intentionally limit responses to 1-2 sentences to cut model/TTS time and maintain conversational rhythm; a long reply that lands after a pause feels dumb.
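A sketch of the concurrent fetch and the streaming TTS loop, following the calls quoted above; the VoiceAgent class, its collaborators, and the websocket message shape are assumptions for illustration.

```python
import asyncio
import base64


class VoiceAgent:
    """Hypothetical wiring for the parallel-fetch + streaming-TTS pattern."""

    def __init__(self, async_client, memory, calendar):
        self.async_client = async_client  # streaming TTS client (assumed API)
        self.memory = memory              # wraps working/long-term lookups
        self.calendar = calendar          # external calendar integration

    async def build_context(self, user_id: str, query: str):
        # Run all three lookups concurrently: the user waits for the
        # slowest fetch, not the sum of all three.
        return await asyncio.gather(
            self.memory.fetch_conversation(user_id),      # working memory
            self.memory.search_memories(query, user_id),  # bounded recall
            self.calendar.fetch_calendar(user_id),        # calendar context
        )

    async def stream_tts(self, text: str):
        # Yield audio chunk-by-chunk; the first yield is the
        # time-to-first-sound the listener actually perceives.
        async with self.async_client.text_to_speech_streaming.connect(
            model="bulbul:v3",
        ) as ws:
            await ws.convert(text)
            async for message in ws:
                yield base64.b64decode(message.data.audio)
```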
Semantic Routing Bypasses LLM for Fast Intent Detection
Route utterances ("log this", "what I said yesterday", "calendar") with RedisVL's semantic router instead of LLM classification, reserving model cycles for generating responses; see the sketch below. This keeps the top of the pipeline fast, before any memory retrieval runs. Redis earns its place here through namespaces for isolation, user_id filtering, episodic/semantic memory types, and semantic retrieval rather than keyword matching, treating memory as a performance-sensitive context service. The tradeoff: bounded retrieval can miss edge cases, but it keeps prompts concise enough for voice-scale latency.
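A minimal routing sketch with RedisVL's SemanticRouter; the route names, reference phrases, and distance thresholds are illustrative, and it assumes a running Redis plus RedisVL's default vectorizer.

```python
from redisvl.extensions.router import Route, SemanticRouter

# Each route carries example utterances; incoming speech is matched by
# vector similarity, with no LLM call in the loop.
routes = [
    Route(name="log_entry",
          references=["log this", "save a journal entry", "note this down"],
          distance_threshold=0.7),
    Route(name="recall",
          references=["what I said yesterday", "remind me what I logged"],
          distance_threshold=0.7),
    Route(name="calendar",
          references=["what's on my calendar", "do I have meetings today"],
          distance_threshold=0.7),
]

router = SemanticRouter(
    name="voice-intents",
    routes=routes,
    redis_url="redis://localhost:6379",
)

match = router("can you log this for me?")
if match.name == "log_entry":
    ...  # store the transcript as an episodic long-term memory
```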