Build Long-Term Multimodal Memory for Personalized Agents
Use Vertex AI's Memory Bank service with Agent Engine to extract facts from chats and media via Gemini, store them semantically with embeddings, and auto-retrieve them with the preload memory tool for context-aware agents across sessions.
Separate Session and Memory Services to Handle Short- vs Long-Term Recall
Session services manage active chats, enabling resumption of live conversations with short-term state (session and user-profile data) that survives restarts. Memory services act as a long-term archive, processing and storing facts from multiple conversations and multimodal inputs (text, images, audio, video) for semantic search. Avoid mixing them: sessions handle working memory during a chat; memory services build a persistent knowledge base. For testing, use a simple in-memory service with keyword search (it doesn't persist across restarts). For production, deploy the Vertex AI Memory Bank service, which uses cloud storage, Gemini for fact extraction, and embeddings for meaning-based retrieval, so a query like "two-wheeled vehicle" matches stored "bicycle" notes.
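A minimal sketch of this split, not using the real SDK: the `Session` and `InMemoryMemoryStore` classes below are illustrative stand-ins showing why keyword search is testing-only. A session holds short-term working state; the memory store archives facts across sessions, but its keyword matching cannot connect "two-wheeled vehicle" to "bicycle" the way embeddings can.

```python
class Session:
    """Short-term working state for one live conversation."""
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.messages: list[str] = []  # working memory during the chat

class InMemoryMemoryStore:
    """Long-term archive with naive keyword search (testing only:
    contents vanish on restart, and no semantic matching)."""
    def __init__(self):
        self.facts: dict[str, list[str]] = {}  # user_id -> fact strings

    def add_session(self, session: Session) -> None:
        # The production service would run Gemini fact extraction here;
        # this mock just archives the raw user messages.
        self.facts.setdefault(session.user_id, []).extend(session.messages)

    def search(self, user_id: str, query: str) -> list[str]:
        # Keyword overlap only: shared lowercase words, nothing more.
        words = set(query.lower().split())
        return [f for f in self.facts.get(user_id, [])
                if words & set(f.lower().split())]

store = InMemoryMemoryStore()
s = Session("u1")
s.messages.append("I love riding my bicycle along the coast")
store.add_session(s)

print(store.search("u1", "bicycle trips"))        # keyword hit
print(store.search("u1", "two-wheeled vehicle"))  # miss: no shared keyword
```

The second query returns nothing despite meaning the same thing, which is exactly the gap embedding-based retrieval closes.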
Configure the memory bank via Agent Engine by selecting: (1) a fact-extraction model (e.g., Gemini) to pull key details from content, and (2) an embedding model for semantic indexing. Define topics like "user preferences" or "travel experiences" to organize storage. This turns raw inputs into a searchable service, not just a database table.
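The two-model pipeline can be sketched as follows; `extract_facts` and `embed` are hypothetical toy stand-ins for Gemini fact extraction and a real embedding model, and the topic labels mirror the examples above.

```python
def extract_facts(text: str) -> list[str]:
    # Stand-in for the fact-extraction model: one "fact" per sentence.
    return [s.strip() for s in text.split(".") if s.strip()]

def embed(text: str) -> list[float]:
    # Stand-in for the embedding model: a toy bag-of-letters vector.
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v

TOPICS = ["user preferences", "travel experiences"]  # organizing labels

# Each entry pairs a topic, an extracted fact, and its vector: a
# searchable index rather than a raw database table.
index: list[tuple[str, str, list[float]]] = []
for fact in extract_facts("User likes historical architecture. User visited Lisbon."):
    index.append(("user preferences", fact, embed(fact)))

print([entry[1] for entry in index])
# → ['User likes historical architecture', 'User visited Lisbon']
```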
Ingest Sessions or Direct Media to Build Knowledge Base
Archive full conversations at session end with the memory service's session-archiving call (add_session_to_memory() in the ADK), which processes user messages, agent replies, and media references (images, videos, audio) to extract and store facts automatically. Alternatively, upload directly: send files with accompanying text context through the Memory Bank API to generate facts on the fly, even outside chats. Both methods build a multimodal knowledge base spanning days or weeks, so agents can recall user-shared details such as historical buildings from photos or seaside enjoyment from videos without manual tagging.
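The two ingestion paths can be mocked like this; `extract_fact_from_media`, `add_session_to_memory`, and `upload_media` are simplified stand-ins for what the real service does with Gemini, not its actual API.

```python
memory_bank: list[str] = []

def extract_fact_from_media(media_type: str, description: str) -> str:
    # Stand-in: the real service inspects image/audio/video content itself.
    return f"user shared {media_type}: {description}"

def add_session_to_memory(transcript: list[tuple[str, str]]) -> None:
    """Path 1: archive a finished conversation, media references included."""
    for kind, content in transcript:
        if kind == "text":
            memory_bank.append(f"user said: {content}")
        else:
            memory_bank.append(extract_fact_from_media(kind, content))

def upload_media(media_type: str, context: str) -> None:
    """Path 2: direct upload with text context, outside any chat."""
    memory_bank.append(extract_fact_from_media(media_type, context))

add_session_to_memory([("text", "I enjoy coastal towns"),
                       ("image", "photo of a historical building")])
upload_media("video", "clip of the sea")
print(memory_bank)
```

Either path ends in the same place: a growing pool of facts that later sessions can search.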
Auto-Retrieve Facts with Preload Tool for Personalized Responses
Attach the preload memory tool to your agent: it activates at the start of every turn, semantically searches the bank using the new user message, injects the top relevant facts into the prompt, and requires no custom agent logic. In the demo, Session A ingests a photo (historical building), a video (the sea), and an audio clip (a town); the stored facts become "likes historical architecture," "enjoys seaside," and "visited town." In a new Session B, the query "suggest a cultural destination based on my prior picture/video/audio" triggers retrieval and yields tailored recommendations, such as architecture-focused seaside spots. This achieves multimodal long-term recall, layered atop short-term session state for fully context-aware agents. Check the video description for setup code and demos.
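The per-turn preload step amounts to: embed the new message, rank stored facts by similarity, and prepend the top hits to the prompt. A toy sketch, where `embed` is a bag-of-letters stand-in for a real embedding model and the facts are the demo's examples:

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model.
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

FACTS = ["likes historical architecture", "enjoys seaside", "visited town"]

def preload(user_message: str, top_k: int = 2) -> str:
    """Runs before the agent sees the turn: retrieve, then inject."""
    q = embed(user_message)
    ranked = sorted(FACTS, key=lambda f: cosine(embed(f), q), reverse=True)
    context = "\n".join(f"- {f}" for f in ranked[:top_k])
    return f"Known user facts:\n{context}\n\nUser: {user_message}"

print(preload("suggest a cultural destination by the sea"))
```

The agent itself never calls retrieval; the tool rewrites each turn's prompt, which is why no custom agent logic is needed.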