Match Query Intent, Not Strings, for 30-40% Hit Rates

Enterprise AI support agents see the same intents phrased differently, such as EMI bounce penalties asked as "what’s the penalty if my EMI bounces?" vs. "will I get charged if my account doesn’t have enough funds?", and each rephrasing wastes an LLM call. Traditional exact-match caching yields only 2-5% hits since users rarely repeat strings verbatim. Semantic caching embeds queries into 1536D vectors (using text-embedding-3-small), computes cosine similarity against cached embeddings in Redis, and serves the cached response if similarity ≥ 0.75 (converted from Redis cosine distance: similarity = 1 - distance). This captures semantic equivalence: identical intents yield ~0.95 similarity (small vector angle), unrelated ones ~0.28 (near-perpendicular). Result: 30-40% of queries answered from cache without LLM inference, directly cutting costs.
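A minimal sketch of the embedding call and the hit decision, assuming the openai Python SDK; the helper names (embed, is_cache_hit) are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    """Embed a query into a 1536-dimensional vector with text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def is_cache_hit(redis_cosine_distance: float, threshold: float = 0.75) -> bool:
    """Redis returns cosine *distance*; convert to similarity before comparing."""
    similarity = 1.0 - redis_cosine_distance
    return similarity >= threshold
```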

Build Branching Pipeline with LangGraph State and Redis KNN

Use LangGraph's StateGraph with a TypedDict CacheState (query, embedding, cached_response, llm_response, cache_hit) and four nodes: embed_query (OpenAI embedding), similarity_search (Redis FT.SEARCH KNN 1 on a FLAT/HNSW vector index, DIM=1536, COSINE metric), a conditional route (cache_hit → END, else → call_llm → update_cache), and update_cache (HSET hash with a query prefix, storing query, response, and embedding). Schema: TextField("query"), TextField("response"), VectorField("embedding", FLAT, FLOAT32, DIM=1536, COSINE). Benchmarks on 15 queries across 5 intents show ~5s cold-start LLM latency vs. <0.5s warm-cache (91% improvement). Embed once and reuse the vector across nodes; KNN 1 returns the top match, and the threshold decides whether it counts as a hit.
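An end-to-end sketch of the branching pipeline under stated assumptions: redis-py against a RediSearch-enabled Redis, the langgraph and openai packages, and illustrative choices for the index name (semantic_cache), key scheme (cache:<hash>), and LLM model (gpt-4o-mini) — none of these specifics come from the benchmarks above.

```python
import numpy as np
import redis
from typing import Optional, TypedDict
from openai import OpenAI
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
from langgraph.graph import StateGraph, END

client = OpenAI()
r = redis.Redis()

# Create the vector index once: FLAT, FLOAT32, DIM=1536, COSINE, per the schema above.
try:
    r.ft("semantic_cache").create_index(
        (
            TextField("query"),
            TextField("response"),
            VectorField("embedding", "FLAT",
                        {"TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE"}),
        ),
        definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
    )
except redis.ResponseError:
    pass  # index already exists

class CacheState(TypedDict):
    query: str
    embedding: list[float]
    cached_response: Optional[str]
    llm_response: Optional[str]
    cache_hit: bool

def embed_query(state: CacheState) -> dict:
    # Embed once; the vector is reused by similarity_search and update_cache.
    resp = client.embeddings.create(model="text-embedding-3-small",
                                    input=state["query"])
    return {"embedding": resp.data[0].embedding}

def similarity_search(state: CacheState) -> dict:
    vec = np.array(state["embedding"], dtype=np.float32).tobytes()
    q = (Query("*=>[KNN 1 @embedding $vec AS dist]")
         .return_fields("response", "dist")
         .dialect(2))
    docs = r.ft("semantic_cache").search(q, query_params={"vec": vec}).docs
    # Convert Redis cosine distance to similarity and apply the 0.75 threshold.
    if docs and 1 - float(docs[0].dist) >= 0.75:
        return {"cache_hit": True, "cached_response": docs[0].response}
    return {"cache_hit": False, "cached_response": None}

def call_llm(state: CacheState) -> dict:
    # Model choice is an assumption; swap in the production LLM call.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": state["query"]}],
    )
    return {"llm_response": resp.choices[0].message.content}

def update_cache(state: CacheState) -> dict:
    key = f"cache:{abs(hash(state['query']))}"  # illustrative key scheme
    r.hset(key, mapping={
        "query": state["query"],
        "response": state["llm_response"],
        "embedding": np.array(state["embedding"], dtype=np.float32).tobytes(),
    })
    return {}

graph = StateGraph(CacheState)
graph.add_node("embed_query", embed_query)
graph.add_node("similarity_search", similarity_search)
graph.add_node("call_llm", call_llm)
graph.add_node("update_cache", update_cache)
graph.set_entry_point("embed_query")
graph.add_edge("embed_query", "similarity_search")
graph.add_conditional_edges(
    "similarity_search",
    lambda s: "hit" if s["cache_hit"] else "miss",
    {"hit": END, "miss": "call_llm"},
)
graph.add_edge("call_llm", "update_cache")
graph.add_edge("update_cache", END)
app = graph.compile()
```

Invoke with app.invoke({"query": "...", "embedding": [], "cached_response": None, "llm_response": None, "cache_hit": False}); a hit ends at similarity_search, a miss flows through call_llm and update_cache so the next paraphrase is served from cache.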

Tune Threshold with F1 Score to Balance Precision/Recall

The threshold trades precision (cache-served responses that are actually correct) against recall (the share of repeat-intent queries actually served from cache). A high threshold (e.g., 0.8) gives perfect precision but low recall; a low one (0.5) gives high recall but false positives (e.g., a loan-closure query matching EMI bounce at 0.52 similarity). F1 = 2 × (precision × recall) / (precision + recall) peaks at the optimum and punishes imbalance: 100% precision with 0% recall yields F1 = 0. Sweep thresholds over 20-30 labeled pairs (paraphrases vs. different intents), pick the F1 peak, and shift it upward for high-risk domains such as finance. Example: the EMI seed query matches a 0.71 paraphrase (hit) and rejects 0.52 and 0.63 unrelated/edge queries.
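A minimal sketch of the threshold sweep; the labeled set below reuses the similarity values quoted above as (similarity, same_intent) pairs and is purely illustrative:

```python
# Hand-labeled pairs: similarity score and whether the pair shares one intent.
labeled = [
    (0.95, True), (0.82, True), (0.71, True),    # paraphrases of the same intent
    (0.63, False), (0.52, False), (0.28, False), # different intents / edge cases
]

def f1_at(threshold: float) -> float:
    """Treat similarity >= threshold as a cache hit; score against labels."""
    tp = sum(1 for s, same in labeled if s >= threshold and same)
    fp = sum(1 for s, same in labeled if s >= threshold and not same)
    fn = sum(1 for s, same in labeled if s < threshold and same)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Sweep thresholds from 0.50 to 0.90 and report the F1 peak.
best = max((t / 100 for t in range(50, 91)), key=f1_at)
print(f"F1 peaks at threshold {best:.2f} (F1 = {f1_at(best):.2f})")
```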

Harden for Scale: TTL, Normalization, Invalidation

Tag entries by category and set TTLs accordingly (e.g., quarterly for product details, daily for policies). Normalize queries (lowercase, fix typos) before embedding to boost hits. Add session context for multi-turn conversations. Trigger invalidation on product/policy changes. FLAT is fine under 100k entries; scale to HNSW beyond that. This shifts agents from cost-prohibitive pilots to viable production at thousands of users.
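A minimal sketch of the normalization, TTL, and invalidation pieces; the category names, TTL values, typo map, and category-prefixed key scheme (cache:<category>:<id>, which differs from the flat keys in the pipeline sketch) are all illustrative assumptions:

```python
import redis

r = redis.Redis()

# Illustrative TTLs: quarterly for product entries, daily for policy entries.
TTL_BY_CATEGORY = {
    "product": 90 * 24 * 3600,
    "policy": 24 * 3600,
}

TYPO_FIXES = {"emi bouce": "emi bounce", "penality": "penalty"}  # example map

def normalize(query: str) -> str:
    """Lowercase, collapse whitespace, and fix common typos before embedding."""
    q = " ".join(query.lower().split())
    for typo, fix in TYPO_FIXES.items():
        q = q.replace(typo, fix)
    return q

def store(category: str, entry_id: str, mapping: dict) -> None:
    """Write a cache entry under its category prefix with a category TTL."""
    key = f"cache:{category}:{entry_id}"
    r.hset(key, mapping=mapping)
    r.expire(key, TTL_BY_CATEGORY.get(category, 7 * 24 * 3600))

def invalidate_category(category: str) -> None:
    """Drop all cached entries for a category when a product/policy changes."""
    for key in r.scan_iter(f"cache:{category}:*"):
        r.delete(key)
```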