Slash Claude Costs 90% with Prompt Prefix Caching

Prompt prefix caching in Anthropic's Claude API processes repetitive static content at 10% of the base input cost on cache hits. Automatic mode suits multi-turn chats, explicit breakpoints give fine-grained control, and cacheable prefixes must meet a per-model minimum of 1024-4096 tokens.

Implement Automatic Caching for Multi-Turn Chats

Add a top-level cache_control: {"type": "ephemeral"} to your Messages API request to automatically cache up to the last eligible block (in tools > system > messages order). In a growing conversation, each request reads the prior cache (up to 20 blocks back) and writes the new prefix, moving the breakpoint forward without manual updates. For example, request 1 caches system + user1 + asst1 + user2; request 2 hits the cache through user2, processes asst2 + user3 fresh, then caches up to user3. Use ttl: "1h" for a 1-hour lifetime at 2x the base input price. Combine with explicit blocks on static system/tools for hybrid control, limited to 4 breakpoints total. Edge cases: caching is skipped if the last block is ineligible, and requests error on TTL mismatch or slot exhaustion.
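A minimal sketch of the automatic-mode request body described above, written as a plain dict. The model name, message text, and the exact placement of the top-level cache_control field are assumptions; check the current API reference before relying on this shape.

```python
# Hypothetical Messages API request body with automatic caching enabled.
# On each turn, the API caches up to the last eligible block
# (tools > system > messages order) and reuses the prior turn's prefix.
request = {
    "model": "claude-sonnet-4-6",  # assumed model id
    "max_tokens": 1024,
    # Top-level marker enabling automatic prefix caching.
    # Per the text, add "ttl": "1h" here for the 1-hour tier at 2x write cost.
    "cache_control": {"type": "ephemeral"},
    "system": [{"type": "text", "text": "You are a support agent..."}],
    "messages": [
        {"role": "user", "content": "First question"},
        {"role": "assistant", "content": "First answer"},
        # The newest turn is processed fresh, then folded into the cache
        # so the next request's breakpoint moves forward automatically.
        {"role": "user", "content": "Follow-up question"},
    ],
}
```

Because the breakpoint advances on its own, the client simply appends turns; no per-request cache bookkeeping is needed.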

Explicit breakpoints via cache_control on specific blocks give precise control: place one at the end of the last identical static prefix (e.g., after tools/system/examples, before the varying user input). The system checks the hash at the breakpoint, then looks back up to 20 blocks for prior writes; it never auto-caches unwritten positions. Mistake to avoid: a breakpoint on changing content such as timestamps forces a full reprocess; fix it by marking the end of the stable prefix instead. Multiple breakpoints (max 4) cache layers independently (e.g., tools rarely, context daily), and the 20-block lookback restarts at each breakpoint, so writes older than 20 blocks can still hit.
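The placement rule above can be sketched with plain block dicts. The instruction text and date are illustrative; the key point is that the breakpoint sits on the last static block, with all per-request content after it.

```python
# Explicit breakpoint placement: mark the END of the stable prefix,
# never content that changes per request (timestamps, user input).
system = [
    {"type": "text", "text": "Long static instructions..."},
    # Breakpoint on the final static block: everything up to and
    # including this block is hashed and cached as one prefix.
    {
        "type": "text",
        "text": "Static few-shot examples...",
        "cache_control": {"type": "ephemeral"},
    },
]
messages = [
    # Varying content comes AFTER the breakpoint, so changing it
    # never invalidates the cached prefix.
    {"role": "user", "content": "Today is 2026-02-03. New question..."},
]
```

Putting the date inside the system prefix instead would change the hash every request and force a full reprocess, which is exactly the mistake the text warns about.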

Pricing Delivers 90% Savings on Hits

Cache writes cost 1.25x base input for the 5-minute TTL ($0.30-$18.75/MTok across models, e.g. Sonnet 4.6 at $3 base) and 2x for the 1-hour TTL ($0.50-$30/MTok); hits and refreshes cost 0.1x ($0.03-$1.50/MTok) and stack with batch discounts. Output pricing is unchanged ($1.25-$75/MTok). Minimums: 4096 tokens (Opus 4.6/4.5, Haiku 4.5), 2048 (Sonnet 4.6, Haiku 3.5), 1024 (others). Below the threshold, content is processed uncached with no error; pad static content to reach the minimum, since cached reads cost far less than fresh input. Total input = cache_read + cache_creation + input (post-breakpoint only). Example: reading 100k cached tokens plus 50 fresh input tokens costs roughly a tenth of processing all 100k fresh.
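The arithmetic behind the example can be worked through directly. This sketch assumes the $3/MTok Sonnet-class base rate quoted above; the multipliers (1.25x write, 2x 1-hour write, 0.1x read) come from the text.

```python
BASE = 3.00              # $/MTok base input (Sonnet-class, per the text)
WRITE_5M = 1.25 * BASE   # $3.75/MTok cache write, 5-minute TTL
WRITE_1H = 2.00 * BASE   # $6.00/MTok cache write, 1-hour TTL
READ = 0.10 * BASE       # $0.30/MTok on cache hits

def turn_cost(cache_read: int, cache_write: int, fresh: int) -> float:
    """Input cost in dollars for one request, given token counts."""
    return (cache_read * READ + cache_write * WRITE_5M + fresh * BASE) / 1e6

# The text's example: 100k-token cached prefix plus 50 fresh tokens.
hit = turn_cost(cache_read=100_000, cache_write=0, fresh=50)
cold = turn_cost(cache_read=0, cache_write=0, fresh=100_050)
print(f"hit: ${hit:.5f}  cold: ${cold:.5f}  savings: {1 - hit/cold:.0%}")
```

The hit costs about $0.030 versus $0.300 cold, i.e. the headline 90% savings; the one-time 1.25x write premium is recouped on the first hit.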

Avoid Pitfalls and Monitor Effectiveness

Cacheable content: tools, system, text, images, and tool results (in user/assistant turns); thinking and sub-blocks cannot carry breakpoints directly, but thinking caches indirectly as part of history. Invalidations: tool changes kill the entire cache; web search, citations, or speed toggles invalidate system + messages; tool_choice, image, and thinking parameter changes hit messages only. Non-tool user content strips prior thinking. Strategies: front-load static content and verify via the response usage fields: cache_creation_input_tokens (writes), cache_read_input_tokens (reads), input_tokens (fresh tail). If both creation and read are 0, the request missed the threshold or got no hit. Concurrent requests: the first writes, the others wait. TTL defaults to 5 minutes and refreshes free on each hit. Post-2026: caches are isolated per workspace (not per organization). Supports all active Claude models; ZDR eligible.
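The verification step above can be automated with a small helper that classifies a response's usage block. The field names come from the text; the usage dict shape and threshold message are assumptions for illustration.

```python
def cache_report(usage: dict) -> str:
    """Summarize cache effectiveness from a response's usage fields."""
    write = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    # Per the text: both zero means the prefix missed the minimum
    # token threshold or no breakpoint matched.
    if write == 0 and read == 0:
        return "no caching: below minimum token threshold or no hit"
    total = write + read + fresh
    return f"read {read}/{total} tokens from cache ({read / total:.0%} hit rate)"

print(cache_report({
    "cache_read_input_tokens": 95_000,
    "cache_creation_input_tokens": 2_000,
    "input_tokens": 3_000,
}))
```

Logging this per request makes regressions visible immediately, e.g. a timestamp accidentally placed before a breakpoint shows up as the hit rate collapsing to zero.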

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge