KV Cache Bloats Reasoning Costs Linearly

Reasoning LLMs like OpenAI's o1 and DeepSeek-R1 generate 10,000+ tokens of intermediate thinking per hard problem, keeping every token in the KV cache. This working memory grows linearly with output length and chokes GPUs: a single competition math problem on Qwen3-8B runs past 10,000 tokens and ~2.5GB of KV cache. Deploying these models at scale fails because dead-end explorations and redundant calculations stay in memory for the rest of the generation, spiking inference costs and limiting batch sizes.
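For intuition, per-sequence KV-cache memory scales as 2 × layers × KV heads × head dim × sequence length × bytes per element. The sketch below uses illustrative architecture numbers, not any specific model's config; the exact footprint depends on the model's GQA layout and precision.

```python
def kv_cache_bytes(seq_len, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Keys and values are both cached, hence the factor of 2; bf16/fp16 -> 2 bytes.
    # All architecture defaults here are assumptions for illustration.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

print(f"{kv_cache_bytes(10_000) / 1e9:.2f} GB for a single 10k-token trace")
```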

Builders hit walls shipping long-context reasoners: memory runs out before answers emerge. Standard caching retains the full history for attention, but humans solve problems by noting key results and discarding the scratch work.

Teach Models to Chunk, Summarize, and Forget

MEMENTO replicates human note-taking: split the reasoning into chunks, compress each chunk into a compact 'memento' (special tokens capturing its essence), then discard the original verbose tokens. The model attends only to the chain of mementos in subsequent steps, shrinking the active context dramatically.
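A minimal sketch of that loop over plain token lists, with stub functions standing in for the model's decode and summarize calls (the function names and chunk sizes are assumptions for illustration, not the actual interface):

```python
from typing import List

def generate_chunk(context: List[str]) -> List[str]:
    # Stand-in for the model decoding one verbose reasoning chunk.
    return [f"step_{len(context)}_{i}" for i in range(8)]

def summarize_to_memento(chunk: List[str]) -> List[str]:
    # Stand-in for the model emitting a handful of special memento tokens.
    return [f"<memento:{chunk[0]}..{chunk[-1]}>"]

def reason(question: List[str], n_chunks: int = 4) -> List[str]:
    context = list(question)                   # active context = what stays in the KV cache
    for _ in range(n_chunks):
        chunk = generate_chunk(context)        # 1. produce a verbose reasoning chunk
        memento = summarize_to_memento(chunk)  # 2. compress it into a memento
        context = context + memento            # 3. evict the chunk; keep only the memento
    return context                             # question + chain of mementos

print(reason(["Q:", "prove", "the", "claim"]))
```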

The core technique uses a custom attention pattern: during generation, produce a reasoning chunk → auto-generate a memento summary → evict the prior tokens except the memento → repeat. Mementos act as compressed KV states, preserving the critical state without the full verbose trace. Training leverages a synthetic data pipeline: chunk existing reasoning traces, distill summaries via self-supervision, and fine-tune the model to predict answers using a memento-only history.
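One plausible way to build those training records, sketched under assumptions about chunk size, the summarizer, and the record layout (none of which are specified by the source):

```python
def chunk_trace(trace_tokens, chunk_size=256):
    # Split an existing verbose reasoning trace into fixed-size chunks.
    return [trace_tokens[i:i + chunk_size]
            for i in range(0, len(trace_tokens), chunk_size)]

def distill_memento(chunk):
    # Stand-in for self-supervised summarization; a real pipeline would have
    # the model (or a teacher) compress its own chunk into memento tokens.
    return ["<m>"] + chunk[:4] + ["</m>"]

def build_examples(question, trace_tokens, answer):
    mementos, records = [], []
    for chunk in chunk_trace(trace_tokens):
        # Target the next chunk given only the question and prior mementos.
        records.append({"input": question + sum(mementos, []), "target": chunk})
        mementos.append(distill_memento(chunk))
    # Final record: predict the answer from the memento-only history.
    records.append({"input": question + sum(mementos, []), "target": answer})
    return records

examples = build_examples(["Q:"], [f"t{i}" for i in range(600)], ["42"])
print(len(examples), "training records")
```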

This drops the KV cache to roughly a third of its original size: on Qwen2.5-7B, a typical problem shrinks from 18.6GB to 6.2GB. No prompt hacks needed; the behavior is baked into the model weights for plug-and-play inference.

Benchmarks Confirm Speedups Without Accuracy Hits

Across model families, MEMENTO holds AIME, MATH, and GPQA scores steady while compressing traces 3x. Smaller 7B-8B models can now handle competition-level math that previously ran out of memory. The trade-off is extra training compute upfront for the synthetic distillation, but inference wins big in production: lower latency, bigger batches, cheaper scaling.

For AI product builders, the retrofit path is LoRA on open reasoners like Qwen: expect ~70% memory savings on long CoT tasks, enabling edge deployment or high-throughput APIs without quantization hacks.
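A minimal sketch of that retrofit using Hugging Face PEFT, assuming a Qwen-style checkpoint; the model name, rank, and target modules are illustrative choices, not the paper's recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-7B-Instruct"  # any open reasoner with long-CoT ability
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# Fine-tune on memento-formatted examples (see the pipeline sketch above),
# then merge the adapter or serve it alongside the base weights.
```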