Agents Fail on Context Growth, Not Prompts
AI agents like Alyx (built on Arize observability traces) enter a vicious loop: spans grow with tool calls and metadata, hit context limits, fail, retry, and accumulate even more data. This self-analysis trap—an agent operating on its own traces—demands strategic context engineering, not prompt tweaks. Key claim: agents succeed by remembering essentials (head for setup, tail for recency) and forgetting noise, prioritizing product/UX impact, since poor context yields unusable outputs.
Naive truncation (keeping only the first 100 chars) works briefly but shatters reasoning: follow-ups treat prior inputs as new—e.g., "tell me more about input B" fails after the initial analysis because B was cut. Summarization hands control to the LLM, yielding inconsistent results with no guarantee that key details survive.
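The failure mode above can be made concrete with a minimal sketch (the `naive_truncate` helper and the sample context are illustrative, not the article's code): anything past the cutoff—including "input B"—is silently dropped, so a follow-up about it has nothing to reference.

```python
def naive_truncate(context: str, limit: int = 100) -> str:
    """Naive truncation: keep only the first `limit` characters."""
    return context[:limit]

# Toy conversation context: setup, then analyses of inputs A and B.
context = (
    "System prompt and setup. "
    + "Tool output for input A. " * 4
    + "Tool output for input B."
)

truncated = naive_truncate(context)
# "input B" sits past char 100, so it vanishes from the truncated context;
# a follow-up question about input B now reads as a brand-new topic.
```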
Head/Tail Truncation + Memory Escapes Limits
Preserve the head (first 100 chars: system prompt, early setup) and the tail (last 100 chars: latest tool results, recency) in active context; compress the middle (duplicates, long tool calls) into a retrievable memory store addressed by IDs and previews. The agent pulls specifics back as needed, avoiding resets and retaining control. This held up for months in production and mirrors Claude Code's truncation/compression—validating that there is no "secret sauce" beyond the basics.
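A minimal sketch of the scheme, under stated assumptions: `MEMORY`, `head_tail_compress`, and `retrieve` are hypothetical names, the store is an in-process dict rather than a real database, and the 100-char head/tail budgets are the article's heuristic, not a tuned value.

```python
import uuid

MEMORY: dict[str, str] = {}  # mem_id -> full middle text, retrievable on demand

def head_tail_compress(context: str, head: int = 100, tail: int = 100) -> str:
    """Keep the head (setup) and tail (recency); stash the middle in memory,
    leaving an ID + preview placeholder the agent can follow up on."""
    if len(context) <= head + tail:
        return context  # nothing to compress
    middle = context[head:-tail]
    mem_id = uuid.uuid4().hex[:8]
    MEMORY[mem_id] = middle
    preview = middle[:40].replace("\n", " ")
    placeholder = f' [stored as {mem_id}: "{preview}..." -- retrieve by ID] '
    return context[:head] + placeholder + context[-tail:]

def retrieve(mem_id: str) -> str:
    """Pull a compressed middle back into active context when needed."""
    return MEMORY[mem_id]
```

The placeholder keeps a breadcrumb in active context, so a follow-up like "tell me more about input B" can resolve via `retrieve` instead of hitting a hard reset.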
Context defines what the model sees per call; memory defines what persists across calls. Trade-off: fixed heuristics (100-char cutoffs) lack principled budgets and metrics, relying on evals for validation.
Scale with Sub-Agents and Long-Session Evals
Long chats (now 20+ turns vs. <10 initially) accumulate failures late; users resist restarts, traversing app pages in a single session. Counter with sub-agents: the main agent keeps a lean chat context and delegates data-heavy tasks (e.g., searching 100s of spans) to sub-agents that hold the full history and results. Post-task, only distilled results flow back; the main agent retrieves details from memory if needed. This keeps the main context lean and absorbs search/query overload.
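The delegation pattern can be sketched as follows (a toy illustration, assuming the source's terms: `sub_agent_search`, the span structure, and the hardcoded query are hypothetical; a real sub-agent would run its own LLM loop over the heavy data):

```python
def sub_agent_search(query: str, spans: list[dict]) -> dict:
    """Sub-agent: holds the heavy span data in its own context,
    returns only a distilled result to the caller."""
    hits = [s for s in spans if query in s["name"]]
    return {"summary": f"{len(hits)} matching spans", "top_hits": hits[:3]}

def main_agent(user_msg: str, spans: list[dict], chat_history: list[dict]) -> list[dict]:
    """Main agent: stays lean -- delegates the search, records only the summary."""
    chat_history.append({"role": "user", "content": user_msg})
    result = sub_agent_search("llm_call", spans)  # heavy work stays in the sub-agent
    chat_history.append({"role": "tool", "content": result["summary"]})
    return chat_history
```

The main context only ever grows by the short summary line, regardless of how many spans the sub-agent waded through.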
Test via long-session evals: load 10 turns, probe the 11th for recall—surfacing bugs proactively instead of via user reports. Ongoing: build long-term memory for cross-chat references; huge inputs still hit provider limits, pushing further sub-agent splits. Context selection remains heuristic; the future needs quality metrics.
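A long-session eval harness might look like this sketch (assumptions: `eval_long_session` and the `toy_agent` stub are invented for illustration; a real harness would replay recorded turns against the production agent):

```python
def eval_long_session(agent, turns: list[str], probe: str, expected_fragment: str) -> bool:
    """Replay N scripted turns, then probe turn N+1 and check recall."""
    history: list[dict] = []
    for turn in turns:
        history = agent(turn, history)
    reply = agent(probe, history)[-1]["content"]
    return expected_fragment in reply

def toy_agent(user_msg: str, history: list[dict]) -> list[dict]:
    """Stub agent with unbounded history and perfect recall, for testing the harness."""
    history = history + [{"role": "user", "content": user_msg}]
    if user_msg.startswith("tell me more about"):
        topic = user_msg.removeprefix("tell me more about ").strip()
        seen = any(topic in h["content"] for h in history[:-1])
        reply = f"recalling {topic}" if seen else "I have no record of that"
    else:
        reply = f"analyzed {user_msg}"
    history.append({"role": "assistant", "content": reply})
    return history
```

Swapping `toy_agent` for the real agent turns the harness into a regression test: if a context-compression change drops turn 2, the probe at turn 11 fails before any user notices.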