5 LLM Pitfalls Engineers Hit Building Agents
Context windows act like RAM—budget system prompts, history, tools, and retrieval tightly or agents degrade silently. Tokenize code/non-English workloads early; set temperature=0 for reproducibility; ground hallucinations with RAG/schemas/validation; measure RAG recall@10.
Budget Context Windows Like RAM to Avoid Gradual Degradation
Context windows are bounded buffers that hold everything the model sees: system prompt, conversation history, tool outputs, prior responses, and retrieval chunks, not just documents. Exceed the window and content gets truncated silently, with no error, so the agent loses access to key information without any signal. A coding agent that reads three medium-sized files plus tool responses burns 30-50K tokens before doing real work, even on a 200K-token model. Design agents to fetch data on demand rather than stuffing everything in up front: prioritize what the model needs right now, as in the sketch below. This budgeting discipline is what takes a demo to production; skipping it yields agents that pass toy tasks and fail on real work.
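A minimal sketch of that budgeting discipline, assuming a 200K window. The per-slot allocations are illustrative numbers, not recommendations for any particular model, and `count_tokens` is a hypothetical parameter standing in for your model's real tokenizer:

```python
# Minimal sketch of a context budget, treating the window like RAM.
# The 200K window and per-slot allocations are illustrative assumptions.

CONTEXT_WINDOW = 200_000
RESPONSE_RESERVE = 8_000  # headroom for the model's reply

BUDGET = {
    "system_prompt": 4_000,
    "tool_outputs": 40_000,
    "retrieval": 30_000,
    # whatever remains goes to conversation history
}

def history_budget() -> int:
    used = sum(BUDGET.values()) + RESPONSE_RESERVE
    return CONTEXT_WINDOW - used

def trim_history(messages: list[dict], count_tokens) -> list[dict]:
    """Drop oldest messages until history fits its slice of the window.

    `count_tokens` is any callable mapping text -> token count using the
    model's actual tokenizer (a hypothetical parameter here).
    """
    limit = history_budget()
    kept: list[dict] = []
    total = 0
    for msg in reversed(messages):  # keep the most recent turns
        tokens = count_tokens(msg["content"])
        if total + tokens > limit:
            break
        kept.append(msg)
        total += tokens
    return list(reversed(kept))
```

The point is not these specific numbers but that every slot is explicit: when the budget is visible, truncation becomes a design decision instead of a silent failure.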
Tokenize Workloads Precisely for Cost and Limits
Tokens are sub-word fragments produced by the model's tokenizer, not words or characters. Code eats tokens fast: brackets, underscores, indentation, and long identifiers push a 200-line Python file to roughly 3K tokens, not the 2K a word count suggests. Non-English text (Japanese, Arabic, Hindi) uses 2-4x more tokens than English for the same meaning, which breaks English-calibrated estimates. JSON and XML schemas add further overhead versus prose. Before launch, run representative samples through the exact tokenizer you will ship with, as below: word-count guesses underestimate costs severely.
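A quick pre-launch check, using OpenAI's tiktoken as one example tokenizer. The o200k_base encoding and the sample strings are illustrative assumptions; use whatever tokenizer your model actually ships with:

```python
# Count tokens with the model's real tokenizer instead of guessing from
# word counts. Uses OpenAI's tiktoken; other providers ship their own.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "japanese": "素早い茶色の狐がのろまな犬を飛び越える。",
    "python": "def get_user_by_id(user_id: int) -> dict:\n"
              "    return db.query(users).filter_by(id=user_id).first()",
}

for name, text in samples.items():
    tokens = enc.encode(text)
    # Words-per-token ratios differ sharply across the three samples,
    # which is exactly why word-count estimates mislead.
    print(f"{name}: {len(text.split())} words -> {len(tokens)} tokens")
```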
Tune Temperature for Reproducibility, Ground Hallucinations Systematically
Temperature controls how much probability mass goes to low-likelihood tokens, trading reproducibility for variety. Use 0 for tool calling, extraction, classification, and spec'd code generation, where correctness trumps creativity; higher values suit divergent generation (variants, brainstorming) but invite irreproducible bugs. Even temperature=0 is not fully deterministic, because API-side routing and batching introduce variance; tighter control requires seeded inference (first sketch below).

Treat hallucinations as pattern continuation from training data, not recall errors, and counter them systematically: ground generation with retrieval, constrain outputs with schemas or function calls, and validate after generation, which is critical whenever output triggers actions (second sketch below). Coding agents confidently invent plausible-looking APIs; prompting "don't hallucinate" and feedback loops won't fix this, so validate rigorously against what actually exists.
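For the reproducibility half, pin down what the API lets you pin down. A sketch using the OpenAI client; the seed parameter is best-effort there, and other providers expose different knobs:

```python
# Reduce sampling variance as far as the API allows: temperature=0 plus
# a seed. Even then, server-side routing/batching can still vary results,
# so treat this as reducing variance, not eliminating it.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    seed=12345,  # best-effort reproducibility, not a guarantee
    messages=[
        {"role": "system", "content": "Extract the invoice total as JSON."},
        {"role": "user", "content": "Invoice #4412: subtotal $90, tax $10."},
    ],
)
print(resp.choices[0].message.content)
```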
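For the grounding half, one cheap validation layer is a schema check before any action executes. A sketch using pydantic; the `ToolCall` fields and the tool allowlist are hypothetical:

```python
# Validate model output against a schema before acting on it.
# Rejecting malformed or invented tool calls is the last line of defense
# against confident hallucination in action-taking agents.
from pydantic import BaseModel, ValidationError

class ToolCall(BaseModel):
    tool: str
    path: str
    dry_run: bool = True

ALLOWED_TOOLS = {"read_file", "write_file"}  # illustrative allowlist

def validate_tool_call(raw_json: str) -> ToolCall | None:
    try:
        call = ToolCall.model_validate_json(raw_json)
    except ValidationError:
        return None  # malformed output: re-prompt or fail, never execute
    if call.tool not in ALLOWED_TOOLS:
        return None  # model invented a tool that doesn't exist
    return call
```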
Engineer RAG Retrieval, Not Just the LLM
RAG indexes your data and retrieves chunks at query time to feed the model context. The LLM half is the easy part; retrieval quality decides success, so engineer it: tune chunking, embeddings, hybrid search, reranking, and query rewriting. Diagnose with recall@10 on a held-out eval set (sketch below): poor retrieval, not the model, causes most RAG failures, yet teams blame the LLM without ever measuring it.
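A minimal recall@10 harness. The eval-set format and the `retrieve` callable are stand-ins for your own retriever, and the returned documents are assumed to carry an `id` attribute:

```python
# Measure retrieval before blaming the model: recall@10 over a held-out
# eval set of (query, set_of_relevant_doc_ids) pairs.

def recall_at_k(eval_set, retrieve, k: int = 10) -> float:
    """Fraction of queries where at least one relevant doc is retrieved.

    eval_set: list of (query, set_of_relevant_doc_ids) pairs.
    retrieve: callable (query, top_k) -> list of docs with an .id attribute.
    """
    hits = 0
    for query, relevant_ids in eval_set:
        retrieved_ids = {doc.id for doc in retrieve(query, top_k=k)}
        if retrieved_ids & relevant_ids:  # at least one relevant doc found
            hits += 1
    return hits / len(eval_set)

# If recall@10 is low, fix chunking/embeddings/reranking first;
# no prompt change can recover context that was never retrieved.
```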