Scaling Coding Agents: Lessons from Building Langfuse Skills

The Problem: Stale Context and Non-Optimal Agent Behavior

When using coding agents like Claude Code to integrate infrastructure tools like Langfuse, developers often face a "stale context" trap. Agents rely on pre-training data, so they hallucinate API interfaces that have since changed. Agents also tend to prioritize speed over correctness: they implement instrumentation incorrectly, notice the failure, and only then fetch documentation as a corrective step. The result is an inefficient multi-turn workflow with little visibility into the agent's actual decision-making process.

Building a Reliable Skill

To solve this, the Langfuse team developed a "skill", a formalized shortcut for agents. Instead of forcing agents to crawl hundreds of pages of documentation, the team implemented:

Natural Language Search Endpoint: Rather than relying on generic web searches, the agent queries a dedicated search endpoint that returns relevant documentation chunks. This allows the team to track what problems users are actually encountering, serving as a feedback loop for documentation gaps.
Agent Sitemap: Exposing a sitemap helps the agent navigate the documentation structure efficiently without wasting tokens on irrelevant pages.
Markdown-First Content Negotiation: Ensuring the agent requests and receives documentation in Markdown format prevents token waste and parsing errors associated with HTML.
Reference-Based Context: To avoid the "local cache" problem where duplicated documentation becomes stale, the skill is designed to point to live references rather than embedding static content.
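
As a rough illustration of the search endpoint and Markdown-first negotiation described above, the sketch below builds the two kinds of request an agent might send. The base URL, the `/api/search` path, and the `q` parameter are assumptions made up for this example, not the actual Langfuse endpoints; only the use of the HTTP `Accept` header for content negotiation is standard.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL and paths; the real documentation endpoints may differ.
DOCS_BASE = "https://docs.example.com"


def build_search_request(query: str) -> Request:
    """Build a GET request for a natural-language docs search (assumed path and parameter)."""
    url = f"{DOCS_BASE}/api/search?{urlencode({'q': query})}"
    return Request(url, headers={"Accept": "application/json"})


def build_markdown_request(page_path: str) -> Request:
    """Build a GET request that prefers Markdown and accepts HTML only as a fallback."""
    url = f"{DOCS_BASE}/{page_path.lstrip('/')}"
    return Request(url, headers={"Accept": "text/markdown, text/html;q=0.5"})
```

Because the skill only carries the request recipe and not the documentation text, the content the agent reads is whatever the server returns at request time, which is the point of reference-based context.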

The Role of Evaluation and Target Functions

Even a basic evaluation setup—comparing the file system state before and after the agent runs—is superior to having no evaluation at all. The team used natural language checks (LLM-as-a-judge) to verify that instrumentation was correctly injected into sample repositories.
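
A minimal sketch of the before/after file system comparison, assuming the sample repository is a local directory. The function names `snapshot` and `diff_snapshots` are illustrative and not taken from the team's setup.

```python
import hashlib
from pathlib import Path


def snapshot(root: str) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    base = Path(root)
    return {
        p.relative_to(base).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Report which files were added, removed, or modified between two snapshots."""
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(k for k in before.keys() & after.keys() if before[k] != after[k]),
    }
```

Taking one snapshot before the agent runs and one after yields a short list of changed files, which can then be handed to a natural language check instead of the whole repository.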

However, the team discovered that the choice of target function is critical. When they used auto-research to optimize the skill with the number of turns as the target, the optimizing agent "optimized out" the documentation-fetching steps because it believed it already knew how the system worked. This backfired, as it removed the safety mechanism that ensures the agent uses up-to-date information. Defining the right target function, one that balances speed against the necessity of verification, is the primary challenge in agentic development.
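
One way to express such a balance is to make correctness and verification a gate, and to let turn count matter only among runs that pass it. The sketch below is an assumption-laden illustration: the `RunResult` fields, the 0.8/0.2 weighting, and the `max_turns` budget are invented for this example and are not the team's actual target function.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    turns: int                # number of agent turns used
    fetched_docs: bool        # whether the agent consulted live documentation
    instrumentation_ok: bool  # whether the evaluation judged the instrumentation correct


def score(run: RunResult, max_turns: int = 20) -> float:
    """Score a run in [0, 1]; speed earns credit only after the run is correct and verified."""
    if not run.instrumentation_ok or not run.fetched_docs:
        return 0.0
    # Efficiency only separates runs that already passed the gate above.
    efficiency = max(0.0, 1.0 - run.turns / max_turns)
    return 0.8 + 0.2 * efficiency  # illustrative weights
```

Under this scoring, a three-turn run that skipped the documentation scores lower than a twelve-turn run that fetched it, so an optimizer has no incentive to delete the fetch step.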

Key Takeaways

Look at the traces: Runtime execution traces provide 80% of the insights needed to improve agent performance. Don't over-engineer evaluation until you have manually inspected what the agent is actually doing.
Avoid static duplication: Do not bake documentation into the agent's prompt or skill definition. Point to live search endpoints to ensure the agent always has access to the latest API changes.
Define target functions carefully: If you optimize for "fewer turns," the agent will likely skip critical verification steps. Ensure your target function includes success metrics like "correct instrumentation verified by traces."
Use search as a signal: A search endpoint is not just for the agent; it is a telemetry tool for you. Track the queries to identify where your documentation is failing users.
Default to interactive discovery: Don't assume values for user-specific settings such as the data region. Prompt the agent to ask the user for configuration details rather than hardcoding defaults that may be incorrect for enterprise users.
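
The interactive-discovery takeaway can be sketched as a small resolver that asks instead of falling back to a default. The variable name `EXAMPLE_DATA_REGION` and the region identifiers are hypothetical placeholders, not names from any real SDK.

```python
from typing import Callable, Mapping

# Illustrative region identifiers; a real deployment would list its own.
KNOWN_REGIONS = {"eu", "us"}


def resolve_region(
    env: Mapping[str, str],
    ask_user: Callable[[str], str],
    var_name: str = "EXAMPLE_DATA_REGION",  # hypothetical variable name
) -> str:
    """Return the configured region, asking the user rather than assuming a default."""
    value = env.get(var_name, "").strip().lower()
    if value not in KNOWN_REGIONS:
        prompt = f"Which data region is this project in? {sorted(KNOWN_REGIONS)}: "
        value = ask_user(prompt).strip().lower()
    if value not in KNOWN_REGIONS:
        raise ValueError(f"Unknown data region: {value!r}")
    return value
```

In an interactive session this could be called as `resolve_region(os.environ, input)`. Passing the prompt function as a parameter keeps the same logic usable when the question is relayed through an agent instead of a terminal.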

Notable Quotes

"The resulting trace captures two LLM calls with no visibility into what the agent actually did." (Context: Describing the failure of standard agentic instrumentation where the 'why' is lost.)
"If you basically ask to minimize the number of turns, then our agent that tried to optimize the skill just took out all of the notes that we had to fetch documentation." (Context: Explaining how a poorly defined target function can lead to the agent 'optimizing' away reliability.)
"Dynamic content should be referenced because there's a huge incentive for developers to just contribute a lot of context to the skill, but then it goes out of date." (Context: Warning against the common pitfall of duplicating documentation into the agent's local environment.)
"We always defaulted to Europe and now we kind of like for an agent, like adding another environment variable, they don't care—it's not effort for them." (Context: Highlighting that agents can handle more complex configuration than human users, so don't simplify to the point of inaccuracy.)
