Agent Drift Creates Invisible Gaps—Observability Closes Them

Agents degrade silently: models update, prompts tweak, edge cases pile up, widening the divide between intended behavior (the "platform") and reality (the "train"). Amy Boyd and Nitya Narasimhan analogize this to London's "Mind the Gap" signs, where platforms stay fixed but trains evolve, risking falls without constant checks. They advocate observability across the lifecycle—build, debug/production, multi-agent fleets—to measure quality, safety, and agentic performance like intent resolution and task adherence.

Key decision: Start evaluations early, not as an afterthought. Without baselines or datasets (common for new agents), manual setup wastes time. Tradeoff: Non-determinism means scores are probabilistic (e.g., 80-90% tool call accuracy), not binary, requiring continuous monitoring over snapshots.

"Agents are non-deterministic. That's not just a problem for demos. That's also a problem for real life when you actually get to production." —Amy Boyd, emphasizing why reliability demands tracing, evals, monitoring, and optimization, not just one-off tests.

Tracing with OpenTelemetry Unifies Multi-Tool Agent Workflows

Build observability from day zero using OpenTelemetry (OTel) standards for tracing. This captures full agent traces: tool calls, messages, decisions across workflows—even multi-agent systems where debugging explodes in complexity.

In Microsoft Foundry, instrument agents built anywhere (e.g., LangChain, Semantic Kernel) and centralize traces in the control plane. No vendor lock-in: OTel exports to Foundry or your tools. For a weather agent example:

  • User query: "What's the weather in London?"
  • Eval 1: Intent resolution (did it detect local weather request?).
  • Eval 2: Tool call correctness (right API? Parameters match?).
  • Eval 3: Task adherence/response quality.

This pinpoints failures: 95% intent match but 70% wrong tool? Tweak prompt there. Results: Developers debug faster, IT admins integrate with Azure Monitor for infra signals.

Tradeoffs: Traces balloon in multi-agent setups (exponential data), so Foundry aggregates fleet-wide views. Fork their GitHub repo (with dev containers for instant Codespaces setup) to replicate: pre-installed tools, notebooks, branches for evolving workshops.

"If you think about the change in trains and technology, but the platform doesn't change. So there's this sign on there because at each station it's different." —Amy Boyd, illustrating how fixed requirements clash with evolving agents, solved by per-step tracing.

Built-in and Custom Evaluators for Quality, Safety, Agentics

Foundry embeds 20+ evaluators, learned from platform-scale customer data:

  • Quality: Coherence, relevance, groundedness.
  • Safety/Risk: Toxicity, bias, PII leakage.
  • Agentic: Intent resolution, tool selection, task completion—holistic over single LLM calls.

No eval data? Custom evaluators chain LLMs (e.g., GPT-4o judges outputs). Run continuous (code changes), scheduled, or ad-hoc batches. Metrics: Percentages reveal drift (e.g., task adherence drops from 92% to 75% post-prompt tweak).

For new projects like a Contoso travel agent (hotels, cars, flights): Too many models (2M+ on Hugging Face, 11K+ in Azure)? No data? Foundry prototypes fast: Define instructions/tools/model, trace first run, eval workflow.

"You have no existing data. This app never existed before. Where do I get the data to even do evals?" —Nitya Narasimhan, highlighting the zero-to-prototype challenge solved by auto-generated datasets in their observe skill.

Observe Skill: One-Prompt Auto-Evals, Optimization, Rollback

Demo steals the show: "Observe skill"—a meta-agent for agents with zero setup.

  1. Point at your agent (e.g., coding agent).
  2. Generates synthetic eval dataset (queries, gold responses).
  3. Batch evals across metrics.
  4. A/B tests prompt versions.
  5. Optimizes (e.g., rephrases for +15% task adherence).
  6. Rolls back to best; shows reasoning/trace.

In travel agent: Surfaces unknown failures like hallucinated bookings or wrong tool chains. Timeline: Minutes, not days. Costs: ~$10 for workshop runs (Azure free tier viable). Repo includes Codespaces dev container: VS Code in browser, deps pre-installed, extensions for notebooks/skills.

Tradeoffs: Relies on judge-model quality (use strong ones like o1); synthetic data misses real edges (pair with red teaming). Evolution: v1 single-agent → multi-agent traces.

This accelerates optimize loop: Data → Insight → Iteration. Nitya: Treat repo as 4-hour workshop; fork all branches for updates.

"The skill shows its reasoning at each step, which is where the real value is: it surfaces the failures you didn't know to look for." —From session description, underscoring transparent auto-debugging over black-box scores.

Red Teaming and Fleet-Scale Monitoring for Production

Safety beyond normal users: Red teaming pits adversarial AI (second agent) against yours with jailbreak prompts, finding vulns pre-launch. Integrates open-source like Pirat repo; one-click in Foundry.

Scale to fleets: Centralized dashboard across many multi-agent systems. Pull cloud monitors (Azure), schedule evals, fleet views. Hosting-agnostic: Build in LlamaIndex, host/observe in Foundry.

Results: Brand protection (quality), user safety (guardrails), quick fixes (e.g., 40% faster debug via traces). Future: Security deep-dive (missed here, but repo/Discord cover).

"Red teaming is not something you do alone." —Amy Boyd, on collaborating via open-source and platform tools to stress-test agents proactively.

Key Takeaways

  • Fork the Microsoft Foundry repo early; use Codespaces dev containers to skip env setup and run 4-hour workshops in minutes.
  • Instrument all agents with OpenTelemetry for portable tracing—centralize in Foundry control plane regardless of build tools.
  • Evaluate workflows end-to-end: Intent → Tool call → Task adherence; use built-ins first, customize for gaps.
  • Deploy 'observe skill' for zero-data starts: Auto-generate evals, optimize prompts, rollback—watch reasoning traces.
  • Red team adversarially; monitor fleets continuously to catch drift before users notice.
  • Start small: Prototype travel/coding agents in repo to learn model/tools/instructions triad.
  • Budget ~$10/run; join Discord for credits/tips—non-deterministic agents demand probabilistic metrics (80-95% norms).
  • Tradeoff honesty: Synthetics great for baselines, but pair with real/user data for production.
  • Evolve: Single → Multi-agent → Fleet; OTel ensures future-proofing.