
Observability and Evaluation for AI Agents in Production

AI Agents · March 29, 2026 · 7 min read · Doreid Haddad

Cleanlab's 2025 survey of 95 engineering leaders running AI agents in production found two facts that should sit next to each other on the same slide. First: fewer than one in three teams are satisfied with their observability and guardrail solutions. Second: 63% of production teams plan to invest more in observability and evaluation in the next year, more than any other category. The gap is the signal. Production AI agents are running at scale, the existing tooling isn't enough, and teams know it.

This article is the blueprint for what to actually build. Microsoft's AI Agents in Production observability guidance and Google Cloud's dev guide converge on the same four layers. Here's each one, what it costs, and what it catches.

Layer 1: Per-run tracing

Every agent invocation produces a trace — a record of every step the agent took during a single run. The trace shows the input received, the prompt sent to the model, the model's output, every tool call (with input, output, latency, success/failure), every retry, the final structured output, and the total elapsed time.

Microsoft's guidance puts it directly: "Observability provides an audit trail of agent actions and decisions." The audit trail is what lets you debug why a specific case went wrong three weeks ago, defend a decision to a regulator, or replay a problematic run with new prompt logic to see if the change fixes it.

Tools that handle this layer in 2026: Langfuse (open-source, common choice for AI-native observability), Honeycomb or Datadog APM with custom OpenTelemetry instrumentation, or homemade structured logging shipped to your existing log infrastructure. For low-volume agents, structured logs with full per-trace JSON are sufficient. As volume grows, dedicated AI tracing tools earn their seat because per-trace search and replay become slow on raw logs.
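
What does the trace record itself look like? A minimal structured-logging sketch is below; the helper names (`run_with_trace`, and the assumption that your agent function returns its steps alongside the final output) are illustrative, not any particular tool's API. The shape of the record matters more than the library that ships it.

```python
import json
import logging
import time
import uuid

log = logging.getLogger("agent.traces")

def run_with_trace(agent_fn, task_input):
    """Wrap a single agent run and emit one JSON trace per invocation.

    Assumes `agent_fn` returns (final_output, steps), where `steps` is a
    list of dicts describing each model call, tool call, and retry in order.
    """
    trace = {
        "trace_id": str(uuid.uuid4()),
        "input": task_input,
        "started_at": time.time(),
        "steps": [],
        "final_output": None,
        "error": None,
    }
    try:
        final_output, steps = agent_fn(task_input)
        trace["steps"] = steps
        trace["final_output"] = final_output
    except Exception as exc:
        trace["error"] = repr(exc)
        raise
    finally:
        trace["elapsed_s"] = round(time.time() - trace["started_at"], 3)
        # Ship the full per-run record to whatever log infra you already have.
        log.info(json.dumps(trace, default=str))
    return trace["final_output"]
```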

Layer 2: Evaluation pipeline

Tracing tells you what the agent did. Evaluation tells you whether what it did was correct. You need both. The MIT NANDA finding that 95% of agent pilots fail lines up with teams skipping the second: they have logs, but no held-out evaluation against expected behavior, so they can't tell whether quality is drifting.

The pipeline shape is consistent across mature production teams:

  • A held-out eval set of 50-200 examples (input + expected output + grading rubric)
  • Automated runs on a cadence (daily or per-change is common)
  • Programmatic grading where possible, LLM-as-judge for fuzzy cases (validated against human grades), human grading on the most important examples
  • A pass-rate score on the dashboard, broken out by category (routing accuracy, draft quality, escalation logic, etc.)
  • An alert when pass rate drops more than 3 percentage points from the trailing 30-day average
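
In code, the whole pipeline is small. A minimal sketch follows, assuming an eval set stored as JSON lines with `input`, `expected`, and `category` fields, a `run_agent()` and `grade()` you supply (programmatic check or validated LLM-as-judge), and a hypothetical `alert()` hook into your incident tooling.

```python
import json
from collections import defaultdict

PASS_RATE_DROP_ALERT = 3.0  # percentage points vs. trailing 30-day average

def run_evals(eval_path, run_agent, grade, trailing_30d_avg):
    """Run the held-out eval set; return overall and per-category pass rates."""
    results = defaultdict(list)
    with open(eval_path) as f:
        for line in f:
            example = json.loads(line)
            output = run_agent(example["input"])
            passed = grade(output, example["expected"])  # bool
            results[example["category"]].append(passed)

    by_category = {cat: 100 * sum(v) / len(v) for cat, v in results.items()}
    total = sum(len(v) for v in results.values())
    overall = 100 * sum(sum(v) for v in results.values()) / total

    if trailing_30d_avg - overall > PASS_RATE_DROP_ALERT:
        alert(f"Eval pass rate dropped to {overall:.1f}% "
              f"(trailing 30-day average {trailing_30d_avg:.1f}%)")
    return overall, by_category
```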

The Berkeley/Stanford/IBM MAP study found that 74% of production teams depend primarily on human evaluation. That number reflects how early in the maturity curve evaluation tooling still is — automated grading hasn't replaced human judgment yet, especially on quality dimensions that are hard to formalize. The right pattern in 2026 is hybrid: programmatic checks where possible, LLM-as-judge where validated, human review on a sample. All three contribute to the dashboard.

Stanford CRFM's Holistic Evaluation of Language Models work has documented the limits of LLM-as-judge in detail. It works on narrow rubrics, fails on subjective ones. Validate the judge against humans before relying on it.

Layer 3: Cost tracking per task

Token spend by itself isn't useful. Token spend per completed task is. The metric you want on the dashboard: total cost (input tokens + output tokens, weighted by model price) divided by the number of tasks the agent completed. Watch this number weekly. If it goes up without volume going up, you have a leak — usually a runaway loop, context inflation, or routing to a more expensive model than the case warrants.
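
The metric itself is a few lines if your traces already capture token counts. A sketch, with placeholder model names and per-million-token prices rather than any provider's actual rates:

```python
# Illustrative prices per million tokens; substitute your provider's real rates.
PRICE_PER_M_TOKENS = {
    "frontier-model": {"input": 3.00, "output": 15.00},
    "small-model":    {"input": 0.15, "output": 0.60},
}

def cost_per_completed_task(runs):
    """`runs` is an iterable of trace dicts with model, token counts, completion flag."""
    total_cost = 0.0
    completed = 0
    for run in runs:
        price = PRICE_PER_M_TOKENS[run["model"]]
        total_cost += run["input_tokens"] / 1e6 * price["input"]
        total_cost += run["output_tokens"] / 1e6 * price["output"]
        completed += run["task_completed"]  # True/False counted as 1/0
    return total_cost / completed if completed else 0.0
```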

Break this metric out by category where you can. Cost per support ticket. Cost per invoice processed. Cost per research brief. The decomposition is what surfaces optimization opportunities — easy cases that are routing to the frontier model when they shouldn't, prompts that grew too long over time, retrieval that's pulling more context than the task needs.

The Cleanlab AI Overview specifically notes that "separating tasks into simple (code-based) and complex (LLM-based) paths can reduce API costs by up to 60%." That's the kind of optimization the cost-per-task dashboard surfaces. Without it, optimizations stay theoretical.

Layer 4: Failure-mode alerting

The fourth layer is alerting on the specific failure modes agents have. These are different from infrastructure alerts (server down, queue backed up) and need their own thresholds.

Iteration count above hard cap. Set a hard limit on iterations per agent run (10 is a reasonable default). Any run that hits the cap is a runaway loop. Alert immediately and route the case to a human queue. This is the single most impactful alert because runaway loops can burn meaningful money and produce confidently wrong outputs in equal measure.
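
The cap itself is trivial to enforce; the point is that hitting it is an alert, not a silent return. A sketch, with hypothetical `alert()` and `route_to_human()` helpers standing in for your incident and queue tooling:

```python
MAX_ITERATIONS = 10  # hard cap per agent run

def agent_loop(task, step_fn):
    """Run agent steps until done; treat hitting the cap as a runaway loop."""
    for _ in range(MAX_ITERATIONS):
        result = step_fn(task)          # assumed to return a dict
        if result.get("done"):
            return result
    alert(f"Iteration cap hit for task {task['id']}")
    route_to_human(task)
    return None
```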

Tool error rate above threshold. When the underlying API for a tool starts failing — rate limits, network issues, schema changes — the agent's error rate climbs fast. Alert when any tool's error rate exceeds 2% over an hour. Implement a circuit breaker: after three consecutive failures, stop calling the tool for a minute and route the case to a human.
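
A minimal circuit-breaker sketch matching those thresholds (three consecutive failures open the circuit for sixty seconds); wire the open-circuit error into the same human-routing path as the iteration cap:

```python
import time

class ToolCircuitBreaker:
    """Open the circuit after N consecutive failures, for a fixed cooldown."""

    def __init__(self, failures_to_open=3, cooldown_s=60):
        self.failures_to_open = failures_to_open
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, tool_fn, *args, **kwargs):
        if self.opened_at and time.time() - self.opened_at < self.cooldown_s:
            raise RuntimeError("circuit open: route this case to a human")
        try:
            result = tool_fn(*args, **kwargs)
            self.consecutive_failures = 0
            self.opened_at = None
            return result
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_to_open:
                self.opened_at = time.time()
            raise
```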

Schema validation failure rate above 1%. When more than 1% of agent outputs fail schema validation, something has shifted — a new model version, an upstream prompt change, a category of input the prompt doesn't handle. This is an early-warning signal for drift. Alert before the eval pass rate moves.
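
Tracking that rate is cheap if the agent already emits structured output. A sketch using pydantic v2 for the validation step; the `TicketTriage` model is an example schema, not a prescribed one:

```python
from pydantic import BaseModel, ValidationError

class TicketTriage(BaseModel):
    # Example output schema; replace with your agent's actual contract.
    category: str
    priority: int
    escalate: bool

def validation_failure_rate(raw_outputs):
    """Fraction of raw agent outputs that fail schema validation."""
    failures = 0
    for raw in raw_outputs:
        try:
            TicketTriage.model_validate_json(raw)
        except ValidationError:
            failures += 1
    return failures / len(raw_outputs) if raw_outputs else 0.0
```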

Eval pass-rate drop above 3 points. The slow-burn version of the schema alert. Pass rate dropping in the daily eval is the leading indicator that something fundamental changed. Investigate before customers notice.

Latency cliff at p95. When p95 latency jumps without a volume change, the underlying model has slowed down (provider issue) or the agent is doing more iterations than usual (prompt issue). Either way, the user shouldn't be the one finding out.

What this stack costs

Realistic numbers for a low-to-mid volume agent (under 10,000 runs per day):

  • Per-run tracing: $0-500/month depending on tool. Langfuse self-hosted is essentially free; cloud Langfuse or Honeycomb runs $100-500.
  • Eval pipeline: largely engineering time. The compute to run evals is small (a few thousand model calls a day at most). Tooling is mostly open-source.
  • Cost tracking: built into your existing observability if you're capturing token counts in traces.
  • Alerting: PagerDuty, Opsgenie, or your existing incident tooling. Marginal cost.

The combined operational cost is usually under $500/month for the observability stack on a mid-volume agent. The Cleanlab data shows teams under-invest by a factor of two to three — $200/month observability stacks watching $5,000/month agent operations. Wrong ratio. Bring observability up to roughly 10% of total agent operational cost and the dividends show up in fewer surprise incidents.

The build sequence

If you're standing up observability for an agent that's already in production:

Week one. Add per-run tracing. Capture every model call, tool call, retry, and final output. Ship traces to whatever your team already uses for logs.

Week two. Build a 50-example eval set from the production traces you've now collected. Set up automated daily runs. Put pass rate on a dashboard.

Week three. Add cost-per-task tracking. Find the easy optimization. Almost every agent has at least one — usually a routing change or a context trim.

Week four. Set up the four critical alerts (iteration cap, tool error rate, schema validation failures, eval pass-rate drop). Test each one by deliberately inducing the condition.

That's the four-layer stack in roughly a month, built alongside the agent it observes, and it's exactly the sequence Cleanlab's surveyed teams wish they'd run six months earlier. The retroactive observability project is more expensive than the proactive one because incidents happen during the gap.

Most agent projects fail not because the model was wrong but because the team had no way to tell whether the model was right. Observability and evaluation are how you know. Build the boring four-layer stack. The rest of the AI work gets a lot easier when you can see what your agent is actually doing.

Frequently Asked Questions

What's the difference between observability and evaluation for AI agents?

Observability tells you what the agent did on each run (traces, logs, metrics). Evaluation tells you whether what the agent did was correct (against a held-out eval set). You need both. Most teams have some observability and skip evaluation, which is why drift catches them off-guard.

What's the cheapest acceptable observability stack?

Structured logging to your existing logging infrastructure plus an open-source LLM tracing tool like Langfuse, plus a daily-running eval pipeline that posts pass-rate to a dashboard. Comfortably under $200 a month for low-to-mid volume agents. Dedicated commercial AI observability products earn their place at higher scale or in regulated environments.

How fast should observability alerts fire?

Infrastructure alerts (tool error rates, latency cliffs, iteration-cap breaches) should fire within minutes. Quality alerts (eval pass-rate drops, schema validation failure rate) can fire on the eval cadence — usually daily or per-change. Customer-impact alerts (failed escalations, repeat customer complaints) should fire immediately.

Written by Doreid Haddad

Founder, Tech10

Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.

