Eval Sets for AI Agents: How to Know Yours Is Actually Working

Anthropic's Building Effective Agents makes a quiet point that most "how to build an agent" content skips: the most successful teams are the ones who measure their agent against real examples and iterate from data, not from intuition. OpenAI's practical guide to building agents says the same thing in different words — start with the most capable model to establish a baseline, then test whether smaller models can clear it. Both pieces of advice are downstream of the same prerequisite: you need a way to grade outputs. That grading instrument is called an eval set, and most teams ship without one.
This article is the construction manual: how to build an eval set, where the examples come from, how to grade them, when to run the eval, and the rules that keep it useful as your agent grows. None of this appears in the current Google AI Overview for "how to build an AI agent", which is a meaningful gap, because without evals everything else in that AI Overview is just a guess at quality.
What an eval example looks like
For most agents, an eval example is a tuple: input, expected output, optional notes. The input is exactly what your production agent receives: same format, same edge cases, same noise. The expected output is what a human expert would consider a correct response. Notes might capture why the example matters, what edge case it tests, or which past bug it exists to prevent from recurring.
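In code, a case can be as small as a dataclass. A minimal sketch in Python; the `EvalCase` name, the field names, and the sample ticket are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One eval example: input, expected output, optional notes."""
    input: str                 # exactly what the production agent receives, noise and all
    expected: str              # what a human expert would call correct
    category: str = "general"  # lets pass rates be broken out per behavior later
    notes: str = ""            # why the case exists: the edge case it covers, the bug it prevents

cases = [
    EvalCase(
        input="RE: RE: FWD: invoice?? attached pls fix asap",  # messy, like real traffic
        expected="billing",
        category="routing",
        notes="Routing on a noisy subject line with no body",
    ),
]
```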
The trap to avoid is making the input too clean. Production inputs are messy. Customers don't write the way ChatGPT prompt examples are formatted. Documents arrive with weird encoding, missing fields, and noise. If your eval examples look like demo data, your eval scores will look great and your production agent will fail.
Sizing the eval set
Start at 50 examples. Grow to 200 over a few months. Stop adding once new examples mostly duplicate coverage you already have.
The numbers come from a practical tradeoff. Below 50, you don't have enough coverage to detect regressions — a 2-percentage-point drop in pass rate isn't statistically meaningful on a tiny set. Above 200, the cost of running the eval gets high enough that under deadline pressure teams skip it, which defeats the purpose. The sweet spot for production agents is 100-200 examples, weighted toward the cases that actually matter.
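That claim about statistical meaning is easy to sanity-check. A back-of-the-envelope sketch, treating each case as an independent pass/fail and assuming a roughly 90% baseline pass rate:

```python
import math

def pass_rate_stderr(pass_rate: float, n: int) -> float:
    """Standard error of a binomial proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n)

for n in (20, 50, 100, 200):
    se = pass_rate_stderr(0.90, n)
    print(f"n={n:>3}: stderr = {se * 100:.1f} percentage points")

# n= 20: stderr = 6.7 percentage points
# n= 50: stderr = 4.2 percentage points
# n=100: stderr = 3.0 percentage points
# n=200: stderr = 2.1 percentage points
```

At 50 examples a 2-point drop sits well inside the noise; at 200 it starts to look like signal.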
For agents in regulated environments (finance, healthcare, anything with audit obligations), the upper end can climb to 500 or 1,000 because you need coverage of every category the regulator might ask about. For everything else, 200 is plenty.
Where the examples come from
Three sources, in priority order.
Real production logs. Take cases the agent has actually seen and grade what it produced. This is the highest-quality signal because it reflects the actual distribution of inputs in production. A working pattern: every two weeks, sample 30 random production cases, grade them, add the borderline ones to the eval set.
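As a sketch, that biweekly loop is a few lines of Python; the `logs` record format is a stand-in for whatever your agent's logging actually produces:

```python
import random

def sample_for_review(logs: list[dict], k: int = 30, seed: int | None = None) -> list[dict]:
    """Pull k random production cases for human grading.

    Assumes `logs` is a list of {"input": ..., "output": ...} records
    from whatever store the agent writes to.
    """
    rng = random.Random(seed)
    return rng.sample(logs, min(k, len(logs)))

# Every two weeks: grade the sample by hand, then promote the borderline
# cases (graders disagreed, or the agent barely passed) into the eval set.
```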
Edge cases your team can think of. Brainstorm the inputs that scare you. Malformed data. Weird formats. Ambiguous requests. Adversarial phrasings. These won't appear naturally in production logs because they're rare, but they're disproportionately the cases that produce embarrassing failures. Manufacture them and add them to the eval.
Deliberately curated cases that test specific behaviors. Does the agent route correctly when the customer mentions a specific keyword? Does it escalate the right kind of complaint? Does it refuse the right kind of out-of-scope request? Build these examples on purpose. They're how you turn the eval into a regression test for behaviors you care about.
What to avoid: eval sets built entirely from synthetic examples generated by another LLM. They feel productive because you can write a hundred in an afternoon, but they tend to look like other synthetic examples and miss the messiness of real production inputs. Synthetic data is fine for stress-testing specific behaviors. It's bad as the bulk of the eval.
How to grade
Two main methods, plus a third that's useful but tricky.
Programmatic grading checks an output against a deterministic rule. Did the agent return the correct category from a fixed list? Did it pull the right invoice number from this PDF? Did it route to the right queue? Programmatic checks are fast, cheap, and unambiguous. Use them wherever you can.
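A sketch of both kinds of check mentioned above; the output schema, the category list, and the invoice-number format are assumptions about your system, not a standard:

```python
import re

VALID_QUEUES = {"billing", "technical", "account", "other"}  # assumed category list

def grade_category(output: dict, expected: dict) -> bool:
    """Exact match against a fixed category list."""
    return output.get("category") in VALID_QUEUES and output["category"] == expected["category"]

def grade_invoice_number(output_text: str, expected_invoice: str) -> bool:
    """Did the agent pull the right invoice number out of the document?"""
    match = re.search(r"INV-\d{6}", output_text)  # assumed invoice format
    return match is not None and match.group(0) == expected_invoice
```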
Human grading is the alternative for outputs where there isn't one right answer — the quality of a draft email, the helpfulness of a research summary, the appropriateness of a tone. A human reads the output and assigns a score on whatever rubric matters. Slow, expensive, subjective, but the only way to get signal on quality dimensions that resist automation. Use it sparingly and on the most important examples.
LLM-as-judge is the tempting third option — use a model to grade another model's outputs. It works for some cases and fails badly for others. It works when the rubric is clear and the judge is at least as capable as the model being graded. It fails when the rubric is fuzzy, or when the judge is the same model that produced the answer (the grading correlates with the production output, defeating the test). Stanford CRFM's Holistic Evaluation of Language Models work has explored this in depth and shows that LLM judges are reliable on narrow, well-defined rubrics and unreliable on subjective ones. Use it carefully. Validate against human grades before relying on it.
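A minimal judge sketch; `call_model` is a placeholder for whatever model client you use, and the rubric and 1-to-5 scale are illustrative. Note the comments about keeping the judge separate from the production model:

```python
JUDGE_PROMPT = """You are grading a draft against a rubric.
Rubric: {rubric}
Draft: {draft}
Reply with a single integer from 1 (fails the rubric) to 5 (fully meets it)."""

def judge(draft: str, rubric: str, call_model) -> int:
    """Grade one output with an LLM judge.

    `call_model(prompt: str) -> str` is a stand-in for your model client.
    Use a judge that is NOT the model that produced the draft, and pin
    its version so grades stay comparable across eval runs.
    """
    reply = call_model(JUDGE_PROMPT.format(rubric=rubric, draft=draft))
    score = int(reply.strip().split()[0])  # strict parsing; chatty replies fail fast
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned an out-of-range score: {reply!r}")
    return score
```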
Cadence
Run the eval:
- Every prompt change
- Every model upgrade (provider releases, version bumps)
- Every tool change
- Every reasonable interval, even when nothing changed (weekly is a fine default)
The "even when nothing changed" cadence is the one teams skip. It catches the cases where a model provider quietly updates behavior behind the API, or where the input distribution drifts and an old prompt starts performing worse on real production data. Without the cadence, those drifts only get caught when a customer escalates.
The result of every eval run goes on a dashboard. Pass rate at minimum. Better: pass rate broken out by category — routing decisions, draft quality, escalation logic — so you can see which dimensions are improving and which aren't.
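A sketch of a runner that produces exactly that breakdown and fails loudly on regression; `agent` and the per-category `graders` mapping are assumed interfaces, and the case schema matches the `EvalCase` sketch earlier:

```python
from collections import defaultdict

def run_eval(cases, agent, graders, min_pass_rate: float = 0.85) -> dict[str, float]:
    """Run every case; return pass rate by category; abort on regression.

    `agent(input) -> output` and `graders[category](output, expected) -> bool`
    are stand-ins for your own entry points, not a real framework.
    """
    passed, total = defaultdict(int), defaultdict(int)
    for case in cases:
        ok = graders[case.category](agent(case.input), case.expected)
        total[case.category] += 1
        passed[case.category] += int(ok)

    rates = {cat: passed[cat] / total[cat] for cat in total}
    overall = sum(passed.values()) / sum(total.values())
    if overall < min_pass_rate:
        raise SystemExit(f"eval regression: pass rate {overall:.1%} < {min_pass_rate:.0%}")
    return rates  # chart these per-category rates on the dashboard over time
```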
The rules that keep evals useful
A few rules I'd bake into any eval discipline; skipping them is what breaks teams down the road.
Hold out examples from the prompt-tuning loop. Never grade the agent on the same examples you used to write the prompt. The prompt is tuned to those examples by definition. Hold out a separate set the prompts have never seen, and grade against that. Otherwise you're measuring memorization, not capability.
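One mechanical way to enforce the holdout, as a sketch: split once with a fixed seed, commit both files, and let the tuning loop read only one of them. The file names here are arbitrary:

```python
import json
import random

def split_holdout(cases: list[dict], holdout_fraction: float = 0.3, seed: int = 7) -> None:
    """Split eval cases into a tuning set and a held-out set, exactly once.

    Prompts may be tuned against tuning_set.json only; reported scores
    come from holdout_set.json, which the tuning loop never reads.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_fraction)
    with open("holdout_set.json", "w") as f:
        json.dump(shuffled[:cut], f, indent=2)
    with open("tuning_set.json", "w") as f:
        json.dump(shuffled[cut:], f, indent=2)
```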
Add examples from production failures. Every time something embarrassing slips out, add it to the eval set. Two years in, your eval set is institutional memory of every way the agent has ever surprised you.
Version the eval set in source control. The test suite is part of the system. It should be reviewed, branched, and merged like any other code. Teams who keep their evals in a Google Doc lose them within six months.
Don't tune to the eval so hard it stops being useful. If the agent gets 100% on the eval set, the eval has stopped being a useful signal: add new, harder examples. The eval is a measurement instrument; if every measurement reads the same, the instrument is broken.
A worked example
Imagine a support triage agent that classifies inbound emails into one of four categories and either auto-replies or escalates. A working eval set might look like this:
- 50 real production tickets sampled across the four categories, each labeled with the correct category, the appropriate action (auto-reply, draft, escalate, flag), and the expected confidence band
- 20 edge cases: tickets with mixed signals, tickets in non-English languages your agent should escalate, tickets with prompt-injection attempts, tickets with inappropriate or threatening language
- 30 deliberately curated cases testing specific behaviors: VIP customer detection, refund request escalation, churn-risk language detection, billing dispute routing
- 10 negative examples: tickets that shouldn't trigger any action because they're spam or recruiter outreach
That's 110 examples. The team grades programmatically on category and action (deterministic rules), and uses LLM-as-judge plus human review on draft quality. The eval runs every prompt change, every Friday, and every time the model provider releases a new version. Results go on a dashboard with pass rate broken out by category.
That setup catches drift weeks before it shows up in customer complaints, and it makes prompt iteration measurable instead of vibes-based. None of it is glamorous. All of it is what separates production agents that improve over a year from production agents that quietly degrade.
Why this matters
The reason most agent projects plateau or stall isn't that the model wasn't good enough. It's that the team had no way to tell, from week to week, whether the system was getting better or worse. Without evals, every change is a guess and every regression is a surprise. With evals, the work becomes engineering — measurable, debuggable, and improvable. That's what turns an agent from a demo into a product.
If you're starting a project tomorrow, write the eval set before you write the prompt. The eval defines what "working" means. The prompt is a guess at how to achieve it. Build the measurement before the thing you're measuring, and the work that comes after is just iteration. Build the thing first and you'll spend the rest of the project arguing about whether it's good enough.
Most teams skip this step. Most agents fail. Those two facts are not unrelated.
Frequently Asked Questions
How big should my eval set be?
Start at 50 examples. Grow to 200 over a few months. Below 50 you can't detect regressions reliably; above 200 the cost of running the eval gets high enough that teams skip it under deadline pressure.
Where should the eval examples come from?
Real production logs first, edge cases your team can think of second, and deliberately curated examples that test specific behaviors third. Avoid eval sets built entirely from synthetic data — they tend to look like other synthetic data and miss the messiness of real inputs.
Can I use one model to grade another model's outputs?
Sometimes. LLM-as-judge works when the rubric is clear and the judge is at least as capable as the model being graded. It fails when the rubric is fuzzy or the judge is the same model that produced the answer. Validate against human grades before relying on it.
How often should I run the eval?
Every prompt change. Every model upgrade. Every tool change. And on a regular cadence (weekly is reasonable) even when nothing changed, because models behind APIs can shift behavior without notice.
Sources
- Anthropic Research — Building Effective Agents
- OpenAI — A practical guide to building agents
- Stanford CRFM — Holistic Evaluation of Language Models
- UC Berkeley — Berkeley Function-Calling Leaderboard
- NIST — AI Risk Management Framework
- Google Cloud — Patterns for AI agents

Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.


