Building Gold and Silver Eval Sets for AI Vendor Pilots

AI vendor pilots without an eval set become demo theater. The vendor brings prepared examples, the system performs well on them, and the buyer decides based on impression rather than evidence. Per Microsoft's data science team and Maxim AI's evaluation guide, the standard practitioner pattern is "grow silver datasets and promote to gold with human review" — silver for breadth, gold for confidence. Per a practical guide on arXiv (2025), the discipline is "proactively curate representative datasets, select meaningful evaluation metrics." The right way to run pilots is with eval sets the buyer constructs, applied fairly across all vendors, scored against pre-agreed criteria.
This article lays out the methodology for building gold and silver eval sets for AI vendor evaluation: what each set is for and how to use them.
Gold and silver: what each is for
Gold set: 50-100 examples with high-quality labels from domain experts who agreed on the answer. Used for high-confidence comparison between vendors. Each example has a clear "right answer" or quality criterion.
Silver set: 200-500 examples with moderate-quality labels, broader coverage of the input distribution. Used for breadth coverage and statistical comparison across the long tail of typical inputs.
The two complement each other. Gold gives you confidence on specific high-stakes cases; silver gives you confidence on overall behavior across the distribution.
Building the gold set
Step 1: Define the categories. What are the types of inputs your AI system will handle? Typical categories: standard cases (the bulk of inputs), edge cases (unusual but legitimate inputs), adversarial cases (inputs designed to test failure modes), out-of-scope cases (inputs the system should refuse).
For a customer service AI: standard product questions, escalation triggers, harassment, account-specific sensitive questions, attempted prompt injection, off-topic chatter.
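A minimal sketch of what Step 1 can produce, in Python. The category names reuse the customer service example above; the target counts are illustrative assumptions, not a standard.

```python
# Minimal sketch of a gold-set category plan for a customer service AI.
# Category names and target counts are illustrative assumptions, not a standard.
GOLD_CATEGORIES = {
    "standard_product_questions": {"kind": "standard",     "target_examples": 30},
    "escalation_triggers":        {"kind": "edge",         "target_examples": 15},
    "account_sensitive":          {"kind": "edge",         "target_examples": 15},
    "prompt_injection":           {"kind": "adversarial",  "target_examples": 10},
    "harassment":                 {"kind": "adversarial",  "target_examples": 10},
    "off_topic":                  {"kind": "out_of_scope", "target_examples": 10},
}

total = sum(c["target_examples"] for c in GOLD_CATEGORIES.values())
print(f"Planned gold set size: {total} examples")  # 90, within the 50-100 range
```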
Step 2: Source the examples. Three sources:
- Real production data (from existing systems, anonymized) — typically 60-70% of the set
- Synthetic examples covering known edge cases — 20-30%
- Adversarial examples designed to test failure modes — 10-20%
The mix depends on your use case. For customer-facing AI, lean heavily on real production data. For high-stakes decisions, increase the adversarial share.
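To sanity-check the mix, it helps to turn the target proportions into concrete counts. A minimal sketch, assuming a 100-example gold set and a 65/25/10 split chosen from inside the ranges above:

```python
# Turn target source proportions into concrete example counts.
# The 65/25/10 split is one point inside the ranges above, not a fixed rule.
GOLD_SET_SIZE = 100
SOURCE_MIX = {
    "real_production": 0.65,   # anonymized data from existing systems
    "synthetic_edge":  0.25,   # constructed to cover known edge cases
    "adversarial":     0.10,   # designed to probe failure modes
}

counts = {source: round(GOLD_SET_SIZE * share) for source, share in SOURCE_MIX.items()}
print(counts)  # {'real_production': 65, 'synthetic_edge': 25, 'adversarial': 10}
```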
Step 3: Label with consensus. For each example, have 2-3 domain experts independently produce the expected answer or quality criterion. Where they disagree, discuss and reach consensus or exclude the example. The remaining examples form the gold set.
Inter-annotator agreement (the rate at which independent labelers agree) is itself a useful metric. Below 70% agreement on a category usually means the category is poorly defined and needs more work.
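One way to measure this is raw pairwise agreement per category. A minimal sketch, assuming each example already carries the independent labels collected in Step 3 (Cohen's kappa is a stricter alternative that corrects for chance agreement):

```python
from collections import defaultdict
from itertools import combinations

def pairwise_agreement(examples):
    """Fraction of labeler pairs that agree, per category.

    `examples` is a list of dicts like
    {"category": "prompt_injection", "labels": ["refuse", "refuse", "answer"]}
    where `labels` holds each expert's independent label for that example.
    """
    agree, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        for a, b in combinations(ex["labels"], 2):
            total[ex["category"]] += 1
            agree[ex["category"]] += int(a == b)
    return {cat: agree[cat] / total[cat] for cat in total}

# Flag categories below the 70% threshold mentioned above.
examples = [
    {"category": "prompt_injection", "labels": ["refuse", "refuse", "refuse"]},
    {"category": "escalation", "labels": ["escalate", "answer", "escalate"]},
]
for cat, rate in pairwise_agreement(examples).items():
    flag = "needs rework" if rate < 0.70 else "ok"
    print(f"{cat}: {rate:.0%} agreement ({flag})")
```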
Step 4: Document scoring rubrics. For each example, document not just the "right answer" but also how to score partial credit, what makes an answer better or worse, and which failure modes to penalize specifically.
A typical rubric: factual correctness (0-2), helpfulness (0-2), tone match (0-1), avoidance of disallowed content (0-1), for a total of 6 points.
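Once reviewers assign scores per dimension, the rubric is easy to apply mechanically. A minimal sketch of the 6-point rubric above; the dimension keys are illustrative names, not a fixed schema:

```python
# The 6-point rubric from above, expressed as maximum points per dimension.
RUBRIC = {
    "factual_correctness": 2,
    "helpfulness": 2,
    "tone_match": 1,
    "disallowed_content_avoided": 1,
}

def score_output(dimension_scores: dict) -> int:
    """Sum reviewer-assigned dimension scores, validating each against its maximum."""
    total = 0
    for dim, max_pts in RUBRIC.items():
        pts = dimension_scores.get(dim, 0)
        if not 0 <= pts <= max_pts:
            raise ValueError(f"{dim} must be between 0 and {max_pts}, got {pts}")
        total += pts
    return total

# Example: a factually correct, helpful answer with a slightly off tone.
print(score_output({"factual_correctness": 2, "helpfulness": 2,
                    "tone_match": 0, "disallowed_content_avoided": 1}))  # 5 out of 6
```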
Step 5: Validate the gold set against a baseline. Run a baseline (the simplest possible system, often just a well-prompted foundation model) against the gold set. The baseline should score somewhere in the middle: not perfectly, not terribly. If the baseline scores near 0% or near 100%, the eval set is too hard or too easy and needs adjustment.
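A minimal sketch of that sanity check, assuming `run_baseline` and `score_example` are placeholders for your own baseline call and rubric application:

```python
# Sanity-check the gold set against a simple baseline system.
# `run_baseline` and `score_example` are placeholders for your own plumbing:
# one calls the baseline model, the other applies the rubric to a single output.
def validate_gold_set(gold_examples, run_baseline, score_example, max_points=6):
    scores = [score_example(run_baseline(ex["input"]), ex) for ex in gold_examples]
    mean_fraction = sum(scores) / (len(scores) * max_points)
    if mean_fraction >= 0.95:
        return f"Baseline at {mean_fraction:.0%}: the eval set is likely too easy."
    if mean_fraction <= 0.05:
        return f"Baseline at {mean_fraction:.0%}: the eval set is likely too hard."
    return f"Baseline at {mean_fraction:.0%}: the set has room to differentiate vendors."
```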
Building the silver set
Step 1: Source broader examples. 200-500 examples covering the full distribution of expected inputs. Mostly from production data, with less curation than the gold set.
Step 2: Label with a single labeler. Each example is labeled by one domain expert rather than by a consensus of three. Lower confidence per example, but far more breadth.
Step 3: Apply lighter rubric. Either binary (correct/incorrect) or simple 3-point scoring. The silver set is for statistical comparison across many examples, not deep per-example analysis.
Step 4: Validate with sampling. Pick 20 silver examples at random and have a second labeler check the labels. If agreement is below 80%, the silver set is too noisy and needs cleaning.
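A minimal sketch of that spot check, assuming each silver example stores its original label and that `second_label_fn` stands in for however the second labeler's judgment is collected:

```python
import random

def spot_check_silver(silver_examples, second_label_fn, sample_size=20, threshold=0.80):
    """Sample silver examples, relabel them, and report agreement with the originals.

    `second_label_fn` is a placeholder for however the second labeler's
    judgment is captured (a review UI, a spreadsheet, etc.).
    """
    sample = random.sample(silver_examples, min(sample_size, len(silver_examples)))
    agreements = [second_label_fn(ex) == ex["label"] for ex in sample]
    rate = sum(agreements) / len(agreements)
    status = "acceptable" if rate >= threshold else "too noisy: clean the silver set"
    return rate, status
```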
What to do with the eval sets during pilots
Pilot setup: the vendor gets the input format, the API specifications, and the scoring rubric. The vendor does NOT see the labels; their outputs are scored against the held-out labels.
Pilot execution: the vendor's system runs against the gold and silver sets, producing outputs. Run it 2-3 times to measure variance.
Scoring: apply the rubric to each output. Aggregate by category. Compute overall score and per-category scores.
Comparison: lay out the scoring across all vendor pilots. Look at total score, per-category scores, and consistency across runs.
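A minimal sketch of the scoring and comparison steps, assuming each run's per-example results are recorded with the vendor, run, and category they belong to (the row format is an assumption, not a required schema):

```python
from collections import defaultdict
from statistics import mean, pstdev

def aggregate_pilot_scores(rows):
    """Summarize pilot results per vendor and category, including run-to-run variance.

    `rows` is a list of dicts like
    {"vendor": "A", "run": 1, "category": "prompt_injection", "score": 4, "max": 6}.
    """
    by_run = defaultdict(list)   # (vendor, category, run) -> fractional scores
    for r in rows:
        by_run[(r["vendor"], r["category"], r["run"])].append(r["score"] / r["max"])

    per_run_means = defaultdict(list)  # (vendor, category) -> mean score of each run
    for (vendor, category, _run), scores in by_run.items():
        per_run_means[(vendor, category)].append(mean(scores))

    summary = {}
    for (vendor, category), run_means in per_run_means.items():
        summary[(vendor, category)] = {
            "mean": round(mean(run_means), 3),
            "run_stdev": round(pstdev(run_means), 3),  # consistency across 2-3 runs
        }
    return summary
```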
Decision: combine eval set scores with the qualitative scores from your broader vendor evaluation framework. The eval set is technical capability evidence; the framework provides the broader picture.
Common eval set construction mistakes
Mistake 1: building the eval set with vendor input. The vendor will (often unintentionally) optimize the set for cases their system handles well. Build it independently; share it with vendors only after the labels are locked.
Mistake 2: skipping edge cases. Standard cases tell you which vendor handles the bulk well. Edge cases tell you which vendor handles the failures gracefully. Skipping edge cases means choosing a vendor based on the easy 80% and missing where the engagement actually fails.
Mistake 3: too few examples per category. With 5 examples in a category, the noise floor is too high to differentiate vendors. Aim for 10+ per category in gold, 30+ in silver.
Mistake 4: ambiguous rubrics. "Was the answer good?" is not a rubric. "Did the answer correctly state the policy AND avoid making promises beyond stated policy AND maintain neutral tone?" is a rubric. Specificity in scoring is the difference between defensible and impressionistic.
Mistake 5: not testing the rubric on a baseline. If the rubric scores everything 5/6, the rubric isn't differentiating. Calibrate the rubric on a baseline before applying it to vendor outputs.
How vendors should respond to your eval sets
Strong vendors:
- Welcome the eval set. They want clear criteria; they have nothing to hide.
- Ask reasonable clarification questions about the rubric or input format.
- Run their pilot rigorously and report results honestly even on failures.
- Discuss the failures specifically, often improving on the eval set itself by suggesting categories you missed.
Weak vendors:
- Push back on the eval set or try to negotiate which examples count.
- Request access to the labels "so we can tune our system" — refuse.
- Cherry-pick the examples they handle well and downplay the ones they don't.
- Get defensive about specific failures rather than learning from them.
The vendor's response to your eval set is itself a strong signal of capability and culture.
After the pilot: what to do with the eval set
The eval set keeps earning its keep:
Vendor selection (the immediate use). The pilot scoring drives the choice.
Ongoing regression testing. Once the chosen vendor is delivering, run their system against the eval set monthly to catch regressions (a minimal check is sketched below).
Production monitoring baseline. The eval set scores establish the baseline for production quality monitoring.
Fine-tuning data construction. If you eventually move to fine-tuning, the eval set is starter material for the training set or held-out test.
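As a concrete example of the ongoing regression testing above, a minimal sketch that compares the latest monthly run against the score locked in at vendor selection; the 5-point tolerance is an illustrative threshold to tune:

```python
# Compare a fresh eval-set run against the score recorded at vendor selection.
# The 5-percentage-point tolerance is an illustrative threshold, not a standard.
def regression_check(current_score: float, selection_score: float,
                     tolerance: float = 0.05) -> str:
    drop = selection_score - current_score
    if drop > tolerance:
        return (f"REGRESSION: score fell from {selection_score:.0%} to "
                f"{current_score:.0%}; raise with the vendor.")
    return f"OK: current score {current_score:.0%} is within tolerance of the pilot baseline."

print(regression_check(current_score=0.78, selection_score=0.86))
```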
The 3-4 week investment to build a strong eval set pays back over years.
Eval set ownership: contractual considerations
The eval set is your IP. Make sure the vendor relationship doesn't accidentally transfer it.
Contract terms to include:
- "Customer retains all rights to evaluation sets and quality criteria provided to Vendor."
- "Vendor will not use evaluation set examples or labels to train models."
- "Upon contract termination, Vendor will delete evaluation set materials within 30 days and certify deletion."
Some vendors will request the right to use eval data for model improvement. Decline. Your eval set is competitive intelligence about what your business needs; sharing it with vendors who serve your competitors is bad practice.
The honest takeaway
Gold set: 50-100 examples, consensus-labeled, scoring rubrics, edge cases included. Silver set: 200-500 examples, broader coverage, lighter labeling. Together they make pilots actually decide rather than impress.
The eval set is the buyer's discipline that fixes the asymmetry of vendor-driven pilots. Vendors that handle the eval set well are usually capable; vendors that resist it usually have something to hide.
Build the eval set independently. Apply it fairly across vendors. Score against pre-agreed rubrics. The 3-4 weeks of work produces decisions that actually correlate with engagement success and an asset that keeps working long after the vendor selection is done.
Frequently Asked Questions
Should the vendor build the eval set or should the buyer?
The buyer. Vendor-built eval sets optimize for what the vendor's system handles well. Buyer-built sets reflect actual production needs including the edge cases vendors prefer to avoid. Vendor input on the eval set is fine; vendor ownership of it is not.
How long does it take to build a useful gold/silver eval set?
Gold set of 100 examples: 1-2 weeks of focused work with 2-3 domain experts. Silver set of 500 examples: another 1-2 weeks with broader contributors. Total 3-4 weeks. The investment pays back many times over because the eval set keeps working — for vendor selection, then ongoing model regression, then fine-tuning data construction.
Sources
- arXiv (2025) — A Practical Guide for Evaluating LLMs and LLM-Reliant Systems
- Data Science at Microsoft — The path to a golden dataset, or how to evaluate your RAG?
- Maxim AI — Building a Golden Dataset for AI Evaluation
- Deepchecks AI — What is Golden Dataset? Characteristics, Types, Challenges
- Anthropic Research — Building Effective Agents
- Neurons Lab — AI Agent Evaluation Framework
- NIST — AI Risk Management Framework
- Stanford HAI — AI Index Report 2026
- Google Research — A Strategic Framework for AI Product Development and Evaluation

Founder, Tech10
Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.


