
AI Agents in Production: What They Actually Do (With Examples)

AI Agents · Mar 1, 2026 · 7 min read · Doreid Haddad

A study published on arXiv in December 2025 by researchers at UC Berkeley, Stanford, and IBM surveyed 306 practitioners and ran 20 in-depth case studies of AI agents in production across 26 domains. Their finding, in one sentence: "production agents are typically built using simple, controllable approaches." Specifically, 68% execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models rather than fine-tuning, and 74% depend primarily on human evaluation.

That picture clashes with most marketing about AI agents in 2026. The marketing version is autonomous swarms making complex decisions across long horizons. The production version is a small, specialized agent that runs through a short workflow, hands off to a human at the right moment, and gets graded by a person every week. Cleanlab's separate 2025 survey of 95 engineering leaders running AI agents in production tells the same story from a different angle: fewer than one in three teams is satisfied with their observability and guardrail solutions, and 70% of regulated enterprises rebuild their entire agent stack every three months.

Both data sets point at the same conclusion. Production AI agents work — but they're smaller, more constrained, and more human-supervised than the demos suggest. This article describes what they actually look like, with five concrete workflow archetypes and the cost picture for each.

Workflow 1: Customer email triage

The most common production agent in 2026, per Cleanlab's data. The agent reads inbound support emails, classifies them, attaches the relevant customer context, and either drafts a reply or routes to a human queue. Below a configured confidence threshold, every case goes to human review. Above it, certain low-risk replies can auto-send.
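To make the handoff concrete, here's a minimal sketch of that routing decision. Assume the model has already classified the email and returned a category, a confidence score, and a draft reply; the threshold and the low-risk category list below are illustrative, not prescriptive.

```python
from dataclasses import dataclass

# Illustrative values: the article describes tuning the threshold every two weeks
# from the bottom 5% of auto-handled cases.
AUTO_SEND_THRESHOLD = 0.92
LOW_RISK_CATEGORIES = {"order_status", "shipping_update"}

@dataclass
class Triage:
    category: str        # e.g. "order_status", "refund_request"
    confidence: float    # classifier's self-reported confidence, 0..1
    draft_reply: str

def route(triage: Triage) -> str:
    """Return the action for one inbound email: auto_send, draft, or escalate."""
    if triage.confidence < AUTO_SEND_THRESHOLD:
        return "escalate"      # below threshold: every case goes to human review
    if triage.category in LOW_RISK_CATEGORIES:
        return "auto_send"     # high confidence plus low-risk category: reply goes out
    return "draft"             # high confidence but higher stakes: a human approves the draft

# A confident order-status reply auto-sends; anything shaky gets a person.
print(route(Triage("order_status", 0.97, "Your order shipped yesterday...")))   # auto_send
print(route(Triage("refund_request", 0.55, "We can refund your order...")))     # escalate
```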

What makes it production-grade isn't the model — it's the discipline around the model. A 200-example eval set graded weekly. A confidence threshold tuned every two weeks based on the bottom 5% of auto-handled cases. Categories that started as three buckets at launch (auto-handle, draft, escalate) usually expand to seven within 90 days as the team learns what production traffic actually looks like.
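The threshold itself usually comes out of that weekly grading pass rather than someone's gut. A rough sketch of one way to pick it from graded eval results, assuming each row is a (confidence, was_correct) pair; the format and target are illustrative.

```python
def pick_threshold(eval_rows: list[tuple[float, bool]], target_precision: float = 0.98) -> float:
    """Return the lowest confidence cutoff whose auto-handled cases still meet the precision target.

    eval_rows: (confidence, was_correct) pairs from the weekly graded eval set.
    """
    rows = sorted(eval_rows, key=lambda r: r[0], reverse=True)
    best = 1.0       # fall back to "auto-handle nothing" if no cutoff is safe
    correct = 0
    for count, (confidence, was_correct) in enumerate(rows, start=1):
        correct += was_correct
        if correct / count >= target_precision:
            best = confidence   # everything at or above this confidence stays above target
    return best

# Toy example with three graded cases from the weekly eval.
print(pick_threshold([(0.99, True), (0.91, True), (0.80, False)]))   # 0.91
```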

Typical setup uses Claude Sonnet 4.6 or GPT-5 as the reasoning model, with five tools: read CRM record, look up order status, search knowledge base, send email, escalate to queue. Monthly model spend on a 4,000-tickets-per-week workload runs around $1,400. The total cost — including human review on the 65% of cases that still touch a human, integration maintenance, and prompt iteration — is closer to $7,000-10,000.
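Those five tools are typically handed to the model as plain tool definitions in the generic name/description/parameters shape most model APIs accept. The names and schemas below are illustrative, not a specific vendor's format.

```python
# Tool definitions the agent loop passes to the model. The loop calls the model,
# executes whichever tool it picks, appends the result, and repeats until the
# agent drafts, sends, or escalates.
TOOLS = [
    {"name": "read_crm_record",
     "description": "Fetch the customer's CRM record by email address.",
     "parameters": {"type": "object",
                    "properties": {"email": {"type": "string"}},
                    "required": ["email"]}},
    {"name": "lookup_order_status",
     "description": "Return shipping and fulfillment status for an order ID.",
     "parameters": {"type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"]}},
    {"name": "search_knowledge_base",
     "description": "Search help-center articles for relevant passages.",
     "parameters": {"type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"]}},
    {"name": "send_email",
     "description": "Send a reply to the customer. Only allowed above the confidence threshold.",
     "parameters": {"type": "object",
                    "properties": {"ticket_id": {"type": "string"}, "body": {"type": "string"}},
                    "required": ["ticket_id", "body"]}},
    {"name": "escalate_to_queue",
     "description": "Hand the ticket to a human queue with a short summary.",
     "parameters": {"type": "object",
                    "properties": {"ticket_id": {"type": "string"}, "reason": {"type": "string"}},
                    "required": ["ticket_id", "reason"]}},
]
```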

The win is real but bounded. Average response time drops from around 14 hours to 2. About 35% of tickets auto-handle. Roughly 1.5 FTE worth of human time gets freed up. Not the 10x story the demo suggested. A 4x story, which is enough.

Workflow 2: Document extraction at scale

The second most common archetype, especially in finance, insurance, and operations. The agent reads PDFs, extracts a structured object — vendor, line items, totals, due dates, PO numbers — and writes the result to the system of record. Anything that doesn't validate against the original purchase order or fails internal arithmetic checks gets flagged for human review.
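The validation step is ordinary code, not another model call. A minimal sketch of the arithmetic and PO checks, with illustrative field names:

```python
def needs_review(extracted: dict, purchase_order: dict, tolerance: float = 0.01) -> list[str]:
    """Return the list of validation failures; an empty list means the record can post."""
    problems = []

    # Internal arithmetic: line items must sum to the stated total.
    line_sum = sum(item["quantity"] * item["unit_price"] for item in extracted["line_items"])
    if abs(line_sum - extracted["total"]) > tolerance:
        problems.append(f"line items sum to {line_sum:.2f}, invoice total is {extracted['total']:.2f}")

    # Cross-check against the original purchase order.
    if extracted["po_number"] != purchase_order["po_number"]:
        problems.append("PO number does not match the purchase order")
    if abs(extracted["total"] - purchase_order["approved_total"]) > tolerance:
        problems.append("total differs from the approved PO amount")

    return problems
```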

Two patterns make this work at scale. First, model routing: a tiny classifier sends straightforward PDFs to GPT-5 mini or Claude Haiku 4.5 at $0.25 per million input tokens, while complex PDFs (multi-page, scanned, multi-table) route to Claude Opus 4.6. Second, strict output schemas: every extracted record must conform to a typed JSON schema, and any deviation routes to human review. Output validation is the safety rail.
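For the schema half, a common approach is a typed model plus a strict parse: anything that fails validation goes to review rather than being patched up by hand. A sketch using Pydantic, with illustrative field names:

```python
from datetime import date
from pydantic import BaseModel, ValidationError

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    vendor: str
    po_number: str
    due_date: date
    line_items: list[LineItem]
    total: float

def parse_extraction(raw_json: str) -> Invoice | None:
    """Validate the model's output against the schema; None means 'send to human review'."""
    try:
        return Invoice.model_validate_json(raw_json)
    except ValidationError:
        return None   # any deviation from the schema is a review case, not a best guess
```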

For a wholesale distributor processing 1,200 invoices per month, this archetype runs around $400 in model API costs and replaces roughly 16 hours of weekly data entry with about 10 hours of monthly review. Not glamorous. Reliably worth it.

Workflow 3: Research and brief generation

Higher stakes, lower volume. Common in due diligence, competitive intelligence, and analyst-heavy workflows. The agent receives a target (a company, a topic, a question), runs a sequence of retrieval steps across public filings, news, internal data, and structured databases, and produces a multi-page brief with citations.

The production discipline is heavy here because the use case is fragile. A single hallucinated date in a deal memo is a fireable mistake. Working agents in this space share three rules: every claim must trace to a source the agent actually retrieved (no claims without provenance); the agent runs multiple retrieval methods in parallel (semantic + keyword + structured query) and combines results; analysts spend their freed time fact-checking what's there, not gathering inputs.
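In code terms, the provenance rule usually means every retrieved chunk carries its source ID all the way to the citation, and the parallel retrieval is a gather-and-merge over whatever search backends the team runs. A sketch under those assumptions:

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable

@dataclass
class Hit:
    source_id: str   # filing URL, news ID, internal doc ID, carried through to the final citation
    text: str
    score: float

# Each searcher (semantic, keyword, structured query) is an async callable
# against one backend; they all return Hits in the same shape.
Searcher = Callable[[str], Awaitable[list[Hit]]]

async def retrieve(query: str, searchers: list[Searcher], top_k: int = 20) -> list[Hit]:
    """Run every retrieval method in parallel, merge, dedupe by source, keep the best-scoring hits."""
    batches = await asyncio.gather(*(search(query) for search in searchers))
    merged: dict[str, Hit] = {}
    for hit in (h for batch in batches for h in batch):
        if hit.source_id not in merged or hit.score > merged[hit.source_id].score:
            merged[hit.source_id] = hit   # one entry per source, best score wins
    return sorted(merged.values(), key=lambda h: h.score, reverse=True)[:top_k]
```

Because every Hit keeps its source_id, the drafting step can refuse to emit any sentence that doesn't map back to at least one retrieved source.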

Cost per brief lands around $40 in model spend at the volumes I've seen. Each brief used to take an analyst a full day. Production versions take an hour and a half of review. The agent didn't replace the analysts. It made them four times faster on the part of the job they hated.

Workflow 4: Internal helpdesk and knowledge access

Cleanlab's data shows this category growing fast in 2026, especially in companies with 500+ employees. The agent answers internal questions — IT, HR, facilities, policy — by retrieving from a vector store of company documentation. Routine questions ("how do I reset my VPN") get answered directly. Out-of-scope or ambiguous questions get filed as tickets and routed to the right team.
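The answer-or-ticket decision is the same thresholding pattern as the email triage agent, applied to retrieval relevance instead of classification confidence. A sketch with an illustrative cutoff and field names:

```python
from dataclasses import dataclass

RELEVANCE_CUTOFF = 0.75   # below this, the docs don't clearly cover the question

@dataclass
class DocChunk:
    doc_id: str
    text: str
    relevance: float

def answer_or_ticket(question: str, chunks: list[DocChunk]) -> dict:
    """Answer from documentation when retrieval is strong; otherwise file a ticket."""
    strong = [c for c in chunks if c.relevance >= RELEVANCE_CUTOFF]
    if not strong:
        return {"action": "file_ticket", "question": question}   # ambiguous or out of scope
    context = "\n\n".join(c.text for c in strong[:5])
    # In the real flow, this context plus the question goes to the model;
    # here we only show the routing decision and which docs get cited.
    return {"action": "answer", "cited_docs": [c.doc_id for c in strong[:5]], "context": context}
```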

The bottleneck on this workflow type is the documentation, not the model. Teams routinely spend two months auditing and rewriting internal docs because the originals were too ambiguous for the agent to answer from. Once the docs are clean, the agent commonly handles 60-70% of inbound questions without human touch. Average response time drops from hours to under a minute.

The lesson nobody puts on the slide: garbage in, agent-flavored garbage out. The agent isn't doing the hard work — the documentation is. The agent is the access layer.

Workflow 5: Lead qualification and enrichment

A production agent that scores inbound prospects, gathers signals from public data sources (LinkedIn-scraped data, Crunchbase API, the company's own website, structured firmographic providers), and returns a fit score with three written reasons. The catch: a wrong score is worse than no score because sales teams lose trust.

Production versions handle this with two techniques. First, the score isn't delivered alone — it comes with the reasoning and the specific signals behind it, so an SDR can override with one click. Second, every override goes back into the eval set as a labeled example. The agent improves continuously from real human disagreement.
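Mechanically, the override loop can be as simple as appending a labeled record to the eval set every time an SDR disagrees. A sketch with illustrative field names:

```python
import json
from datetime import datetime, timezone

def record_override(eval_path: str, company_id: str, agent_score: int,
                    agent_reasons: list[str], human_score: int, note: str) -> None:
    """Append one SDR override as a labeled eval example (JSONL, one record per line)."""
    record = {
        "company_id": company_id,
        "agent_score": agent_score,       # what the agent said
        "agent_reasons": agent_reasons,   # the three written reasons shown to the SDR
        "human_score": human_score,       # what the SDR changed it to (the label)
        "note": note,
        "overridden_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(eval_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# The next eval run scores the agent against human_score on exactly these disagreement cases.
```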

Latency is around 4 seconds per company. Throughput is roughly 8,000 prospects per week per agent. The honest part of this story: the agent is the easy half of the project. The data layer underneath — keeping the firmographic and Crunchbase pulls fresh, deduplicating company records, handling rate limits — is the hard half. Plan for that ratio.

What these all have in common

Three patterns hold across every production archetype the surveys identified.

Narrow scope. None of these tries to do "everything." Each one owns one workflow with a clear input and a clear output. The Berkeley/Stanford/IBM data — 68% of agents execute at most 10 steps — confirms what Anthropic's Building Effective Agents advised earlier: simple compositions beat ambitious autonomy.

Explicit failure path. Below the confidence threshold, the work goes to a human. There's no "the agent decided to skip this." There's "the agent flagged this for review." The 74% of teams that depend primarily on human evaluation in the Measuring Agents in Production data are using human review as both quality control and continuous training data.

Tied to a metric the business already tracked. Tickets per agent. Invoices per AP person. Briefs per analyst. The metric existed before the agent did. The agent moves the metric. Without that anchor, agent projects produce demos that nobody can defend at budget review.

The agents that work have more in common with a well-engineered backend service than with a sci-fi assistant. Logs, retries, idempotency, queues, observability. The MIT NANDA finding that 95% of agent pilots fail is the inverse statement of the same fact: the 5% that succeed are the ones whose teams treated agents as production software from day one, not as experiments. Pick a workflow your team already runs every week. Narrow the scope until the input and output fit on a sticky note. Ship the smallest agent that moves the number. That's the production version. The demo is for conferences.

Frequently Asked Questions

What does an AI agent actually look like in production?

Per the Berkeley/Stanford/IBM Measuring Agents in Production study, 68% of production agents execute at most 10 steps before requiring human intervention, 70% use prompted off-the-shelf models rather than fine-tuning, and 74% rely primarily on human evaluation. The picture is smaller and more controlled than vendor marketing suggests.

Is it true that 95% of AI agent deployments fail?

That figure comes from MIT's NANDA project, which tracks enterprise AI deployments. The 5% that succeed share specific patterns: narrow scope, human review on low-confidence cases, eval discipline, and infrastructure designed for iteration rather than stability. The Cleanlab study of 95 production teams identifies the same patterns.

What are the most common AI agent use cases in production?

Cleanlab's 2025 survey identifies document processing and customer support augmentation as the most common production deployments. They share three traits: high volume, repetitive work, and clear ROI measurement. Teams generally start with constrained, measurable workflows before expanding to more autonomous use cases.
