The Real Cost of Running AI Agents in Production

The warehouse manager doesn't stare at the price of the forklift. They stare at the floor plan, the shift schedule, the damaged-goods rate, the insurance line, and the cost of the person driving the forklift. The forklift is 8% of the operation. Everything else around it is the other 92%. That's exactly the shape of the real cost of running AI agents in production, and almost every team I've watched set an agent budget has anchored on the forklift.
The token bill is the number on the pricing page. It's real. It's also 10-20% of what an agent actually costs to run at any serious scale. The other 80-90% is spread across infrastructure, evaluation, human review, engineering maintenance, and the cost of the agent being wrong. This piece walks through every line in that budget, in the order it shows up on a real invoice, using April 2026 pricing for Claude Sonnet 4.6, GPT-5, and Gemini 2.5.
If you're budgeting an agent project and the only number you've been given is "$X per million tokens," stop and read this before the check gets signed. Budgets built on token math alone get blown by a factor of 3-5x inside the first quarter. I've seen it in every vendor-driven deployment.
What does an AI agent actually cost per task?
The short version: 5-25x the raw token cost of a single model call, depending on how many tools the agent uses, how many review loops it runs, and how much observability you're paying for. A "single agent call" is almost always a misleading unit, because one user request can trigger 5-15 model calls inside the loop.
Walk through a single mid-complexity task to make it concrete. The task: a customer emails asking for a refund on an order that's two weeks late. The agent reads the email, classifies intent, pulls the order from Shopify, checks shipping with ShipStation, confirms the package is lost, drafts the refund in Stripe, logs the interaction in Zendesk, and writes a human-reviewable response. Eight tool calls. Three model decisions between them. Roughly four model calls end-to-end with batching.
Token math, using Claude Sonnet 4.6 at $3 per million input tokens and $15 per million output tokens:
- Average input per call (task context, tool schemas, prior results): 8,000 tokens
- Average output per call (reasoning, tool call, or final response): 500 tokens
- Four calls per task: 32,000 input, 2,000 output
- Cost: $0.096 input + $0.030 output = $0.126 per task
Twelve cents. This is the number every vendor deck leads with. At 10,000 tasks a month, that's $1,260. Not scary. Not even interesting.
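If you want to replay that math with your own call counts and context sizes, here's the whole calculation as a few lines of Python. The prices and token counts are the assumptions from the walkthrough above; swap in yours.

```python
# Per-task token cost for the refund agent described above.
INPUT_PRICE = 3.00 / 1_000_000    # Claude Sonnet 4.6: $3 per million input tokens
OUTPUT_PRICE = 15.00 / 1_000_000  # $15 per million output tokens

def task_token_cost(calls=4, input_per_call=8_000, output_per_call=500):
    """Raw model cost for one multi-call agent task."""
    return calls * (input_per_call * INPUT_PRICE + output_per_call * OUTPUT_PRICE)

per_task = task_token_cost()
print(f"per task: ${per_task:.3f}")                    # $0.126
print(f"per 10,000 tasks: ${per_task * 10_000:,.0f}")  # $1,260
```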
Here's what actually lands on the invoice.
What infrastructure costs are invisible until the first bill?
Seven line items show up that nobody mentions in the model card. Budget for every one of them from day one or get surprised in month two.
Vector database. Every agent that pulls from a knowledge base needs one. Pinecone, Weaviate, or self-hosted pgvector. Expect $200-$1,500/month depending on corpus size and query volume. For a mid-size deployment with a million embedded chunks and moderate QPS, budget $600.
Embedding generation. When documents change, you re-embed. At $0.10 per million tokens with OpenAI's current embedding model, this is small per run but adds up if you're reprocessing nightly. Budget $100-$400/month.
Observability and tracing. LangSmith, Braintrust, Helicone, Arize. You cannot debug an agent without trace logs that show every model call, every tool response, every retry. This is not optional. Budget $300-$1,200/month. Teams that skip this pay it back in engineering time at 10x the rate.
Orchestration runtime. The servers that run your agent loop. If you're on LangGraph, CrewAI, or a custom stack, you're paying for compute. For moderate volume: $300-$1,000/month on AWS or Vercel.
Queue and retry infrastructure. Agents fail. Tool calls time out. Rate limits hit. You need a queue (SQS, Redis, Temporal) and a retry strategy. Budget $150-$500/month.
Secrets and API management. Every tool connection is a credential that needs rotation and audit. Doppler, AWS Secrets Manager, Vault. $50-$300/month.
Log storage and compliance. Agents that touch customer data generate audit trails. Regulators care about this. S3 + CloudTrail or equivalent: $100-$500/month.
Total monthly non-model infrastructure for a mid-size deployment: $1,600-$5,400. The token bill for that same deployment was $1,260. Infrastructure is bigger than tokens before a single engineer has touched the thing.
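Here's the same list as ranges you can total in one place. Note the low end uses the $600 mid-size vector DB figure rather than the $200 minimum, which is how the floor lands at $1,600.

```python
# Monthly non-model infrastructure, as (low, high) dollar ranges from the list above.
INFRA = {
    "vector_db":     (600, 1_500),   # mid-size floor, not the $200 minimum
    "embeddings":    (100, 400),
    "observability": (300, 1_200),
    "orchestration": (300, 1_000),
    "queue_retry":   (150, 500),
    "secrets":       (50, 300),
    "log_storage":   (100, 500),
}

low = sum(lo for lo, _ in INFRA.values())   # $1,600
high = sum(hi for _, hi in INFRA.values())  # $5,400
print(f"monthly infra: ${low:,} - ${high:,}")
```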
How much does human review actually cost?
The part most finance teams refuse to price correctly. Every business-critical agent needs a human checkpoint on the outputs that matter. Refunds above $100. Legal contract changes. Anything customer-facing at launch. Anything touching financial reports.
Math for the refund agent from earlier. Suppose 10,000 tickets a month, agent auto-resolves 40% cleanly, 35% need a human glance, and 25% escalate to a full human handle. That means 3,500 tickets need a fast review and 2,500 still need a real agent (the human kind, I mean).
Fast review: a human spends an average of 45 seconds approving or editing the agent's draft. Across 3,500 tickets, that's 43.75 hours/month. At a fully loaded cost of $60/hour, that's $2,625/month.
Full escalations: these are the same cost they were before the agent. Not a saving, not an expense, neutral.
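The same review math as a sketch, so you can swap in your own ticket volume, review split, and loaded rate:

```python
# Human review cost for the ticket split described above.
TICKETS = 10_000
FAST_REVIEW_SHARE = 0.35   # tickets that need a human glance at the draft
REVIEW_SECONDS = 45        # average time per fast review
LOADED_RATE = 60.0         # fully loaded $ per hour

fast_reviews = TICKETS * FAST_REVIEW_SHARE       # 3,500 tickets
hours = fast_reviews * REVIEW_SECONDS / 3_600    # 43.75 hours
print(f"review cost: ${hours * LOADED_RATE:,.0f}/month")  # $2,625
```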
Review is often the single largest run-cost line. $2,625 dwarfs the $1,260 token bill. It's also the line that gets wrongly labeled "existing headcount" and hidden from the project P&L. If you're not counting review time against the agent, you're lying to yourself about the ROI.
What do evaluation and maintenance really require?
An agent is not a thing you build once. Every model version change, every prompt tweak, every new tool means running the eval set again and sometimes rewriting the prompt. Treat this as ongoing engineering, not a one-time project.
Initial eval set construction. 200 labeled examples, built by the people who know the business best. 40-80 hours of domain expert time. At $100/hour loaded, that's $4,000-$8,000 upfront. Non-optional. The people who skip this step ship broken agents.
Ongoing eval runs. Each full eval pass costs roughly the same as processing 200 tasks through your agent. With the numbers above, that's about $25 in tokens per run. If you run evals weekly, that's $100/month. Trivial. Do not skip.
Prompt maintenance. Expect a prompt revision roughly every 4-6 weeks: an afternoon of engineering time, call it $800 a revision. That's 9-13 revisions a year, or roughly $7,000-$10,500 in prompt work alone.
Model upgrade cycles. Anthropic ships a new Claude model every 3-6 months. OpenAI does the same with GPT. Each upgrade is a day of re-running evals, a day of prompt adjustments, sometimes a week of retesting tool behavior. Figure 5-10 engineering days a year per agent. At $800/day, that's $4,000-$8,000.
Tool maintenance. Every external API (Shopify, Stripe, Salesforce) ships breaking changes. Budget 2-3 engineering days a month for integration upkeep on a real deployment. Another $20,000-$30,000 a year.
Total engineering maintenance: between 0.25 and 0.5 FTE for a real production agent. At $150k loaded per engineer, that's $37,500-$75,000/year. Per agent.
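Itemized, the maintenance math above looks like the sketch below. The day rate and counts are the estimates from this section; evals, on-call, and incident response are what push the real total toward the upper half of the FTE range.

```python
# Annual engineering maintenance, itemized from the estimates above.
DAY_RATE = 800       # loaded $ per engineering day
FTE_COST = 150_000   # loaded $ per engineer-year

LINES = {
    "prompt_revisions": (9 * 800, 13 * 800),             # 9-13 revisions/yr at ~$800 each
    "model_upgrades":   (5 * DAY_RATE, 10 * DAY_RATE),   # 5-10 days/yr
    "tool_maintenance": (24 * DAY_RATE, 36 * DAY_RATE),  # 2-3 days/month
}

low = sum(lo for lo, _ in LINES.values())
high = sum(hi for _, hi in LINES.values())
print(f"itemized: ${low:,} - ${high:,}/year "
      f"({low / FTE_COST:.2f} - {high / FTE_COST:.2f} FTE)")
# Evals, on-call, and incident response fill the gap to 0.25-0.5 FTE.
```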
This is where the sticker shock hits. The agent isn't "built." It's hosted. Like a production database, not like a feature launch.
What is the cost of the agent being wrong?
The line item nobody models and everybody pays. A hallucinated refund sent to the wrong customer. A contract summary that misses a liability clause. A compliance agent that filed something it shouldn't have. One incident like this can eat a year of savings.
Risk math, done plainly (a sketch of the expected-loss formula follows the list):
- Material error rate per task: 0.1-2%, depending on the task and how aggressive the human checkpoint is. Even a well-built agent with a 1% error rate on 10,000 tasks is 100 errors a month.
- Cost of one material error: varies wildly. A misrouted support ticket is $2. A wrongly approved refund is $40-$500. A mistakenly sent legal doc is $5,000-$50,000.
- Expected monthly loss: for a customer service agent, maybe $200-$2,000/month. For a legal document agent with no human checkpoint, catastrophic.
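The formula behind those bullets is just volume times rate times cost per error. A minimal sketch, assuming a blended $2-$20 cost per error for the customer service case, which is what makes the $200-$2,000 range pencil out:

```python
# Expected monthly error cost: volume x error rate x cost per error.
def expected_monthly_loss(tasks, error_rate, cost_per_error):
    return tasks * error_rate * cost_per_error

# Customer service agent: 1% material error rate, blended $2-$20 per error.
lo = expected_monthly_loss(10_000, 0.01, 2)    # $200
hi = expected_monthly_loss(10_000, 0.01, 20)   # $2,000
print(f"expected loss: ${lo:,.0f} - ${hi:,.0f}/month")
```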
The insight most teams miss: the way you reduce this cost isn't a better model. It's a better workflow. Route high-cost actions through a human. Sample-check low-cost actions. Build an anomaly detector that flags when the agent's behavior changes. This is what real production looks like, and it's what gets cut from the budget when someone prices the project on tokens alone.
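A minimal sketch of that routing policy. The $100 threshold and 5% sample rate are illustrative assumptions, not recommendations:

```python
import random

def route(dollar_impact, review_threshold=100, sample_rate=0.05):
    """Decide whether an agent action ships directly or waits for a human."""
    if dollar_impact >= review_threshold:
        return "human_review"   # every high-cost action gets a gate
    if random.random() < sample_rate:
        return "sample_check"   # audit a random slice of the cheap ones
    return "auto_approve"

print(route(250))   # always "human_review"
print(route(12))    # usually "auto_approve", occasionally "sample_check"
```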
How do the numbers add up for a real deployment?
One full budget, month-by-month, for a hypothetical: a 200-person company building a customer support agent that handles 10,000 monthly tickets across Shopify, Stripe, and Zendesk, with a 60% auto-resolution target.
Build phase (months 1-3):
- Engineering (1.5 FTE for 12 weeks): $54,000
- Eval set construction (domain experts): $6,000
- Infrastructure setup (vendors, contracts): $4,000
- Consulting or specialized help: $20,000
- Contingency: $10,000
- Build total: $94,000
Run phase (monthly, starting month 4):
- Model tokens (Claude Sonnet 4.6): $1,260
- Vector DB + embeddings: $700
- Observability + tracing: $900
- Orchestration runtime: $600
- Queue + retry + secrets + logs: $650
- Human review (3,500 fast reviews): $2,625
- Engineering maintenance (0.35 FTE): $4,400
- Error cost reserve: $800
- Monthly run total: $11,935
Year-one total: $94k build + 9 months run = $201,000.
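The whole budget fits in a dozen lines, which is the easiest format to argue with line by line:

```python
# Year-one budget: build once, then nine months at the run rate.
BUILD = {
    "engineering": 54_000, "eval_set": 6_000, "infra_setup": 4_000,
    "consulting": 20_000, "contingency": 10_000,
}
MONTHLY_RUN = {
    "tokens": 1_260, "vector_db_embeddings": 700, "observability": 900,
    "orchestration": 600, "queue_secrets_logs": 650, "human_review": 2_625,
    "maintenance": 4_400, "error_reserve": 800,
}

build = sum(BUILD.values())       # $94,000
run = sum(MONTHLY_RUN.values())   # $11,935
print(f"build: ${build:,}  monthly run: ${run:,}")
print(f"year one: ${build + 9 * run:,}")  # $201,415, call it $201k
```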
Compare that to the vendor deck that led with $1,260 a month in tokens. The real run rate is nearly ten times the token bill, and the year-one total lands 3-5x above what a token-anchored budget predicts, which matches what I've seen across every real deployment.
ROI math. If the agent saves 1.2 FTE worth of support work at $80k loaded per FTE, that's $96,000 of year-one value. ROI in year one: roughly 0.48x. Year two, without the build cost, the run rate is $143k and the savings are still $96k, so the ROI stays below 1x.
That hurts. But it's the right number. The project still makes sense if throughput scales (the same agent handling 25,000 tickets keeps its fixed costs flat, so the cost per ticket falls sharply) or if the savings are in response time and customer experience, not just headcount. What it does not make sense to do is build this expecting a 3x return in year one. That happens in the slide deck. Not in production.
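To see why the economics improve with volume, split the run rate into variable and fixed pieces. The split below is my assumption, not a measurement: tokens, review, and the error reserve scale with ticket volume; infrastructure and maintenance mostly don't.

```python
# Run-rate scaling and ROI, using the budget above.
VARIABLE = 1_260 + 2_625 + 800           # $/month at 10,000 tickets: tokens, review, errors
FIXED = 700 + 900 + 600 + 650 + 4_400    # $/month: infra + maintenance

def annual_run(tickets):
    return 12 * (FIXED + VARIABLE * tickets / 10_000)

SAVINGS = 96_000  # 1.2 FTE of support work at $80k loaded

print(f"year one ROI: {SAVINGS / (94_000 + 0.75 * annual_run(10_000)):.2f}x")  # 0.48x
print(f"year two ROI: {SAVINGS / annual_run(10_000):.2f}x")                    # 0.67x
for t in (10_000, 25_000):
    print(f"{t:,} tickets: ${annual_run(t) / (12 * t):.2f}/ticket")  # $1.19 vs $0.76
```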
What drives the biggest variation in cost?
Three levers move the total by 2-3x in either direction. Every team should pull these before signing a check.
Lever 1: Model choice. The same workflow on Claude Haiku 4.5 instead of Sonnet 4.6 cuts the token bill by roughly 80%. For tasks where Haiku clears your accuracy bar, this is free money. For tasks where it doesn't, switching saves nothing because the review cost spikes. Run evals on both. Pick the cheapest model that passes.
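A quick way to compare two models on the same workflow. The per-million prices here are illustrative assumptions, not quotes; check the current price sheets, because the exact saving tracks whatever the gap is that month.

```python
# Same workflow, two price points. Prices are illustrative, not quotes.
PRICES = {
    "flagship": (3.00, 15.00),  # $/M input, $/M output (Sonnet-class)
    "small":    (1.00, 5.00),   # Haiku-class, assumed
}

def monthly_tokens(model, tasks=10_000, in_tok=32_000, out_tok=2_000):
    p_in, p_out = PRICES[model]
    return tasks * (in_tok * p_in + out_tok * p_out) / 1_000_000

flagship, small = monthly_tokens("flagship"), monthly_tokens("small")
print(f"flagship: ${flagship:,.0f}/month, small: ${small:,.0f}/month")
# ~67% cut at these assumed prices; the exact figure depends on the price
# sheet at the time, and it only counts if the small model passes your evals.
print(f"cut: {1 - small / flagship:.0%}")
```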
Lever 2: Human review policy. An agent with a 90% auto-resolution target costs wildly more than one at 50%, not because the tokens are different but because the review time bends the whole budget. Start at 50% auto-resolve, tighten over months as evals prove reliability.
Lever 3: Observability investment. Teams that skimp on tracing think they're saving $800/month. They pay it back in engineering time the first time an agent starts producing weird outputs in production. The right observability saves 5-10 engineering days per quarter; at $800 a day, that's $16k-$32k a year.
The cheapest agent that works is almost never the one with the cheapest model. It's the one with the right workflow, the right review policy, and the right observability. The model is a line item. The system is the product.
If you want the broader framework for deciding whether an agent is the right pattern at all, the pillar guide walks through the Job-Tool-Judgment test. If you're at the "should we even do this" stage, start with "most businesses don't need AI agents" before spending a dollar.
Frequently Asked Questions
What's a realistic first-year budget for one production agent?
For a mid-complexity agent at meaningful volume: $150k-$300k all-in, including build, run, and maintenance. Smaller scopes ship for $50k-$100k. Bigger ones, especially with strict compliance, run $500k+.
Can I avoid most of this cost by using an off-the-shelf agent platform?
You avoid some of it, you pay a platform markup, and you trade flexibility for speed. Platforms like Zendesk's AI agents or Salesforce Agentforce bundle observability, orchestration, and sometimes review tooling. The monthly cost moves from $12k of composed services to $8k-$15k of platform fees. The tradeoff is you're locked to their integrations and their pricing model.
How do I know if I'm overpaying?
Three signals of a healthy budget: token cost is less than 30% of your agent spend; human review time is accounted for as agent cost, not as 'existing headcount'; and you've run evals in the last 30 days with a documented pass rate. If any one of those isn't true, you're probably overpaying or under-measuring.
Is it cheaper to self-host open-source models for agent workloads?
Rarely, in 2026. The operational cost of running Llama 3 or DeepSeek at production quality with low latency still runs $2k-$8k/month in infrastructure plus an engineer to maintain it. For most mid-market deployments, Claude or GPT on-demand is cheaper and faster to ship. Self-hosting makes sense at very high volume, for strict data residency requirements, or for tasks where a fine-tuned small model clears the accuracy bar.
Sources
- Anthropic — Claude API pricing
- OpenAI — OpenAI pricing
- Google — Gemini API pricing
- McKinsey — The economic potential of generative AI
- Deloitte — State of Generative AI in the Enterprise, Q4 2025
- Gartner — Hype Cycle for Artificial Intelligence, 2025
- NIST — AI Risk Management Framework
- Forrester — Total Economic Impact studies

Founder, Tech10
Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.


