
What Business Leaders Need to Know About AI Costs (In Plain Numbers)

AI for Business · Mar 19, 2026 · 7 min read · Doreid Haddad

The model API bill is the iceberg's tip. It's the number on the slide. It's what gets the budget approved.

Then the AI project ships. The first invoice arrives and it's cheap, just like the spreadsheet predicted. The second invoice arrives and the model spend is still cheap. Then somebody asks why support headcount is the same as before launch. Somebody else asks why the engineering team has been doing prompt tuning for six weeks. Somebody else asks why the observability bill went up. Three months in, the team realizes the actual cost of running this AI system is four to five times what the spreadsheet showed.

This article is the rest of the invoice — the costs that don't show up on a vendor's pricing page but show up on yours, plus the working formula for sizing AI projects honestly before approval.

The model bill is 10-20% of what you'll spend

Cleanest way to understand AI costs: build a stacked bar chart of total spend, with the model API at the bottom. For most production AI systems, the model API is 10-20% of the total bill. The other 80-90% is everything around the model. People budget for the bottom of the stack and act surprised when the top of the stack shows up.

Here's the stack, layer by layer.

Model API. What you pay the foundation model provider. Per-token, sometimes with discounts for cached input or batch processing. This is the line on the pricing page.

Human review. The biggest hidden cost. AI systems that auto-handle some cases still need humans on the cases below the confidence threshold. Even auto-handled cases need spot-check audits. Even escalated cases need somebody to actually handle them. For an AI handling 35% of inbound work, the remaining 65% still costs whatever it cost before — the human time isn't free.
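To see why the human side dominates, here's a blended-cost sketch. The $0.05 AI cost per case, $4.00 human cost per case, and 5% audit rate are invented placeholders for illustration, not figures from this article:

```python
def blended_cost_per_case(ai_share: float, ai_cost: float,
                          human_cost: float, audit_rate: float = 0.05) -> float:
    """Blend AI-handled and human-handled case costs.

    ai_share: fraction of cases the AI fully handles (e.g. 0.35)
    audit_rate: fraction of AI-handled cases humans still spot-check
    """
    ai_portion = ai_share * (ai_cost + audit_rate * human_cost)
    human_portion = (1 - ai_share) * human_cost
    return ai_portion + human_portion

# Hypothetical numbers: AI handles 35% of cases at $0.05 each, humans
# cost $4.00 per case, and 5% of AI-handled cases get a spot-check.
blended = blended_cost_per_case(0.35, 0.05, 4.00)
```

Even with those generous assumptions, the blended cost lands around $2.69 per case against a $4.00 baseline: the model fees barely register, and the 65% of cases humans still handle set the floor.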

Integration glue. The agent has to read from your CRM, your help desk, your order system, your knowledge base. Each integration is code that has to be written, tested, monitored, and updated when the upstream system changes. A reasonable agent has 4-6 integrations. Each is 1-2 engineering days a month after initial build, ongoing.

Prompt and eval iteration. The prompt isn't done at launch. It's the start of refinement based on what production looks like. Realistic cadence: 30-50 cases reviewed per week, prompt updates every two weeks, eval set updates every month, model version review every quarter. Roughly 8-12 hours of senior engineering per week, every week.

Observability and infrastructure. Logs, traces, metrics, alerting. Cost: $500 to $5,000 per month depending on volume. Plus storage for inputs, outputs, and intermediate states for debugging and audit. In regulated industries, multi-year retention compounds the storage bill.

Vector databases and retrieval. If your system uses RAG, there's a vector database, an embedding pipeline, and a re-indexing process. Hosted vector databases run from a few hundred dollars per month at small scale to several thousand at enterprise scale. Plus the data engineering to keep the index fresh.

On-call. Once the AI is in the path of customer communication or important workflows, somebody is on-call when it breaks. 10-15 hours of on-call work per month at steady state for a single mid-volume system.

Compliance and audit. In regulated industries, real cost. Keeping records of every model decision, producing audit trails on demand, documenting model versions and prompt versions for the compliance team.

The token math

For the model API portion specifically, here's the working formula. Five inputs:

1. Tokens per call. Input + output. A typical agent task: 2,000 input tokens, 500 output tokens. Adjust based on your specific workload.

2. Calls per day. Internal tool: 100-1,000. Customer-facing: 1,000-50,000. Backend pipeline: 5,000-500,000. Multiply by 30 for monthly volume.

3. Model price. Mid-2026 rough numbers (always check current pricing):

  • Frontier (Claude Opus 4.6, GPT-5, Gemini Ultra): $15-30 per million input tokens, $75-150 per million output
  • Mid (Claude Sonnet 4.6, GPT-5 standard, Gemini Pro): $3-5 per million input, $15-25 per million output
  • Small (Claude Haiku 4.5, GPT-5 mini, Gemini Flash): $0.25-1 per million input, $1.25-5 per million output

4. Prompt caching ratio. Cached input is typically 10x cheaper than uncached. For a well-designed system, 50-80% of input tokens are cacheable. Default assumption if you don't know: 50%.

5. Multiplier. Multiply your model cost by 5 to get all-in operational cost. For high-touch enterprise deployments, 7-10.
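Wired together, the five inputs become a small estimator. This is a sketch using the article's rough mid-2026 figures; swap in current pricing before trusting the output:

```python
def monthly_cost(
    input_tokens: int,           # input tokens per call
    output_tokens: int,          # output tokens per call
    calls_per_day: int,
    input_price: float,          # $ per million uncached input tokens
    output_price: float,         # $ per million output tokens
    cached_ratio: float = 0.5,   # share of input tokens served from cache
    cache_discount: float = 0.1, # cached input ~10x cheaper than uncached
    multiplier: float = 5.0,     # all-in operational multiplier
) -> dict:
    """Estimate monthly model spend and all-in cost from the five inputs."""
    per_m = 1 / 1_000_000
    input_cost = (
        input_tokens * cached_ratio * input_price * cache_discount
        + input_tokens * (1 - cached_ratio) * input_price
    ) * per_m
    output_cost = output_tokens * output_price * per_m
    per_call = input_cost + output_cost
    model_spend = per_call * calls_per_day * 30
    return {
        "per_call": per_call,
        "model_spend": model_spend,
        "all_in": model_spend * multiplier,
    }

# The worked example that follows: 10,000 calls/day, 3,000 in / 400 out,
# mid-tier pricing ($3/M input, $15/M output), 60% of input cached.
est = monthly_cost(3_000, 400, 10_000, 3.0, 15.0, cached_ratio=0.6)
# est["model_spend"] ≈ $3,042; est["all_in"] ≈ $15,210
```

The function reproduces the hand arithmetic in the next section, which makes it easy to rerun the whole budget when a price or volume assumption moves.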

Worked example. A SaaS company wants an AI assistant for their internal knowledge base. 10,000 calls/day, 3,000 input tokens (mostly retrieved context), 400 output. Claude Sonnet 4.6.

Per call: input cost (60% cached) = (3,000 × 0.6 × $0.30/M) + (3,000 × 0.4 × $3/M) = $0.00054 + $0.0036 = $0.00414. Output cost = 400 × $15/M = $0.006. Total = $0.01014 per call.

Per month: 300,000 calls × $0.01014 = $3,042 model spend. All-in: $3,042 × 5 = ~$15,000.

The budget that goes into the spreadsheet should say $15,000 a month, not $3,000.

The five biggest cost-reduction levers

1. Add a router. Send easy cases to a small classifier model, hard cases to the frontier. Often cuts spend by 50-80%. Even a basic two-tier router pays back in weeks.
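A minimal sketch of the two-tier idea; the model names, per-call costs, and the length-based "easy" heuristic are all placeholders for whatever classifier and model tiers you actually use:

```python
# Assumed placeholder prices, $ per call: small model vs frontier model.
SMALL_COST, FRONTIER_COST = 0.001, 0.02

def is_easy(task: str) -> bool:
    """Stub classifier. In production this would be a cheap model call or
    a real heuristic (length, keywords, historical confidence scores)."""
    return len(task) < 200 and "?" not in task  # placeholder heuristic

def route(task: str) -> tuple[str, float]:
    """Send easy cases to the small model, hard cases to the frontier."""
    if is_easy(task):
        return ("small-model", SMALL_COST)
    return ("frontier-model", FRONTIER_COST)

# If 70% of traffic turns out to be easy, blended per-call cost drops:
blended = 0.7 * SMALL_COST + 0.3 * FRONTIER_COST
# 0.7 * 0.001 + 0.3 * 0.02 = $0.0067 vs $0.02 → roughly 66% savings
```

Even this crude split lands inside the 50-80% range; a real classifier step in front only improves on it.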

2. Enable prompt caching. One-line config change on most providers. Pays back instantly when prompts have static prefixes.

3. Trim context aggressively. Most prompts include 30-50% of context the model doesn't actually need. Profile a sample of calls. Remove what isn't load-bearing. Watch the eval. If quality holds, keep the cuts.

4. Switch easy steps to smaller models. For multi-step pipelines, pin each step to the cheapest model that passes the eval for that step. Almost no pipeline needs the same model on every step.

5. Use Batch APIs. When latency doesn't matter, batch processing offers ~50% discounts. Overnight document processing, bulk classification, large translation jobs all qualify.

Five changes. Often a 60-70% reduction in monthly spend on a mature system. None require re-architecting. None require giving up quality if you're watching the eval as you go.

What a reasonable budget conversation sounds like

When you're presenting an AI project for approval, three numbers tell the real story.

Token cost. $X per month based on volume estimate.

All-in cost. Token cost × 5 (or your appropriate multiplier).

Savings or revenue impact. What does the project produce that justifies the spend?

If the savings are 3-5x the all-in cost, the project is worth doing. If below 2x, the project is risky. If above 10x, double-check assumptions — the math might be optimistic.
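Those thresholds fit in a small go/no-go helper. The 2x-3x gap isn't named in the text, so it's labeled "marginal" here as an assumption:

```python
def project_verdict(monthly_model_cost: float, monthly_savings: float,
                    multiplier: float = 5.0) -> str:
    """Apply the 2x / 3-5x / 10x thresholds to an all-in cost estimate."""
    all_in = monthly_model_cost * multiplier
    ratio = monthly_savings / all_in
    if ratio > 10:
        return "suspicious: double-check assumptions"
    if ratio >= 3:
        return "worth doing"
    if ratio >= 2:
        return "marginal"  # assumption: the text leaves 2x-3x unaddressed
    return "risky"

# The worked example: $3,042 model spend → ~$15,210 all-in.
# If the assistant saves a hypothetical $50,000/month in staff time:
print(project_verdict(3_042, 50_000))  # → worth doing
```

Running the helper with the approval-deck numbers makes the conversation concrete: one cell for model cost, one for the multiplier, one for the claimed savings, and a verdict that forces the stress test before signing.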

The most common AI cost mistakes I see in 2026

A short inventory of patterns to avoid.

Budgeting only for the API. Approving the project on the model bill, then being surprised when the operational cost shows up. Always multiply by 5.

Routing everything through the frontier model. Easy cases don't need the most capable model. A two-tier router cuts spend dramatically.

Skipping prompt caching. A free win that most teams leave on the table.

Building observability after problems start. The retroactive observability project costs 3x the proactive one.

Treating headcount savings as cost reductions on the AI line. They're real savings, but they show up under different budget categories — payroll vs cloud spend. The CFO conversation gets clearer when you separate them honestly.

Comparing AI cost to "doing nothing" instead of "doing the human alternative." Putting AI on a workflow nobody currently does manually doesn't save anything, because there's no baseline cost to reduce.

The teams who run AI well in 2026 do the boring math up front. They model the full stack, plan for operational cost, route by difficulty, cache what they can, cap loops. The model bill is the easy part. The rest is the job.

If you're sizing an AI project this quarter, multiply the model cost by 5, ask whether the savings still justify it, and only then sign. Most AI projects that look obviously valuable on the model bill alone don't survive that simple stress test. The ones that do are the ones worth shipping.

Frequently Asked Questions

What percentage of total AI project cost is actually the model API bill?

Usually 10-20%. The remainder is human review of low-confidence cases, observability tooling, integration maintenance, eval and prompt iteration, on-call rotation, and the data infrastructure underneath. Teams that budget only for tokens get surprised every time.

What's the easiest way to reduce token costs?

Three moves: enable prompt caching for repeated context, route easy cases to a smaller model, and trim context aggressively to remove fields the model doesn't actually need. Each can deliver 30-60% savings independently. Combined, they often cut model spend by 80%.

Should I use the Batch API for cost savings?

Yes, when latency doesn't matter. Batch APIs from major providers offer roughly 50% discounts in exchange for asynchronous processing. Anything that can wait minutes to hours — overnight document processing, bulk classification, large translation jobs — should run on batch.

Written by Doreid Haddad

Founder, Tech10

Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.

