What Actually Changed in the AI Stack: Generative to Agentic

Agentic AI did not arrive because models got smarter. It arrived because four specific pieces of the AI stack crossed reliability thresholds between 2023 and 2026: tool-calling accuracy, context window size, a shared protocol for connecting models to data, and structured output. None of those are "the model." All of them are plumbing. Teams that understand this build faster and waste less money.
The public conversation keeps framing the shift as "generative AI creates content, agentic AI takes action." That's accurate and also shallow. It explains the behavior. It does not explain why the same architecture people tried in 2022 (ReAct, AutoGPT, early LangChain) now works in production and didn't work then. The model is doing roughly the same job it was doing two years ago: read text, predict the next token. What changed is everything around the model.
This article is a stack-level view. If you've read our comparison of the two approaches side by side or the four-stage loop walkthrough, think of this as the "why now" that sits underneath them.
Why this matters before you spend a single engineering hour
Most teams are still budgeting for agents the way they budgeted for chatbots in 2023. They price the model and forget the rest. Then the real bill arrives: orchestration code, vector databases, tool integrations, eval infrastructure, human review. If you don't understand which parts of the stack became reliable recently, you will over-invest in the parts that were never the problem and under-invest in the parts that still are.
The short version: the model is roughly 10-20% of a working agent system. The other 80-90% is the stack. And the stack is where the last three years of progress actually lives.
The four components that crossed the threshold
There's no single invention that made agents work. There are four things that individually got good enough, and collectively tipped the system from "interesting demo" to "reliably shipping."
1. Tool-calling accuracy
In 2023, if you asked a model to call an external function with structured arguments, it got it right maybe 40-60% of the time on real-world tasks. That number was the ceiling on everything. A 55% success rate per call compounds badly: a five-step workflow succeeds only about 5% of the time end-to-end (0.55^5 ≈ 0.05), even though each individual step works more often than not.
By early 2026, top frontier models on the Berkeley Function-Calling Leaderboard routinely score above 85% on non-trivial benchmarks, with some in the low 90s. That doesn't make agents perfect. It means the math of multi-step workflows finally works in your favor. Five steps at 92% each is 65.9% end-to-end. Add a verification loop and you're in the mid-80s. That's shippable.
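The compounding above is worth making concrete. A minimal sketch in Python; modeling the verification loop as one full retry of the workflow on failure is a deliberate simplification, not a claim about how any specific verifier works:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step in a sequential workflow succeeds."""
    return per_step ** steps

# 2023-era tool calling: ~55% per call over five steps
print(round(end_to_end_success(0.55, 5), 3))  # → 0.05

# 2026-era: ~92% per call
p = end_to_end_success(0.92, 5)
print(round(p, 3))  # → 0.659

# Model a verification loop as one full retry on failure
# (a simplification; real verifiers are per-step and imperfect)
print(round(p + (1 - p) * p, 3))  # → 0.884
```

The jump from 5% to 66% comes entirely from the per-step rate; the retry lifts it the rest of the way. Small per-step gains compound into large end-to-end differences.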
The move: when you evaluate a model for agent work, ignore chat benchmarks. Look at function-calling benchmarks on the exact category of tool use your workflow needs. The trap: picking a model because it's the newest or the most talked about on social media. Tool-call reliability varies widely between model families even at similar tiers.
2. Context window size
A context window is how much text a model can read in one pass. In 2022, most production models had around 4,000 tokens of context. That's about 3,000 words of history. An agent working on a three-step task ran out of room somewhere between step one and step two because it had to carry its plan, the conversation so far, the tool output, and the system prompt all at once.
Today, frontier models like Claude Sonnet 4.6 have 200,000-token context windows. Some Gemini variants go over a million. That's about 750 pages of text the model can hold in its short-term memory in a single pass.
Why does this matter for agents? Because an agentic loop is mostly bookkeeping. The agent has to remember what it tried, what worked, what the tool returned, and what the user originally asked for. Before, that bookkeeping pushed out of the window fast, and the agent either forgot the goal or started hallucinating state. Now it fits.
The insight most teams miss: a big context window doesn't mean you should fill it. Costs are linear with input tokens. At ~$3 per million input tokens for Sonnet 4.6, a 200K context costs about $0.60 per call. Fire that ten times in a loop and the model bill alone is $6 for one task. Context is a resource, not a right.
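The budget math above can be sketched directly. The $3-per-million figure is the illustrative price used in this article, not a quote of current rates; check your provider's pricing page:

```python
# Back-of-envelope input-token cost for an agent loop.
# PRICE is illustrative (~$3 per million input tokens), not current pricing.
PRICE_PER_MILLION_INPUT = 3.00

def input_cost(tokens_per_call: int, calls: int) -> float:
    """Total input-token cost for `calls` model calls of a given size."""
    return tokens_per_call / 1_000_000 * PRICE_PER_MILLION_INPUT * calls

print(round(input_cost(200_000, 1), 2))   # → 0.6
print(round(input_cost(200_000, 10), 2))  # → 6.0
# Trimming context to 40K tokens per call cuts the same loop 5x
print(round(input_cost(40_000, 10), 2))   # → 1.2
```

The third line is the point: context management is a cost lever, not just a correctness concern.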
3. Model Context Protocol
Model Context Protocol (MCP) is an open standard Anthropic released in late 2024 that describes how an AI application talks to tools, data sources, and services. Before MCP, every team that wanted their agent to reach Gmail, a database, or an internal API wrote custom glue code. Every tool needed a bespoke wrapper. Every model needed different wrapper conventions. The plumbing ate the project.
MCP is boring infrastructure. That's the point. Think of it like USB. Before USB, every printer, keyboard, and mouse had its own port and its own driver. After USB, you plugged things in and they worked. MCP is that, for AI models and tools. An MCP server exposes a capability. An MCP client (the model application) queries it. Communication uses a standard message format based on JSON-RPC 2.0. You stop writing glue code and start connecting things.
The practical shift: a workflow that took three weeks of custom integration work in 2023 takes a day in 2026. That's not a marketing claim. That's the plumbing getting good enough that you can focus on the actual problem.
The trap: treating MCP as a requirement instead of a tool. If you're building one agent for one workflow, raw API calls may still be faster. MCP pays off when you have multiple agents, multiple tools, or a shared team.
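To make "standard message format" concrete, here is roughly what a JSON-RPC 2.0 tool-call request looks like on the wire. The tools/call method name comes from the MCP specification; the tool name and arguments are invented for illustration:

```python
import json

# Sketch of the MCP wire format: JSON-RPC 2.0 messages.
# "tools/call" is defined by the MCP spec; the tool name and
# arguments below are hypothetical, for illustration only.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_orders",
        "arguments": {"customer_id": "A123"},
    },
}

print(json.dumps(request, indent=2))
```

Every MCP client and server speaks this shape, which is exactly why the bespoke per-tool wrappers disappear.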
4. Structured output
Structured output means the model returns data in an exact format your software can read, not free-form text. In 2023, you prompted the model, asked it nicely to return JSON, and then wrote defensive parsing code that covered the cases where the model returned "Sure, here's the JSON you asked for:" followed by JSON wrapped in markdown code fences. About 2-5% of the time it still failed. On a busy pipeline, that's hundreds of failures per day.
Frontier providers now expose structured output modes that guarantee the schema. You define a schema. The model returns data that matches it. No markdown. No apologies. No free-text preamble. The parsing code you used to write disappears.
That may sound small. It's not. Structured output is the invisible glue that lets agents chain tool calls together without a human checking every handoff. When the model can reliably say "call function get_orders with customer_id=A123" in a format your code trusts, the whole loop becomes engineerable. Before, it was held together with regex and hope.
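To see what "the parsing code disappears" means, here is a sketch of the 2023-era defensive parser this section describes, handling the chatty, fence-wrapped replies mentioned above. With a structured-output mode, all of this code goes away:

```python
import json
import re

def parse_loose_json(raw: str) -> dict:
    """2023-era defensive parsing: strip preamble and markdown fences
    before trusting the JSON. Obsolete under structured-output modes."""
    # Prefer a fenced block if present: ```json ... ```
    match = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    candidate = match.group(1) if match else raw
    # Fall back to the outermost {...} span
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    return json.loads(candidate[start : end + 1])

reply = (
    "Sure, here's the JSON you asked for:\n"
    '```json\n{"tool": "get_orders", "customer_id": "A123"}\n```'
)
print(parse_loose_json(reply))
```

Even this sketch misses edge cases (nested fences, multiple objects, truncated output), which is the 2-5% failure rate the section describes.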
What's the same as 2023
All four of those are infrastructure. The model architecture itself (transformer, attention, training objective) is almost identical to what existed three years ago. Scale has increased, training data has improved, post-training techniques have matured. But the core thing, "read text and predict the next token," hasn't changed shape.
This is why teams that built on ReAct or early AutoGPT in 2023 and watched those systems crumble in production aren't wrong. They weren't early. They were on the same track, but the track wasn't laid yet. The model was doing the reasoning. The rails underneath couldn't hold the weight.
If you search for agent code from that era, you'll find it fundamentally recognizable. Same loop. Same reasoning step, same tool-calling step, same verification step. It just didn't work end-to-end because at least one of the four components was below threshold. Often all four.
Where the cost actually lives in 2026
Once you understand which parts of the stack did the work, you can see where agent projects spend money. This is where most teams get surprised.
| Cost category | Share of total | Notes |
|---|---|---|
| Model tokens (input + output) | 15-25% | What vendors quote in pricing tables |
| Orchestration + retry logic | 15-20% | Code to run the loop, handle failures, manage state |
| Tool integration + MCP servers | 10-15% | Connecting the agent to real systems |
| Vector database + RAG pipeline | 5-15% | Memory and retrieval when needed |
| Eval infrastructure | 10-15% | Benchmarks, regression tests, drift monitoring |
| Human review (the real line item) | 20-40% | Supervision on low-confidence or high-stakes outputs |
Model tokens are the number on the pricing page. They are usually the smallest line. The companies doing this well in 2026 treat agent projects like infrastructure projects. They budget for the plumbing, not just the model.
This is the conversation that gets skipped in most AI guides. You can read the full cost math on a real agent workload if you want specific numbers. The short version is: if your budget assumes the model is the whole bill, you're going to spend twice. Once on the model you picked. Then again on everything you didn't budget for.
How to sanity-check an agent project in 30 minutes
Before spending engineering time, run this checklist. It catches 80% of the projects that shouldn't be agents.
- Is the task actually multi-step across separate systems? If the whole job happens inside one tool, you probably want a plain generative call, not an agent. A one-shot summary of a PDF is not an agent task.
- Do you have a way to grade the output? Not "does it feel right." An actual eval set with inputs and expected behavior. If you can't grade it, you can't improve it, and you can't run the loop safely.
- Does at least one step involve real-world action (write, update, send, purchase)? If the answer is no, generative AI is faster, cheaper, and good enough. Agentic only earns its cost when something outside the model changes state.
- Can you define a hard failure budget? Something like "this is fine at 95% success and the other 5% gets a human." If you can't, stop. You're not ready to ship autonomy yet.
- Is the task repetitive enough to pay back the setup cost? If it happens ten times a week, a one-shot script with a human is cheaper. If it happens ten thousand times a week, an agent starts to make sense.
Teams that pass all five ship. Teams that fail on any of them should not be building agents yet; stop the project right there.
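The last checklist item, repetition versus setup cost, reduces to simple arithmetic. A hedged sketch; every number below is an illustrative assumption, not a benchmark:

```python
def weeks_to_payback(build_cost: float, runs_per_week: int,
                     saving_per_run: float, agent_cost_per_run: float) -> float:
    """Weeks until an agent's net savings cover its build cost.
    All inputs are estimates you supply; none are benchmarks."""
    net_per_week = runs_per_week * (saving_per_run - agent_cost_per_run)
    if net_per_week <= 0:
        return float("inf")  # the agent never pays for itself
    return build_cost / net_per_week

# Hypothetical: $50K build, $4 saved per run, $1 of agent cost per run.
# At 10 runs/week it essentially never pays back:
print(round(weeks_to_payback(50_000, 10, 4.0, 1.0), 1))      # → 1666.7
# At 10,000 runs/week it pays back in under two weeks:
print(round(weeks_to_payback(50_000, 10_000, 4.0, 1.0), 1))  # → 1.7
```

Run volume dominates everything else in this formula, which is why "ten times a week" and "ten thousand times a week" land on opposite sides of the decision.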
What this means for your roadmap
Three practical consequences if you're planning AI work for the rest of 2026.
First, don't pick a model before you've mapped the stack. The model is the last decision, not the first. Map what the agent needs to touch, which tools exist or need MCP servers, how you'll evaluate trajectories, and where a human sits in the loop. Then pick the model that clears your bar for the smallest bill.
Second, assume 80% of the work is not the model. Budget accordingly. Tell your CFO the line items. The teams that surprise their finance people with AI bills are almost always the teams who thought the model was the whole cost.
Third, test components in isolation. The hardest agent bugs come from compound failures across components. A 95% tool-call rate plus a 98% structured-output rate plus a 93% plan validity rate sounds fine until you realize the product of those three is 86.6% end-to-end. That gap is where production issues live.
The honest uncertainty
Here's what I don't know yet, and nobody else does either. We're roughly 18 months into the period where agents actually work. We have not seen what a five-year-old production agent looks like. We have not seen the failure modes that show up at scale, across model upgrades, across team handoffs. The teams shipping agent systems today are the first cohort to find those out.
That's not a reason to wait. It's a reason to build with the assumption that your agent system will be rewritten within 12-24 months. Keep the orchestration layer yours. Keep the eval set yours. Swap models in and out as new ones ship. The worst thing you can do right now is build a system so tightly coupled to one provider that you can't move when the next shift happens.
Because another shift is coming. The stack is still maturing. When it shifts again, you want your orchestration, your tools, and your eval set to survive. Everything else is replaceable.
Frequently Asked Questions
What specifically changed between 2023 and 2026 that made agents work?
Four stack components crossed reliability thresholds: tool-calling accuracy (from 40-60% to 85%+ on the Berkeley Function-Calling Leaderboard), context window size (from 4,000 tokens to 200,000+), Model Context Protocol (released by Anthropic in late 2024 as a standard for connecting models to tools), and structured output (which guarantees schema-compliant data instead of hoping the model returns clean JSON).
Is Model Context Protocol required to build an agent?
No. MCP is a standard that pays off when you have multiple agents, multiple tools, or a shared engineering team. For a single-purpose agent with one or two tools, raw API calls may still be faster.
How much of an agent project's cost is actually the model?
15-25% in typical production setups. The other 75-85% goes to orchestration, tool integration, evaluation infrastructure, retrieval pipelines, and human review. Teams that budget only for the model consistently overspend 2-3x.
Does a bigger context window mean better agent performance?
Not necessarily. Input tokens are billed linearly, so a 200K-token call at around $3 per million input tokens costs $0.60 per call, and agents run these in loops. Context is a resource to manage, not a ceiling to hit. Most well-designed agents stay well below their context limit.
Sources
- Anthropic — Introducing the Model Context Protocol
- Anthropic — Effective context engineering for AI agents
- Model Context Protocol — Model Context Protocol specification
- UC Berkeley — Berkeley Function-Calling Leaderboard
- Google Cloud — What is Model Context Protocol (MCP)? A guide
- IBM — Agentic AI vs. Generative AI
- MIT Sloan — Agentic AI, explained

Founder, Tech10
Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.
Read more about Doreid


