How Agentic AI Actually Works: From Chat to Action

AI Agents | Mar 13, 2026 | 13 min read | Doreid Haddad


Walk into any airport control tower at 4pm on a Friday and you will see a pattern that looks almost exactly like the inside of an agentic AI system. Controllers are not flying the planes. They are looking at a radar screen, reading the state of every aircraft in their sector, making a decision about the next move, issuing an instruction, and then looking at the radar again to see whether the instruction worked. If it did, they move on. If it did not, they issue a correction. The job is a loop. Perceive, plan, act, verify. Perceive, plan, act, verify. The controllers are not remarkable because they fly better than the pilots. They are remarkable because they run that loop for eight hours straight without dropping one.

That is exactly what an agentic AI system does. The model does not fly the plane. It looks at the state of the task, picks the next move, makes the call, and checks the result. The thing that changed in 2026 is not the "picking the next move" part. Models have been doing that for years. What changed is that the "making the call" and "checking the result" parts finally became reliable enough to let the model run the loop without a human driving every step.

If you have been told agentic AI is a new kind of intelligence, that framing is going to mislead you. It is a new kind of scaffolding around the same kind of model you were using in 2023. The interesting engineering does not sit inside the model. It sits in the loop around it. The rest of this article is a walk through that loop, one stage at a time, with what breaks and how much each stage actually costs.

Generative AI ends where the text does

Before we open up the loop, it helps to be clear about what generative AI is not doing. When you give Claude Sonnet 4.6 a prompt like "draft an apology email to this customer," the model reads the prompt, predicts the next token, predicts the next one, predicts the next one, and stops when it decides the reply is finished. The system is done when the text is done. Nothing in the world changed. The draft sits in a window until a human copies it somewhere.

That is a complete product for a lot of jobs. Drafting, summarizing, classifying, translating, rewriting. All of those jobs end with a piece of text, and a human does whatever comes next. If this is your use case, everything in this article is interesting background but not something you need to build.

Agentic AI begins exactly where the text would have ended. Instead of stopping when the model has produced an answer, the system takes the answer, figures out what tool to call, calls it, reads the response, and asks the model what to do next. The model is still generating tokens. It is just that some of those tokens are tool calls, and the system treats them as instructions to itself.

That is the whole shift, architecturally. The model did not become more intelligent. The system around it stopped stopping.

The decision loop: the real heart of agentic AI

Every agentic system, regardless of vendor, runs a four-stage cycle that repeats until the task is done. Different frameworks call the stages different things. The stages are the same.

Perceive. Read the current state. What does the task look like right now? What has the agent already done? What came back from the last tool call?
Plan. Decide the next move. Given the current state, what is the best thing to do next? Is the task finished? Does it need another tool call? Does it need to escalate to a human?
Act. Execute the plan. Call the tool. Send the message. Post the update. Hand control to the external world briefly.
Verify. Read the result of the act. Did it work? Did it return an error? Did the state change in the expected way?

Then back to perceive. And again. And again, until a stop condition fires.
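The four stages above can be sketched as a plain loop. This is a minimal illustration, not any framework's API: `run_agent`, `toy_plan`, `lookup_order`, and `toy_verify` are hypothetical names, the planner is a hard-coded stand-in for a model call, and the `max_loops` circuit breaker is an assumed default.

```python
def run_agent(request, plan, tools, verify, max_loops=200):
    """Run perceive -> plan -> act -> verify until a stop condition fires."""
    history = []
    for loop in range(1, max_loops + 1):
        # Perceive: assemble the current state of the task.
        observation = {"request": request, "history": list(history)}
        # Plan: ask the planner (a model, in a real system) for the next move.
        decision = plan(observation)
        if decision["action"] == "finish":
            return {"status": "finished", "loops": loop, "history": history}
        # Act: the orchestration layer, not the model, calls the tool.
        result = tools[decision["tool"]](**decision["args"])
        # Verify: check the result before recording it as done.
        verified = verify(decision, result)
        history.append({"tool": decision["tool"], "result": result, "verified": verified})
    # Stop condition of last resort: the circuit breaker.
    return {"status": "circuit_breaker", "loops": max_loops, "history": history}


# Toy stand-ins so the sketch runs end to end.
def toy_plan(observation):
    if not observation["history"]:
        return {"action": "call_tool", "tool": "lookup_order", "args": {"order_id": "744"}}
    return {"action": "finish"}


def lookup_order(order_id):
    return {"order_id": order_id, "status": "shipped"}


def toy_verify(decision, result):
    return result.get("order_id") == decision["args"]["order_id"]


outcome = run_agent("Where is order 744?", toy_plan, {"lookup_order": lookup_order}, toy_verify)
print(outcome["status"], outcome["loops"])
```

The toy run finishes in two loops: one to call the tool, one to decide the task is complete. A planner that never returns `finish` ends at the circuit breaker instead of running forever.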

A simple task takes 3-5 loops. A complex task takes 15-30. A runaway task takes 200 and then trips a circuit breaker. The cost of an agentic run scales with the number of loops, not the length of the input. This is the part that surprises teams the first time they see a bill.

The rest of this article walks through each stage, what it actually does, what breaks, and what it costs.

Stage 1: Perceive

The perceive stage is the part where the model reads everything relevant about the current state of the task. This is not trivial. A real agent has to keep track of the user's request, the tools it has available, the outputs of the last N tool calls, the error messages it hit, and sometimes a running "scratchpad" where it writes its own notes. All of this has to fit in the model's context window, which is its short-term memory for this task.

Context windows used to be tight. GPT-3.5 launched with roughly 4,000 tokens. GPT-4 shipped with roughly 8,000. The big unlock in 2024-2025 was context windows growing to 200,000 tokens on Claude Sonnet 4.6 and 1 million on Gemini 2.5. To put that in a size most people can feel: 200,000 tokens is about a 500-page book. You can drop an entire customer case file, the company policy documents, and the last hundred tool call results into a single agent run and the model sees all of it at once. This is why agents that used to fail at step 7 now succeed at step 25. Their short-term memory finally stopped filling up.

The failure mode at this stage is simple to describe and hard to fix: the model misreads the state. It thinks the refund already went through when it did not. It thinks the customer email is about order #447 when it is about order #744. It thinks the previous tool call succeeded when it silently failed. When this happens, everything downstream is wrong, because every subsequent decision is based on a broken mental model of where the task is.

The fix is boringly practical. Structure your perception. Do not hand the agent a raw dump of everything. Give it a clearly labeled state: "current user request," "tools called so far," "last tool result," "known errors." The closer you get to a structured state object, the fewer "I thought we already did that" errors you see.
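One way to picture a structured state object is a small dataclass that renders the four labels mentioned above. This is a sketch under assumptions: the class name `TaskState`, its field names, and the rendered label format are all illustrative, not a standard.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Labeled task state handed to the model instead of a raw dump."""
    user_request: str
    tools_called: list = field(default_factory=list)
    last_tool_result: str = "(none yet)"
    known_errors: list = field(default_factory=list)

    def render(self) -> str:
        # Each field gets an explicit label so the model cannot confuse them.
        called = ", ".join(self.tools_called) or "(none)"
        errors = "; ".join(self.known_errors) or "(none)"
        return "\n".join([
            f"CURRENT USER REQUEST: {self.user_request}",
            f"TOOLS CALLED SO FAR: {called}",
            f"LAST TOOL RESULT: {self.last_tool_result}",
            f"KNOWN ERRORS: {errors}",
        ])


state = TaskState(user_request="Refund order #744")
state.tools_called.append("lookup_order")
state.last_tool_result = '{"order_id": "744", "status": "delivered"}'
prompt_block = state.render()
print(prompt_block)
```

Because empty fields render as an explicit "(none)", the model sees that no errors have occurred rather than having to infer it from silence.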

The cost: the perception step costs roughly 60-80% of your total token bill on a typical agent run. It is the biggest line item, because the model re-reads the entire context on every loop. If you are trying to cut the bill on an agent, the first thing to look at is whether you are padding the perception stage with context it does not need.
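To see why re-reading dominates the bill, here is a rough sketch of cumulative input tokens across loops. The function name and all the numbers are assumptions chosen for illustration: a 10,000-token starting context that grows by 2,000 tokens per loop as tool results are appended.

```python
def cumulative_input_tokens(base_context: int, added_per_loop: int, loops: int) -> int:
    """Total input tokens billed when the full context is re-read every loop."""
    total = 0
    context = base_context
    for _ in range(loops):
        total += context           # the model reads the whole context again
        context += added_per_loop  # the latest tool result is appended
    return total


# Hypothetical run: six loops, context growing from 10,000 to 20,000 tokens.
untrimmed = cumulative_input_tokens(10_000, 2_000, 6)
# Same run, but each tool result is trimmed to 500 tokens before it is appended.
trimmed = cumulative_input_tokens(10_000, 500, 6)
print(untrimmed, trimmed)
```

Under these assumptions the untrimmed run bills 90,000 input tokens and the trimmed one 67,500, a 25% cut without touching the model or the prompt.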

Stage 2: Plan

Plan is the stage where the model decides what to do next. It looks at the current state and picks one of: call a tool, ask the user a question, escalate to a human, or declare the task complete.

This is the stage that looks most like generative AI, because it is the model thinking. The difference is that the output is constrained. The model is not writing prose. It is writing a tool call in a structured format, usually JSON, that the orchestration layer parses and executes. This is where tool calling accuracy matters. On Claude Sonnet 4.6, tool call accuracy on standard benchmarks is in the low 90s. On GPT-5, similar. On smaller or older models, it drops into the 70s and the system becomes unusable for real work, because a 20-30% failure rate on every decision compounds fast.
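The compounding is easy to put numbers on. The sketch below assumes each decision succeeds or fails independently, which is a simplification (real failures tend to cluster), and the accuracy figures are illustrative rather than measured.

```python
def run_success_probability(per_call_accuracy: float, decisions: int) -> float:
    """Chance that every decision in a run is correct, assuming independence."""
    return per_call_accuracy ** decisions


# Ten decisions in a row at two illustrative accuracy levels.
strong_model = run_success_probability(0.92, 10)
weak_model = run_success_probability(0.75, 10)
print(f"{strong_model:.2f} vs {weak_model:.2f}")
```

With these inputs, a model that is right 92% of the time completes a ten-decision run cleanly about 43% of the time, and a 75% model about 6% of the time. That gap is why the orchestration layer needs validation, retries, and verification even with a strong model.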

The failure modes at this stage are the ones that make for the best war stories. The model can call a tool that does not exist. It can pass the wrong arguments. It can loop by calling the same tool over and over with slightly different inputs, convinced that the next call will work. It can get stuck in "verification mode" where it keeps double-checking the same data. It can decide the task is done when it is not. I have seen agents confidently declare "refund processed successfully" when the refund tool had thrown a 500 error that the agent did not read.

The fix for most of these is a combination of schema validation on the tool calls (no more "tool does not exist"), retry limits (no more infinite loops), and a mandatory "verify" step after every action (no more "I assume that worked"). The orchestration layer enforces all of this. The model does not.
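A minimal sketch of the first two guards, schema validation and retry limits, as the orchestration layer might enforce them. The tool names, the `{"tool": ..., "args": ...}` call shape, and the limit of three attempts are assumptions for illustration; a production system would more likely use a JSON Schema validator.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount_cents": int},
}
MAX_ATTEMPTS_PER_TOOL = 3  # assumed retry limit


def check_tool_call(raw: str, attempts: dict):
    """Return (call, None) if the call may run, else (None, error message)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None, "tool call is not valid JSON"
    if not isinstance(call, dict) or call.get("tool") not in TOOL_SCHEMAS:
        return None, "unknown tool"
    name, args = call["tool"], call.get("args")
    schema = TOOL_SCHEMAS[name]
    if not isinstance(args, dict) or set(args) != set(schema):
        return None, f"{name} expects arguments {sorted(schema)}"
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            return None, f"{key} must be {expected_type.__name__}"
    # Retry limit: count every accepted call per tool and cut off runaways.
    attempts[name] = attempts.get(name, 0) + 1
    if attempts[name] > MAX_ATTEMPTS_PER_TOOL:
        return None, f"retry limit reached for {name}"
    return call, None


attempts = {}
good, good_error = check_tool_call('{"tool": "lookup_order", "args": {"order_id": "744"}}', attempts)
bad, bad_error = check_tool_call('{"tool": "cancel_everything", "args": {}}', attempts)
print(good_error, bad_error)
```

Returning the error as a string rather than raising lets the loop feed the message back to the model as the next observation, so it can correct the call instead of crashing the run.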

The cost: planning is cheap compared to perception, because the model is generating fewer tokens. Usually 10-15% of total run cost. But this is where model quality shows up. Paying for Sonnet 4.6 instead of Haiku 4.5 on the planning step is usually worth it, even if every other step runs on Haiku.

Stage 3: Act

Act is the stage where the orchestration layer executes the tool call the model asked for. This is not the model's job anymore. The model handed off a structured request. The system takes it, calls the real tool (an HTTP endpoint, a database query, a Slack webhook, a payment API), and captures the response.

This is where agentic AI touches the real world, and the real world is not forgiving. Tools time out. Tools return errors. Tools return data in a slightly different format than expected. Tools succeed but do not show up in the downstream system until five minutes later (hello, eventual consistency). Tools are rate-limited, authenticated, and sometimes flat out down. Every one of these needs to be handled, not by the model, but by the orchestration code.
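Handling those failures in orchestration code might look like the sketch below: transient errors are retried with exponential backoff, everything else is reported back as a structured failure. The function name, the choice of which exceptions count as transient, and the retry defaults are all assumptions.

```python
import time

# Errors assumed to be worth retrying; anything else is treated as permanent.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def call_tool(tool, args, max_attempts=3, base_delay=0.5):
    """Call a tool, retrying transient failures; never raise into the agent loop."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return {"ok": True, "result": tool(**args), "attempts": attempt}
        except TRANSIENT_ERRORS as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        except Exception as exc:
            # Permanent failure: report immediately, do not retry.
            return {"ok": False, "error": repr(exc), "attempts": attempt}
    return {"ok": False, "error": repr(last_error), "attempts": max_attempts}


# Toy tools: one times out twice before succeeding, one is always down.
calls = {"count": 0}


def flaky_lookup(order_id):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("upstream timed out")
    return {"order_id": order_id, "status": "shipped"}


def always_down(order_id):
    raise ConnectionError("service unavailable")


recovered = call_tool(flaky_lookup, {"order_id": "744"}, base_delay=0)
failed = call_tool(always_down, {"order_id": "744"}, base_delay=0)
print(recovered["attempts"], failed["ok"])
```

The important design choice is the return shape: the model always receives an explicit `ok` flag and an error string, which makes a silent failure much harder to misread as a success.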

The rise of the Model Context Protocol (MCP) in late 2024 was the single biggest improvement at this stage. Before MCP, connecting a model to a new tool meant writing custom glue code for every integration: define the tool, write the adapter, handle auth, manage schemas. MCP standardized that. Now you plug in an MCP server for Slack, or Google Drive, or a CRM, and the tool becomes available to any agent that speaks the protocol. It did for agent tooling what USB did for peripherals. Not more capable. Just standard.
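For a feel of what "standard" means here, the sketch below shows the general shape of an MCP tool definition and a tool-call request as I understand the specification: tools are described with a name, a description, and a JSON Schema for inputs, and calls travel as JSON-RPC 2.0 messages. The `post_message` tool and its fields are hypothetical; check the MCP specification before relying on any detail.

```python
import json

# Hypothetical tool as an MCP server might advertise it.
tool_definition = {
    "name": "post_message",
    "description": "Post a message to a channel",
    "inputSchema": {
        "type": "object",
        "properties": {
            "channel": {"type": "string"},
            "text": {"type": "string"},
        },
        "required": ["channel", "text"],
    },
}


def build_tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request asking the server to run a tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


request = build_tool_call(1, tool_definition["name"], {"channel": "#support", "text": "Refund issued"})
print(request)
```

Every tool follows the same envelope, so the adapter code on the agent side is written once rather than once per integration.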

The cost: act-stage cost is mostly in engineering, not tokens. Every tool you add is another adapter to build, test, monitor, and maintain. I usually budget one engineering week per tool for the first few tools, trending down as the team gets a template. If anyone tells you "we just wire up 20 tools in a weekend," they have not done it in production.

Stage 4: Verify

Verify is the stage everyone skips, and the stage where the wheels come off when it is skipped. After the tool ran, did the action actually happen? Did the refund post? Did the email send? Did the database row update? Did the calendar invite land in the right person's calendar?

Verification is not "the tool returned 200 OK." It is "I checked the downstream state and the thing I expected to change actually changed." Those are different questions, and they are different by exactly the amount of damage you can cause by answering them the same. A "200 OK" from the refund API does not mean the customer was refunded. It means the refund was accepted for processing. Fifteen seconds later, the payment processor might reject it for insufficient merchant funds. If your agent moved on after the 200 and never checked again, you now have a ticket closed as "refunded" in your system and a customer who was never refunded. That single mismatch is, in my experience, the most expensive failure pattern in agentic AI.

The fix is simple to say and annoying to build. Every action gets a corresponding verification: a readback query, a state check, a confirmation handshake. Before the agent moves on, it has to see evidence that the action landed. If the evidence is missing, it waits, retries, or escalates to a human.
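A readback check of this kind can be sketched as a small polling function. `verify_action`, the status strings, and the attempt and delay defaults are assumptions; `read_state` stands in for whatever query reads the real downstream system.

```python
import time


def verify_action(read_state, expected, attempts=5, delay_seconds=3.0):
    """Poll the downstream state until it matches, else hand off to a human."""
    for attempt in range(attempts):
        if read_state() == expected:
            return "verified"
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # wait for eventual consistency
    return "escalate"


# Toy ledger: the refund is pending for two reads, then posts.
statuses = iter(["pending", "pending", "posted"])
posted = verify_action(lambda: next(statuses), expected="posted", delay_seconds=0)

# Toy ledger where the processor rejected the refund after the 200 OK.
rejected = verify_action(lambda: "rejected", expected="posted", attempts=3, delay_seconds=0)
print(posted, rejected)
```

The first check returns "verified" on the third read. The second never sees the expected state, so the run escalates instead of closing the ticket as refunded.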

The teams that build verification in from day one ship production agents. The teams that add it later ship incidents.

The cost: verification roughly doubles the number of tool calls, which roughly doubles the token cost of the act-and-verify steps. Worth it.

What the whole cycle costs

A hypothetical cost trace for a single mid-complexity agent run, say a support resolution loop that reads a ticket, looks up an order, checks shipping status, issues a refund, and replies to the customer. Five tool calls, six loop iterations.

Perception tokens (the model re-reads context 6 times): roughly 15,000 input tokens per loop on average × 6 = 90,000 input tokens. At Claude Sonnet 4.6 input pricing, around $0.27.
Planning tokens (model reasoning and tool call generation): roughly 800 output tokens per loop × 6 = 4,800 tokens. Around $0.07.
Tool calls: 5 calls, each costing some combination of internal infrastructure and external API fees. Usually a few cents total.
Verification tokens: additional 300 output tokens per verify step × 5 = 1,500 tokens. Around $0.02.
Orchestration overhead: queueing, logging, retries, observability writes. Amortized, a few cents per run.

Single-run cost: around $0.42. Multiply by your volume. At 2,000 runs a month, that is $840 in token and infrastructure costs. The orchestration platform (LangGraph, or a custom build, or a vendor like LangSmith or Braintrust for observability) adds another $500-$1,500 a month depending on scale. Human review of the 5-10% of runs that escalate adds whatever a few hours of a reviewer's time costs.
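The arithmetic behind the trace can be reproduced in a few lines. The per-token prices are assumptions inferred from the figures above ($3 per million input tokens, $15 per million output tokens), and the two "few cents" items are set to $0.03 each purely so the total lands near $0.42; substitute your own rates.

```python
# Assumed prices in dollars per million tokens, inferred from the trace above.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def token_cost(tokens: int, price_per_mtok: float) -> float:
    return tokens / 1_000_000 * price_per_mtok


perception = token_cost(15_000 * 6, INPUT_PRICE_PER_MTOK)  # context re-read on 6 loops
planning = token_cost(800 * 6, OUTPUT_PRICE_PER_MTOK)      # reasoning and tool calls
verification = token_cost(300 * 5, OUTPUT_PRICE_PER_MTOK)  # one verify per tool call
tool_fees = 0.03       # assumed stand-in for "a few cents"
orchestration = 0.03   # assumed stand-in for "a few cents"

single_run = perception + planning + verification + tool_fees + orchestration
monthly = round(single_run, 2) * 2_000  # 2,000 runs a month

print(f"per run: ${single_run:.2f}, per month: ${monthly:.0f}")
print(f"perception share: {perception / single_run:.0%}")
```

Under these assumptions perception is a little under two thirds of the per-run cost, consistent with the 60-80% range given earlier.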

The full math, including engineering maintenance, for a mid-complexity agent in production runs in the $4,000-$8,000/month range. Most teams budget for the $840 part and are stunned by the rest.

When not to build this

Do not build an agentic loop when any of the following are true.

The task happens the same way every time. That is not an agent job. It is a cron job. Every loop you add to a deterministic task is money and complexity you did not need.
A human will review every output anyway. You are paying for a loop that terminates on a human review. Skip the loop. Build a generative step that feeds a review queue.
The action you are automating takes under 60 seconds of human time. Automating 60 seconds with $2 of infrastructure per run is a trade that almost never pencils out.
The cost of a wrong action is unbounded or legally sensitive. Medical, legal, financial decisions with real exposure. Either do not build it, or gate every action behind a human, which means you did not need the agent in the first place.
You do not have an evaluation set. No eval means no pass rate means no way to know whether you shipped a working system or a confident hallucinator. I would rather see a team spend their first two weeks building the eval and not write a single prompt than the reverse.
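On the last point, an evaluation set does not need to be elaborate to be useful. A minimal sketch, assuming exact-match scoring and a hypothetical ticket-routing task; `toy_agent` is a keyword rule standing in for the real agent.

```python
def pass_rate(agent, eval_cases) -> float:
    """Fraction of eval cases where the agent's output matches the expected one."""
    passed = sum(1 for case in eval_cases if agent(case["input"]) == case["expected"])
    return passed / len(eval_cases)


# Hypothetical eval set: ticket text and the action a human would have chosen.
EVAL_CASES = [
    {"input": "Please refund order 744", "expected": "refund"},
    {"input": "Where is my package?", "expected": "track"},
    {"input": "I want a refund", "expected": "refund"},
    {"input": "Cancel my account", "expected": "escalate"},
]


def toy_agent(ticket_text: str) -> str:
    # Stand-in for the real agent: a keyword rule.
    return "refund" if "refund" in ticket_text.lower() else "escalate"


rate = pass_rate(toy_agent, EVAL_CASES)
print(f"pass rate: {rate:.0%}")
```

The toy agent passes three of four cases. The number matters less than having one: every prompt or tool change can now be judged against the same cases.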

For the framework that decides whether your use case is agentic in the first place, the pillar on "Agentic AI vs Generative AI: What Actually Changed in 2026" lays out the Output vs Action lens and the four-part checklist. For a cost-math side-by-side of the two patterns on identical workloads, see "Generative AI vs Agentic AI: Side-by-Side for Business Leaders." The contrarian case, that most of the "agentic shift" is a reliability story rather than a capability story, is in "Agentic AI Isn't New. Here's What Actually Changed."

Frequently Asked Questions

What is the difference between a prompt and an agent loop?

A prompt is a single request to a model that returns one response. An agent loop wraps the prompt in a controller that reads tool results, generates the next prompt, calls the next tool, and repeats until a stop condition fires. The model is the same. The loop is the difference.

Why does the perception stage cost 60-80% of the bill?

The model re-reads the entire task context on every loop iteration to decide the next move. With a six-step task, the perception tokens are charged six times. Everything else (planning output, verification, tool calls) is smaller per loop. Cutting perception cost means tightening the context you pass each loop, not picking a cheaper model.

Can I skip the verify stage if the tool returned 200 OK?

No, and this is the most expensive shortcut in agentic AI. A 200 OK means the tool accepted the request. It does not mean the downstream state actually changed. Verification reads the real state and confirms the change landed. Skipping it is how refunds get closed as 'processed' when they never posted.

How many tool calls is too many in one agent run?

A simple task runs in 3-5 loops. A complex task runs in 15-30. If your agent is making 50+ tool calls on a single run, something is wrong: either the perception stage is losing state, or the plan stage is looping, or the task is too broad for one agent. Break it into smaller agents with a coordinator.

Sources

Anthropic — Introducing the Model Context Protocol
Anthropic — Claude API: tool use documentation
UC Berkeley — Berkeley Function-Calling Leaderboard
LangChain — LangGraph documentation
Anthropic — Claude Sonnet 4.6 model card
Google — Gemini API long context documentation
OpenAI — OpenAI GPT-5 documentation
NIST — AI Risk Management Framework

Written by Doreid Haddad

Founder, Tech10

Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.