
How AI Agents Are Deployed in Production: The Infrastructure Picture

AI Agents · Mar 26, 2026 · 8 min read · Doreid Haddad

The Google AI Overview for "ai agents in production" lists Redis, PostgreSQL, queueing, observability, and human-in-the-loop checkpoints as the infrastructure backbone. Microsoft's AI Agents in Production guidance describes essentially the same components. Google Cloud's dev guide for production-ready agents walks through them in code. The convergence isn't a coincidence — once you've watched a few agent projects survive their first quarter in production, the same shape emerges every time.

This article covers what that shape actually looks like, what each component does, and what to skip until you've earned it. The question people ask most often is the obvious one: how are AI agents deployed in production? Here's the picture.

The components that show up on every production diagram

Picture a production agent stack as five layers, each with a specific job.

The reasoning layer. This is where the LLM lives. Claude Sonnet 4.6 and GPT-5 are the most common production choices in 2026. The MAP study from UC Berkeley/Stanford/IBM found that 70% of production teams use prompted off-the-shelf models rather than fine-tuned ones — the reasoning layer is rarely customized at the model level, just at the prompt level.

The state layer. Per the AI Overview's specific recommendation: Redis for short-term session state (in-flight task data, intermediate tool outputs, conversation context), PostgreSQL for long-term memory (durable records, audit trails, structured outputs that need to survive restarts). This split matters. Redis handles the millisecond-latency reads/writes the agent needs during a single task. PostgreSQL handles the records that have to be there next year for compliance review.

The retrieval layer (optional). A vector database — Pinecone, Weaviate, pgvector inside PostgreSQL — for semantic retrieval over your own documents. The Cleanlab survey makes a noteworthy point: many production teams add vector retrieval too early. Modern context windows hold roughly 600 pages of text. Most first-version agents don't need a vector store. Add it when measured context overflow forces the choice.

The action layer. This is where tool calls land. The agent's "send email" action goes through your email service. "Update CRM" goes through Salesforce or HubSpot APIs. "Create ticket" goes through Zendesk or Jira. Each integration is its own piece of code, its own auth, its own rate limit. The MAP data — 68% of agents execute fewer than 10 steps before a human intervenes — is partly a story about how few action types most production agents need. Three to five is the typical count.

The observability layer. Logs, traces, eval pipeline. Cleanlab's survey found that 63% of production teams plan to invest in observability and evaluation in the next year — it's the single biggest investment area. We'll come back to this.

How they connect, in sequence

A canonical production agent flow looks like this:

  1. A trigger fires (webhook, scheduled job, message in a queue).
  2. The orchestrator (a Python service, an n8n workflow, or whatever framework you've chosen) pulls relevant context from Redis (current state) and PostgreSQL (historical records).
  3. The orchestrator constructs a prompt and calls the model API with a defined set of tools.
  4. The model decides whether to call a tool. If yes, the orchestrator executes the tool, collects the result, validates it against a schema, and feeds it back to the model.
  5. Steps 3-4 loop until the model produces a final structured output or hits a hard iteration cap.
  6. The orchestrator validates the final output against a schema. If it fails validation, the case routes to a human queue. If it passes, the action executes.
  7. Every step gets logged with full context. Every output gets graded against an eval set on a regular cadence.

That's the whole thing. The reason it looks the same in every dev guide is that variations on this pattern handle 90% of production cases. The interesting differences live in how each layer is implemented, not in the shape of the flow.
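
A minimal sketch of steps 3 through 5 in Python. The call_model function, its response format, and the tools dict are stand-ins for whatever model client and integrations you actually use; the point is the shape of the loop, not the names.

```python
import json

MAX_ITERATIONS = 10  # hard cap so a stuck model can't loop forever (step 5)

def run_agent_task(task: dict, tools: dict, call_model) -> dict:
    """One pass through the loop: prompt, tool calls, final output or human handoff."""
    messages = [{"role": "user", "content": task["prompt"]}]

    for _ in range(MAX_ITERATIONS):
        # Step 3: call the model with the conversation so far and the available tools.
        response = call_model(messages=messages, tool_names=list(tools))

        # Step 4: if the model asked for a tool, run it and feed the result back.
        if response["type"] == "tool_call":
            result = tools[response["name"]](**response["arguments"])
            messages.append({"role": "tool", "name": response["name"],
                             "content": json.dumps(result)})
            continue

        # Step 5: the model produced a final answer instead of another tool call.
        return {"status": "completed", "output": response["content"]}

    # Iteration cap reached: route to a human rather than keep burning tokens.
    return {"status": "needs_human", "output": None}
```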

The Redis-vs-PostgreSQL split, made concrete

The AI Overview specifically calls out Redis for messaging and PostgreSQL for long-term memory. Here's what that looks like in practice.

Redis stores: the agent's working state during the current task, in-flight tool call results, recently retrieved context that might be needed again before the task ends, rate-limit counters, and distributed locks for cases where multiple agent instances might collide. Read latency is sub-millisecond. The data is ephemeral — Redis can lose state on restart and the agent shouldn't care.

PostgreSQL stores: the durable record of every task the agent has handled, including inputs, outputs, decisions, and confidence scores. Customer profiles or knowledge that has to persist across many sessions. The eval set itself, version-controlled and queryable. The audit trail required for any regulated industry.

The split exists because Redis is fast and lossy while PostgreSQL is durable and queryable. Use each for what it's good at. Trying to do everything in one or the other is a common cause of either reliability problems (PostgreSQL alone is too slow for in-flight state) or compliance problems (Redis alone loses your audit trail).
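
In code, the split looks roughly like this. It's a sketch using redis-py and psycopg; the key format, table name, and connection details are illustrative, not prescriptive.

```python
import json
import redis
import psycopg

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
pg = psycopg.connect("dbname=agents user=agent")  # illustrative connection string

def save_working_state(task_id: str, state: dict) -> None:
    # Ephemeral, fast, fine to lose on restart: Redis with a TTL.
    r.set(f"task:{task_id}:state", json.dumps(state), ex=3600)

def load_working_state(task_id: str) -> dict | None:
    raw = r.get(f"task:{task_id}:state")
    return json.loads(raw) if raw else None

def record_task_result(task_id: str, inputs: dict, output: dict, decision: str) -> None:
    # Durable, queryable, survives restarts: PostgreSQL for the audit trail.
    with pg.cursor() as cur:
        cur.execute(
            "INSERT INTO task_log (task_id, inputs, output, decision) "
            "VALUES (%s, %s, %s, %s)",
            (task_id, json.dumps(inputs), json.dumps(output), decision),
        )
    pg.commit()
```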

Queuing instead of always-on listeners

The AI Overview specifically advises this and it's worth understanding why. An always-on agent listening for triggers pays for compute while it sits idle. A queue-based agent only spins up when there's work to do.

The pattern: triggers (webhooks, cron jobs, manual API calls) drop tasks into a queue (RabbitMQ, AWS SQS, Cloud Tasks, or job-queue libraries like Sidekiq, which is Redis-backed, or Hangfire, which is database-backed). The agent process polls or subscribes to the queue, pulls a task, runs the agent loop, writes results, and acks the message. Done.

Why this matters in 2026: model costs are still meaningful, and erratic always-on agents can produce surprise bills when they get stuck in tight loops. Queuing gives you natural rate-limiting (queue depth becomes a backpressure signal), retry semantics built in, and a clean way to scale the worker count up or down without re-architecting.
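
A minimal worker sketch against AWS SQS with boto3, one of the queue options above. The queue URL is a placeholder and handle_task stands in for the agent loop; the same shape works with RabbitMQ or a database-backed queue.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/agent-tasks"  # placeholder

def worker_loop(handle_task):
    """Pull tasks, run the agent loop, ack on success. Unacked messages reappear
    after the visibility timeout, which gives you retries for free."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,   # long polling: no busy-waiting, no idle compute
        )
        for msg in resp.get("Messages", []):
            task = json.loads(msg["Body"])
            handle_task(task)                 # run the agent loop, write results
            sqs.delete_message(               # ack only after the work is done
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )
```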

Validation everywhere

This is the part most "how are AI agents deployed in production?" answers skip. Strict input/output validation is the difference between agents that drift quietly and agents that fail loudly.

The pattern: Pydantic in Python, Zod in TypeScript, JSON Schema at the messaging layer. Every output the model produces gets parsed through a strict schema. Any deviation raises an exception. The orchestrator catches the exception, logs it, and either retries with feedback to the model ("your last output didn't match the schema; here's the schema; try again") or routes the case to a human queue.
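
A sketch of that boundary with Pydantic. The TriageDecision fields are illustrative, and retry_with_feedback and route_to_human stand in for whatever retry and escalation plumbing you already have.

```python
from pydantic import BaseModel, Field, ValidationError

class TriageDecision(BaseModel):
    # Illustrative output schema for a support-triage agent.
    category: str
    priority: int = Field(ge=1, le=4)
    response_draft: str

def parse_or_escalate(raw_output: str, retry_with_feedback, route_to_human):
    try:
        return TriageDecision.model_validate_json(raw_output)
    except ValidationError as err:
        # First failure: tell the model exactly what was wrong and try once more.
        retried = retry_with_feedback(
            f"Your last output didn't match the schema: {err}. "
            f"Return JSON matching: {TriageDecision.model_json_schema()}"
        )
        try:
            return TriageDecision.model_validate_json(retried)
        except ValidationError:
            # Still malformed: fail loudly into the human queue, not silently downstream.
            route_to_human(raw_output)
            return None
```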

This sounds like ordinary engineering hygiene because it is. The reason to flag it: the MIT NANDA finding that 95% of agent deployments fail correlates strongly with skipped validation. Teams with weak schemas spend their first quarter chasing ghosts in production logs because malformed agent outputs poison downstream systems silently. Teams with strict schemas catch the same issues at the boundary and ship.

Observability that actually catches problems

The Cleanlab data is blunt: fewer than one in three production teams are satisfied with their observability tooling, and 63% plan to invest more in it next year. The investment goes to four things.

Per-run tracing. Every agent invocation produces a trace: the input received, every model call, every tool call, every retry, the final output, and the elapsed time at each step. You should be able to pull up any past run and see exactly what happened. Tools like Langfuse, Honeycomb, and Datadog APM (with custom instrumentation) are common.
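
If you're not ready for a dedicated tracing product, a homemade version can be as simple as one structured JSON line per run. A sketch, with illustrative field names:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent.trace")

class RunTrace:
    """Collects one record per agent run: every model call, tool call, and retry,
    with elapsed time, then emits a single JSON line you can query later."""

    def __init__(self, task_id: str):
        self.run_id = str(uuid.uuid4())
        self.task_id = task_id
        self.started = time.monotonic()
        self.steps: list[dict] = []

    def record(self, step_type: str, **details) -> None:
        self.steps.append({
            "type": step_type,   # e.g. "model_call", "tool_call", "retry"
            "elapsed_s": round(time.monotonic() - self.started, 3),
            **details,
        })

    def finish(self, status: str, output) -> None:
        log.info(json.dumps({
            "run_id": self.run_id,
            "task_id": self.task_id,
            "status": status,
            "output": output,
            "steps": self.steps,
        }))
```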

Eval pipeline. Per the previous article in this series: a held-out eval set, run on a cadence, with pass-rate trending on a dashboard. Without this you can't tell whether the agent is improving or drifting.

Cost tracking per task. Total tokens consumed, broken down by which model, which tool, which case. This is how you find the cost-per-task lever — usually routing easier cases to a smaller model.
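
The arithmetic itself is simple; the work is logging token counts per run so you can aggregate them. A sketch, with placeholder model names and per-million-token prices you'd replace with your provider's current rates:

```python
# Placeholder per-million-token prices; substitute your provider's current rates.
PRICE_PER_MTOK = {
    "big-model":   {"input": 3.00, "output": 15.00},
    "small-model": {"input": 0.25, "output": 1.25},
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = PRICE_PER_MTOK[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# Logged per run alongside the trace, then summed per case in PostgreSQL, this is
# the number that tells you which cases are cheap enough to route to a smaller model.
```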

Alerting on failure modes. Iteration count above a hard cap (loop detection), tool call error rate above a threshold (upstream API issues), schema validation failure rate above 1% (model drift), and per-eval-category pass rate dropping (specific behaviors degrading). Each gets its own alarm.
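
A sketch of those four alarms as threshold checks over aggregated metrics. The metric names and the notify hook are illustrative; the 1% schema-failure threshold comes from the text above, and the other numbers are placeholders you'd set from your own baseline.

```python
# Illustrative alert rules over a per-period metrics snapshot.
ALERT_RULES = {
    "iteration_cap_hits":         lambda m: m["capped_runs"] > 0,
    "tool_error_rate":            lambda m: m["tool_errors"] / max(m["tool_calls"], 1) > 0.05,
    "schema_validation_failures": lambda m: m["schema_failures"] / max(m["runs"], 1) > 0.01,
    "eval_category_regression":   lambda m: any(r < 0.90 for r in m["eval_pass_rates"].values()),
}

def check_alerts(metrics: dict, notify) -> None:
    for name, triggered in ALERT_RULES.items():
        if triggered(metrics):
            notify(name, metrics)   # page whoever owns the agent, with context attached
```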

What to skip until you've earned it

A counterweight to the dev guides that suggest you build all of this on day one. Most of these layers can be deferred for a real first version.

Skip the vector database until you've measured that context windows are insufficient. Skip the orchestration framework until you have three or more agents to coordinate. Skip the multi-region deployment until you have customers in multiple regions and latency data to support it. Skip the dedicated AI observability product until your homemade logging stops being enough.

The Cleanlab quote captures the rhythm well: "70% of regulated enterprises rebuild their AI agent stack every three months or faster." If you over-build infrastructure for v1, that infrastructure is what you'll be ripping out in Q2. Build small. Add when the data tells you to. The infrastructure pattern that survives is the one that grew incrementally from a working v1, not the one that was designed in a whiteboard session before any code shipped.

A working starting stack

For a team starting tomorrow, the minimum reliable stack is roughly this:

  • One model API (Claude Sonnet 4.6 or GPT-5)
  • A Python orchestrator service with Pydantic schemas at every boundary
  • Redis for in-flight state
  • PostgreSQL for durable records and the eval set
  • A simple queue (Redis-backed Sidekiq or BullMQ, or AWS SQS, depending on your stack)
  • Logging to a structured format your team can query
  • A dashboard with eval pass rate, tasks per day, p95 latency, and cost per task

That stack runs comfortably under $500 a month for low-to-mid volume agents and handles roughly 90% of what production agents need. Everything beyond it should be added when you can name the specific reason you need it. Until then, the boring version is the version that ships.

Frequently Asked Questions

How are AI agents deployed in production?

Production agent deployments converge on a similar pattern: an LLM (Claude or GPT) for reasoning, Redis for short-term session state, PostgreSQL for long-term memory and audit logs, optional vector storage for retrieval, queuing instead of always-on listeners, strict input/output validation, and observability covering both per-agent traces and aggregate eval metrics. The Google Cloud and Microsoft production guides describe the same shape from different angles.

Do I need a vector database to deploy an AI agent?

Often no. Modern context windows hold 600+ pages of text, which covers most first-version agents without retrieval. Add a vector database when measured context overflow forces the choice — not by default.

What's the cheapest reliable agent infrastructure stack?

For low-to-mid volume agents, a Python service calling the model API directly, Redis for state, PostgreSQL for records, queue-based triggers, and a lightweight observability tool runs comfortably under $500 a month including hosting. The model API is usually 10-20% of the total operational cost regardless of scale.

Written by Doreid Haddad

Founder, Tech10

Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.

