
The Multi-Agent Architecture: When One AI Isn't Enough

AI Agents · Mar 29, 2026 · 7 min read · Doreid Haddad

In June 2025, Anthropic published the engineering retrospective for their multi-agent Research feature. Buried in the post is a number that should anchor every multi-agent architecture decision in 2026: a system with Claude Opus 4 as lead agent and Claude Sonnet 4 as subagents outperformed single-agent Claude Opus 4 by 90.2% on Anthropic's internal research evaluation. That's not a marginal improvement. That's the difference between a system that works and one that doesn't.

The same post names the price. Multi-agent systems use roughly 15× more tokens than chat interactions; single agents alone use about 4× more than chat. So the 90.2% performance gain comes with roughly a 4× token premium over a single agent and a 15× premium over the cheapest possible alternative. Multi-agent is powerful and expensive. The real decision is when the power is worth the price.

This article is the cost-and-value frame for when multi-agent architecture earns its seat — and when it doesn't. The Google AI Overview for "multi-agent architecture" lists hierarchical, swarm, and sequential patterns and frames the technology in pure-positive terms. The honest version has a cost line under each benefit.

What multi-agent architecture actually is

A multi-agent system, per IBM's working definition, consists of multiple AI agents working collectively to perform tasks. In LLM-era practice, that means multiple language model invocations coordinating — often with specialized roles, sometimes with separate context windows, usually with structured handoffs.

The shape that matters for architecture decisions: a lead agent (or orchestrator) decides what needs to happen and delegates subtasks to subagents. The subagents work in parallel. The lead agent collects results and synthesizes a final answer. Anthropic's Research feature uses this exact pattern — they call it orchestrator-worker. Microsoft's Azure Architecture Center pattern guide describes the same structure. AWS's multi-agent orchestration guidance walks through implementation. The convergence isn't accidental. The orchestrator-worker pattern is what most production multi-agent systems converge on.
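
To make the shape concrete, here is a minimal sketch of the orchestrator-worker loop in Python. Everything in it is illustrative: call_model is a hypothetical stub for whatever LLM API you use, and the model names and prompts are placeholders, not Anthropic's implementation.

import asyncio

async def call_model(model: str, prompt: str) -> str:
    # Stub for an LLM call. Replace with a real API client;
    # this signature is an assumption, not a specific SDK.
    await asyncio.sleep(0)
    return f"[{model} response to: {prompt[:40]}...]"

async def run_research(question: str) -> str:
    # 1. The lead agent decomposes the question into independent subtasks.
    plan = await call_model(
        "lead-model",
        f"Split this into independent subtasks, one per line:\n{question}",
    )
    subtasks = [line for line in plan.splitlines() if line.strip()]

    # 2. Subagents work the subtasks in parallel, each in its own context window.
    findings = await asyncio.gather(*(
        call_model("worker-model", f"Research and return compressed findings:\n{task}")
        for task in subtasks
    ))

    # 3. The lead agent synthesizes the compressed findings into one answer.
    return await call_model(
        "lead-model",
        f"Question: {question}\nFindings:\n" + "\n---\n".join(findings),
    )

print(asyncio.run(run_research("Compare the top three vector databases")))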

Why multi-agent works (when it works)

Anthropic's analysis of their Research feature isolates the mechanism. Three factors explain 95% of performance variance on the BrowseComp evaluation: token usage explains 80% on its own, with tool calls and model choice explaining the rest. The deeper interpretation: multi-agent systems work mainly because they let the system spend enough tokens on the problem.

A single agent, no matter how capable, hits two structural limits. Context-window saturation: a 200K-token window holds about 600 pages of text, which sounds enormous until you're researching a topic that genuinely requires reading and synthesizing more. And serial reasoning: a single agent does one thing at a time. Some research questions decompose into independent sub-questions that can run simultaneously, and each additional subagent brings its own full context window and its own parallel thread of work.

The economic restatement: multi-agent gives you more compute aimed at the problem in less wall-clock time. When the value of the answer justifies the spend, that's a great trade. When it doesn't, you're just burning money in parallel.

When multi-agent earns its seat

Three patterns where the math works.

Heavy parallelization with independent subproblems. Research questions that fan out into multiple independent threads. Comparative analyses across many entities (e.g., "find the board members for every IT company in the S&P 500" — Anthropic's specific example, which their multi-agent system solved and a single agent failed to solve). Multi-source verification where each source can be checked in parallel.

Information that exceeds a single context window. When the relevant context for a single answer genuinely doesn't fit in 200K tokens, subagents with their own context windows are how you scale. Each subagent reads a chunk, distills the relevant findings, returns a compressed result to the lead agent. The lead agent works on the synthesized findings, never seeing the raw data.
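
A back-of-the-envelope sizing sketch, with every number an assumption rather than a measured value: how many 200K-window subagents a hypothetical corpus needs, and what the lead agent actually sees after each subagent compresses its chunk.

WINDOW = 200_000            # tokens per subagent context window
OVERHEAD = 20_000           # assumed prompt, instructions, and tool schemas
CHUNK = WINDOW - OVERHEAD   # raw tokens each subagent can actually read

corpus_tokens = 1_400_000   # hypothetical corpus, 7x one window
n_subagents = -(-corpus_tokens // CHUNK)   # ceiling division -> 8

# Assume each subagent compresses its chunk roughly 20:1 before reporting,
# so the lead synthesizes ~70K tokens instead of reading 1.4M raw.
lead_input = corpus_tokens // 20

print(n_subagents, lead_input)   # 8 subagents, 70_000 tokens for the lead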

Tasks where specialization beats generalization. When the work involves several different kinds of expertise — a contracts lawyer for one part, a tax accountant for another, a customer rep for a third — separate agents with separate prompts and tools outperform one agent juggling everything.

The common factor: high task value (because the cost is real), parallelizable work (because that's what multi-agent unlocks), and complexity that genuinely exceeds what one prompt can handle.

When multi-agent is the wrong answer

Anthropic is unusually direct about the exclusions in the same post. "Most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time." That's a deliberate non-recommendation from the company that publishes Claude Code.

Three patterns where multi-agent makes the system worse.

Tasks with many dependencies between agents. If subagent B can't start until subagent A is done, parallelism is illusory and you've added orchestration cost for no benefit. Build a pipeline (sequential workflow), not a multi-agent system.
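
A sketch of the contrast, with illustrative stages and a stubbed model call: each step consumes the previous step's output, so there is nothing to run in parallel and no orchestrator to pay for.

def call_model(prompt: str) -> str:
    # Stub for an LLM call; replace with a real API client.
    return f"[model response to: {prompt[:40]}...]"

def review_pipeline(document: str) -> str:
    # Each stage depends on the one before it. This is a sequential
    # workflow, not a multi-agent system, and that's the point.
    summary = call_model("Summarize the key claims:\n" + document)
    critique = call_model("List weaknesses in these claims:\n" + summary)
    return call_model("Rewrite the summary to address:\n" + critique)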

Tasks requiring shared evolving context. When the agents need to operate on the same continuously changing context — a long programming session, a multi-turn conversation that builds on prior turns — the cost of synchronizing context across agents usually exceeds the benefit of parallelism.

Tasks where the model is the bottleneck, not context or parallelism. If a single Claude Opus 4 can produce a correct answer with the right prompt, adding agents doesn't help. The performance ceiling is on the model, not on the architecture.

The pattern under all three: if the task isn't shaped like "many independent things at once," multi-agent doesn't earn its seat.

The economic ceiling

Anthropic's published cost ratios are the constraint to internalize. Multi-agent uses ~15× the tokens of a chat interaction. For a system that runs through 4,000 monthly tasks, that scales fast.

Concrete: at $5 per million input tokens and $25 per million output tokens (rough mid-2026 frontier pricing), a multi-agent research task that uses 100K tokens of input and 5K tokens of output costs about $0.625. Multiply by 4,000 monthly tasks and the model bill is $2,500. The same workflow as a single-agent system costs roughly $170 — a 15× difference. The performance gain has to be worth $2,330 a month, every month.
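
The arithmetic, spelled out with the article's own assumed prices and volumes:

IN_PRICE = 5 / 1_000_000     # $ per input token (assumed mid-2026 pricing)
OUT_PRICE = 25 / 1_000_000   # $ per output token

per_task = 100_000 * IN_PRICE + 5_000 * OUT_PRICE   # $0.625 per multi-agent task
monthly_multi = per_task * 4_000                     # $2,500 per month
monthly_single = monthly_multi / 15                  # ~$167 at 1/15 the tokens

print(per_task, monthly_multi, round(monthly_multi - monthly_single))
# 0.625  2500.0  2333  -> the monthly gap the performance gain must justify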

That's why Anthropic explicitly notes that "for economic viability, multi-agent systems require tasks where the value of the task is high enough to pay for the increased performance." Research that drives investment decisions, due diligence on acquisitions, deep competitive analysis — these have unit economics that justify the spend. Generic customer support questions don't. Match the architecture to the task value, not to the marketing.

A working decision rule

When you're sizing a multi-agent system, four questions in order.

1. Is the task genuinely parallelizable? Can you decompose it into 3+ subtasks that don't depend on each other's outputs? If no, build a single-agent system or a sequential pipeline.

2. Does the task value support 4-15× the cost of a chat-style alternative? Multiply your expected single-agent token spend by 15 and ask whether you'd still ship at that price. If no, the multi-agent version isn't economically viable.

3. Does the work exceed what a single context window can hold? If the relevant data fits in 200K tokens, you probably don't need multi-agent for context-capacity reasons.

4. Is there genuine specialization? Different subtasks needing different prompts, different tools, different domain expertise? If yes, multi-agent often pays for itself in quality. If no, one good agent handles it.

If three or four answers are yes, build it. The orchestrator-worker pattern is your starting point, not the swarm or the peer-to-peer mesh. If two or fewer answers are yes, the simpler architecture is the right choice. Most teams who build multi-agent systems prematurely are answering yes to "is it cool?" — which doesn't appear in the decision rule.
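
The rule is mechanical enough to write down. A literal encoding, treating questions 1 and 2 as hard gates and the overall threshold as three of four, exactly as stated above:

def choose_architecture(parallelizable: bool, value_covers_15x: bool,
                        exceeds_one_context: bool, needs_specialists: bool) -> str:
    if not parallelizable:        # Q1: no parallelism, no multi-agent
        return "single agent or sequential pipeline"
    if not value_covers_15x:      # Q2: can't pay 15x, not economically viable
        return "single agent or sequential pipeline"
    yes = sum([parallelizable, value_covers_15x,
               exceeds_one_context, needs_specialists])
    if yes >= 3:                  # three or four yes answers: build it
        return "multi-agent (orchestrator-worker)"
    return "single agent or sequential pipeline"

# e.g. broad due diligence: parallel, high value, exceeds one window, no specialists
print(choose_architecture(True, True, True, False))   # multi-agent (orchestrator-worker)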

What this means for your roadmap

The most successful multi-agent systems in production share a development arc: they started as a working single-agent system, ran into a specific limit, and added a second agent to address that specific limit. The architecture grew because the data demanded it, not because somebody drew it on a whiteboard. Anthropic's Research feature didn't start as a multi-agent system — it became one when the team measured that single-agent was leaving 90% performance on the table for the kind of research questions users were asking.

If you're considering multi-agent for a new project, the strongest move is usually to ship the single-agent version first, measure where it breaks, and then add the multi-agent layer to address the specific failure. The teams who do this end up with multi-agent systems that earn their token bill. The teams who design multi-agent first usually end up with multi-agent systems they can't justify when finance asks why the AI line item is 15× larger than the comparable single-agent project across the hall.

Multi-agent architecture is a powerful tool. It's also a 4-15× cost multiplier. Use it where the math works. Skip it where it doesn't. The architecture earns its seat with the unit economics, not with the diagram.

Frequently Asked Questions

Does multi-agent architecture actually outperform single-agent?

On the right problems, yes — and by published margins. Anthropic reports a multi-agent system with Claude Opus 4 as lead and Claude Sonnet 4 as subagents outperformed single-agent Claude Opus 4 by 90.2% on their internal research evaluation. The trade-off is that multi-agent systems use roughly 15× more tokens than chat interactions, so they only earn their seat when the task value justifies the cost.

When is single-agent the right answer?

When the task is shaped like a single coherent reasoning chain that a frontier model can handle in one prompt — most coding tasks, most one-shot generation, and tasks where the steps depend on each other and can't be parallelized. Anthropic explicitly notes that multi-agent is a poor fit for coding because there's less parallelism than in research.

What's the simplest multi-agent pattern to start with?

The orchestrator-worker pattern — one lead agent coordinates, several specialized subagents handle parallel subtasks. Anthropic's Research feature uses exactly this shape. Avoid swarm or peer-to-peer patterns until you have a specific reason.

Written by Doreid Haddad

Founder, Tech10

Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.

