
How to Build an AI Agent Without Overengineering It

AI Agents · Mar 27, 2026 · 7 min read · Doreid Haddad

Build an AI agent by starting with a single LLM call, a clear task contract, and the smallest set of tools you can get away with — then add complexity only when the data tells you to. That sequence isn't a personal opinion; it's the conclusion of Anthropic's December 2024 review of dozens of customer agent builds, which found that "the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns."

The reason that finding matters: most "how to build an AI agent" content tells you the opposite. The current Google AI Overview lists n8n, Crew.ai, and AutoGen on the front page. The top YouTube videos in the SERP open by spinning up multi-node agent canvases. Beginner tutorials reach for orchestration frameworks before the first prompt is written. None of this is wrong, but it skips the part where you decide whether you need an agent at all.

This is the version that doesn't skip that part.

Step 0: Decide if you actually need an agent

Anthropic draws an architectural line that most tutorials blur. A workflow is a system where LLMs and tools run through predefined code paths. An agent is a system where the LLM dynamically decides which tools to use and what to do next. Workflows are predictable. Agents are flexible. They are not the same thing and they are not priced the same.

The honest first question is whether your task needs the dynamism of an agent. If the steps are fixed — read this email, classify it, file it — what you want is a workflow with an LLM step inside, not an agent. Build that. It will be cheaper to run, easier to debug, and faster to ship. The same Anthropic post says it directly: "for many applications, optimizing single LLM calls with retrieval and in-context examples is usually enough."
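To make the line concrete, here is a minimal sketch of that email example as a workflow: the code path is fixed, and the model fills exactly one step. It assumes the Anthropic Python SDK; the model id, the categories, and the `file_ticket` function are placeholders for whatever your stack actually uses.

```python
# A workflow, not an agent: every step is predefined code; the LLM fills one slot.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def classify(email_body: str) -> str:
    """The single LLM step inside an otherwise fixed pipeline."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder id; any capable model works here
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": "Classify this support email as exactly one of: "
                       f"billing, bug, other. Reply with the label only.\n\n{email_body}",
        }],
    )
    return response.content[0].text.strip().lower()


def file_ticket(email_body: str, category: str) -> None:
    # Stand-in for your ticketing system. The point: this path never changes.
    print(f"Filed under '{category}'")


def handle_email(email_body: str) -> None:
    category = classify(email_body)    # the model decides the label...
    file_ticket(email_body, category)  # ...predefined code decides what happens next
```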

A simpler test, borrowed from how Matt Wolfe frames it in his 2026 walkthrough: ask whether 90% accuracy would be acceptable. Low-precision tasks (research summaries, draft emails, content triage) are the right starting point for an agent. High-precision tasks (refunds, accounting, anything regulated) need 98%+ accuracy, which Wolfe's data suggests takes around six months of edge-case work to reach. Don't start there.

Step 1: Write the contract before you write the prompt

Three lines, on a wall, before any code. Input: what comes in. Output: what goes out. Done: how you know the agent finished. A working contract for a support triage agent: input = customer email plus account record. Output = (category, draft reply, confidence score). Done = either the three fields exist or the case routes to a human.
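The contract can also live next to the code as a schema, so the eval set and the output validation in Step 3 read from one definition. A minimal sketch with Pydantic; the field names mirror the triage contract above and are otherwise illustrative.

```python
from typing import Literal

from pydantic import BaseModel, Field


class TriageInput(BaseModel):
    """Input: what comes in."""
    email_body: str
    account_record: dict


class TriageOutput(BaseModel):
    """Output: what goes out. Done: this validates, or the case routes to a human."""
    category: Literal["billing", "bug_report", "feature_request", "other"]
    draft_reply: str
    confidence: float = Field(ge=0.0, le=1.0)
```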

The contract drives the eval set, which drives every change after launch. Skip it and you're tuning prompts based on vibes. Most of the stalled agent projects I've watched hit that wall in week three.

Step 2: One model, no framework

The AI Overview suggests starting with frameworks. The Anthropic data suggests the opposite. Their explicit advice: "we suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code."

For practical purposes that means: pick one model, write a Python script, call the API. Claude Sonnet 4.6 handles the common case for most mid-market agents — tool use, structured outputs, decent reasoning, fast enough for interactive workflows. GPT-5 is fine too. The model is 10-20% of the project's total cost; the workflow design and the human review around it are the rest. Picking the model is rarely the decision that determines whether the project ships.
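In practice, that is a script this small. A sketch assuming the Anthropic Python SDK and the triage contract from Step 1; the model id is a placeholder, and the same shape works with the OpenAI SDK if you standardize on GPT-5 instead.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "You are a support triage assistant. Return JSON with exactly these keys: "
    "category (billing | bug_report | feature_request | other), draft_reply, "
    "confidence (a number between 0 and 1)."
)


def triage(email_body: str, account_record: dict) -> str:
    """One model, one call, no framework. Returns the model's raw JSON string."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder id; pick whichever model you standardize on
        max_tokens=1024,
        system=SYSTEM,
        messages=[{
            "role": "user",
            "content": f"Account record: {account_record}\n\nEmail:\n{email_body}",
        }],
    )
    return response.content[0].text
```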

What you're skipping at this stage:

  • No vector database. A 200K-token context window holds about 600 pages of text. Most first-version agents fit.
  • No orchestration framework. One agent doesn't need one. A loop with a tool list does the same job in 80 lines of Python.
  • No agent personality or branded chat UI. The agent is software. Treat it like software.
  • No fallback model. You haven't measured failure rates yet.

Step 3: Three tools, named narrowly

Tools are functions the model can call. Read a customer record. Send an email. Look up an order. Each one is a security boundary, a maintenance commitment, and a slot of attention the model has to manage. The Berkeley Function-Calling Leaderboard has shown for two years that function-calling accuracy degrades as the tool list grows past about a dozen.

Pick three. Read, write, decide is the canonical set (sketched in code after the list):

  • A read tool fetches the context the agent needs.
  • A write tool produces a structured output the rest of your system can consume.
  • A decide tool routes the case to the right next step.
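Declared for the Anthropic Messages API, that set looks roughly like the sketch below. The tool names, fields, and descriptions are illustrative, not a prescription; the read schema is deliberately small.

```python
# Three tools, named narrowly, in the Anthropic tool-use format (JSON Schema inputs).
TOOLS = [
    {
        "name": "read_customer_record",
        "description": "Fetch the handful of account fields relevant to triage for one customer.",
        "input_schema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
    },
    {
        "name": "write_triage_result",
        "description": "Store the structured triage result (category, draft reply, confidence).",
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["billing", "bug_report", "feature_request", "other"],
                },
                "draft_reply": {"type": "string"},
                "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            },
            "required": ["category", "draft_reply", "confidence"],
        },
    },
    {
        "name": "decide_route",
        "description": "Route the case: auto-send the draft reply or escalate to a human.",
        "input_schema": {
            "type": "object",
            "properties": {
                "route": {"type": "string", "enum": ["auto_send", "human_review"]},
            },
            "required": ["route"],
        },
    },
]
```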

Trim the read returns aggressively. A customer record with 80 fields will inflate token usage and confuse the model when you only needed five fields. Validate every output through a strict schema — Pydantic in Python, Zod in TypeScript. If the output doesn't match the schema, fail loudly. Don't pass malformed objects downstream.
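The validation itself is a few lines. A minimal sketch with Pydantic, reusing the `TriageOutput` model from the Step 1 sketch; `log_for_review` is a placeholder for whatever your escalation path is.

```python
from pydantic import ValidationError

# TriageOutput is the Pydantic model from the Step 1 contract sketch.


def log_for_review(raw_json: str, err: ValidationError) -> None:
    # Stand-in for your escalation path (queue, ticket, alert).
    print(f"Escalating to human review: {err}")


def parse_or_escalate(raw_json: str) -> TriageOutput | None:
    """Validate the model's output against the contract; never pass junk downstream."""
    try:
        return TriageOutput.model_validate_json(raw_json)
    except ValidationError as err:
        log_for_review(raw_json, err)  # fail loudly, route to a human
        return None
```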

Step 4: Ship to one user in week two

Long internal development is where agent projects go to die. You write prompts you'll never use. You handle edge cases nobody hit. Real users break the system in ways you couldn't have predicted, which is exactly what you need.

Pick the cleanest 80% of inputs. Build the smallest version. Open it to one real person in week two. The first day of real input will tell you more than the previous month of speculation. Wolfe calls this graduated autonomy — full visibility at first, then add automation as reliability is proven. It's the same pattern.

The 10/20/70 rule that keeps getting asked about

People Also Ask shows the same question on this SERP repeatedly: what is the 10/20/70 rule for AI? The shorthand, popularized by McKinsey, says 10% of the work is the algorithm or model, 20% is the technology stack around it, and 70% is the people, processes, and adoption. For agent projects specifically, that ratio holds well in my experience. The model is rarely the thing that decides whether the project succeeds. What decides is whether the team defined the workflow honestly, set up evals, and ran the operational discipline after launch. That's the 70.

If your roadmap has the 10 and ignores the 70, the project will stall. If it has the 70 and the 10 is "use whatever frontier model fits," it'll ship.

When you do need the heavy machinery

There's a moment when frameworks earn their seat. Per Anthropic's own patterns:

  • Routing pattern (cheap classifier sends inputs to specialized downstream agents) earns its keep when workloads have tiered difficulty and model cost matters; a minimal sketch follows this list.
  • Pipeline / orchestrator-workers patterns earn their seat when steps are genuinely sequential and need different expertise.
  • Parallelization (sectioning or voting) earns its seat when subtasks are independent and you need speed or higher confidence.
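Even then, the first of these is less machinery than it sounds. A minimal sketch of the routing pattern against the Anthropic SDK; the model ids are placeholders, and the easy/hard split is illustrative.

```python
import anthropic

client = anthropic.Anthropic()

CHEAP_MODEL = "your-small-fast-model-id"     # placeholder: the classifier
STRONG_MODEL = "your-most-capable-model-id"  # placeholder: reserved for hard cases


def route(email_body: str) -> str:
    """Tiny classification call: is this case easy or hard?"""
    response = client.messages.create(
        model=CHEAP_MODEL,
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": "Is this support email 'easy' (FAQ-level) or 'hard' "
                       f"(needs judgment)? Reply with one word.\n\n{email_body}",
        }],
    )
    return response.content[0].text.strip().lower()


def handle(email_body: str) -> str:
    # Easy cases never pay for the expensive model.
    model = CHEAP_MODEL if route(email_body) == "easy" else STRONG_MODEL
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Draft a reply to:\n\n{email_body}"}],
    )
    return response.content[0].text
```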

Until your single-agent prototype is in production and you can name the specific reason it's hitting a ceiling, none of these are the right move. The OpenAI practical guide to building agents recommends a similar progression: start with one agent and expand only when you've documented why. It also notes that an approach that works well is to establish a baseline with the most capable model, then test whether smaller models can clear that bar. Same logic, different vendor, same answer.

A working build sequence for week one

If you're starting tomorrow, here is the literal sequence I would run with a team:

  1. Pick one workflow your team runs every week. Write the contract in three lines.
  2. Build a 50-example test set from real production cases. This is your eval.
  3. Write a single prompt for one model — Claude Sonnet 4.6 or GPT-5. No framework yet.
  4. Add three tools, each defined with a typed schema for output validation.
  5. Wrap the call in a Python loop with a hard cap on iterations (10 is reasonable); a sketch of the loop follows this list.
  6. Run the eval, fix the prompts that fail, re-run.
  7. Ship to one real user. Watch what breaks.
  8. Iterate from real failures, not imagined ones.
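
Steps 4 and 5 together are the whole agent: a loop, a tool list, and a hard cap. A minimal sketch against the Anthropic Messages API; `tools` is the list of schemas from Step 3, `tool_handlers` maps each tool name to a plain Python function, and the model id is a placeholder.

```python
import anthropic

client = anthropic.Anthropic()
MAX_ITERATIONS = 10  # hard cap: the loop can never run away


def run_agent(user_message: str, tools: list, tool_handlers: dict) -> str:
    """Call the model, execute any tool it requests, feed the result back,
    and stop when it answers in plain text or hits the iteration cap."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(MAX_ITERATIONS):
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder id
            max_tokens=1024,
            tools=tools,                # the three tool schemas from Step 3
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # The model answered directly; we're done.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute every tool call the model made and append the results.
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = tool_handlers[block.name](**block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(output),
                })
        messages.append({"role": "user", "content": results})
    return "Hit the iteration cap; escalate to a human."
```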

Two weeks. That's the version that ships. The version with LangGraph, Crew.ai, a vector store, and a "personality prompt" usually doesn't.

The most powerful AI is powerful because of the thinking behind it. The agents that work in production are the ones built by teams who decided what "working" meant before they decided which framework to use. Build small. Validate the simple thing. Add complexity only when the data demands it. That's the version of "how to build an AI agent" the AI Overview leaves out — and the version that, according to the people whose business it is to know, actually ships.

Frequently Asked Questions

Do I need a framework like LangChain, Crew.ai, or AutoGen to build my first agent?

No. Anthropic's review of dozens of agent builds explicitly recommends starting with LLM APIs directly because frameworks "create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug." Many patterns can be implemented in a few lines of code. Add a framework only when you can name a specific reason your simple version is hitting a ceiling.

Can I build an AI agent without coding?

Yes — platforms like Zapier and n8n let you build production-grade agents through visual interfaces with no code. They are often the right tool for sales triage, lead enrichment, and other low-precision workflows. They are usually the wrong tool when you need deep customization, version control, or strict latency budgets.

What is the 10/20/70 rule for AI projects?

It's a McKinsey framing: 10% of the work is the algorithm or model, 20% is the technology stack, and 70% is people, processes, and adoption. For agent projects, this ratio holds well in practice — the model rarely decides whether a project ships. The 70 (workflow definition, eval discipline, operational handover) does.

How long should the first version of an AI agent take to build?

Two weeks to a working version with one user. Anything longer and you're probably building too much before reality has a chance to correct your assumptions. Use the first two weeks to prove the contract, ship to one person, and iterate from real failures.

Written by Doreid Haddad

Founder, Tech10

Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.

