From Prompting to Agent Design: What Changed for People Building with AI

The day-to-day work of building with AI changed more between 2023 and 2026 than most job descriptions have caught up to. If your mental model is still "I write a good prompt and the model does the rest," you're working with a map from two versions ago. The work is no longer about writing one clever instruction. It's about designing a system that plans, acts, checks, and recovers. Most of the craft lives in the gaps between those steps.
This is a practical guide for people who were good at prompting generative AI and now have to build things where the model is the engine, not the product. If you want the big-picture shift from generative to agentic, we covered that here. This article is about the skill change. What you used to do, what you do now, and where to focus if you only have limited time to level up.
The short version
Prompting was writing a sentence to get a good answer. Agent design is writing a system that can run for 20 minutes without a human, touch five different tools, notice when something's off, and stop before it does damage. The model is still doing the language part. You are now responsible for everything around it.
That's the shift. The rest of this article is what that means in practice.
Five mental-model shifts that actually matter
Shift 1: From turns to trajectories
In prompting, you thought about single exchanges. Input, output, done. Maybe you iterated a prompt a few times until the model responded well, and then you froze the prompt and moved on.
Agent design is about trajectories. A trajectory is the full sequence of decisions and tool calls the system makes on a task: step one, observed result, step two, observed result, and so on until the goal is reached or abandoned. Good agents have good trajectories. Bad agents have trajectories that look sane in the first two steps and then drift into nonsense by step six.
What this changes for you: you don't evaluate outputs anymore. You evaluate trajectories. The question isn't "is this answer correct." The question is "at every decision point, did the agent make a reasonable choice given what it knew at that moment." That's a different skill, and it requires different tools. You can't just eyeball the final answer.
The move: log every decision the agent makes, not just the final output. Read five full trajectories end-to-end before you ship anything. I promise you will find surprises.
Shift 2: From prompts to system prompts plus tools plus memory
A prompt used to be one block of text. Agent systems use at least three layers:
- The system prompt defines who the agent is, what it can do, what it's not allowed to do, and how it should think about the task.
- The tool definitions describe the capabilities the agent has: what each tool does, what parameters it takes, and when to use it. The model picks from this menu.
- The memory holds state across steps: the plan, prior tool outputs, running notes, the original user goal.
When something goes wrong, it's usually not the model being dumb. It's one of these three layers being unclear, inconsistent, or missing information the agent needs. A great prompt writer in 2023 could fix almost anything by rewriting the prompt. A great agent designer in 2026 knows which of the three layers to touch.
The insight: in my experience, the most common failure isn't the model misunderstanding the task. It's the tool definitions being sloppy. The model picks the wrong tool because the descriptions of tools A and B overlap, and then it can't recover because the tool names don't match the mental model it built from the system prompt. Fix the tool names and descriptions first. It solves more bugs than any amount of prompt rewriting.
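To make the overlap problem concrete, here's a sketch contrasting sloppy and disambiguated tool definitions. The schema shape loosely mirrors common function-calling APIs, but every name and description here is invented for illustration.

```python
# Sloppy: the two descriptions overlap, so the model has no basis to choose.
overlapping_tools = [
    {"name": "search", "description": "Search for information.",
     "parameters": {"query": {"type": "string"}}},
    {"name": "lookup", "description": "Look up information.",
     "parameters": {"query": {"type": "string"}}},
]

# Better: each description says what the tool covers, when to use it,
# and when NOT to use it, so the model can choose deterministically.
disambiguated_tools = [
    {"name": "search_web",
     "description": ("Search the public web for recent or external information. "
                     "Use for anything not in the internal knowledge base."),
     "parameters": {"query": {"type": "string"}}},
    {"name": "lookup_customer_record",
     "description": ("Fetch a customer's record from the internal CRM by ID. "
                     "Do NOT use for general questions; it only accepts IDs."),
     "parameters": {"customer_id": {"type": "string"}}},
]
```

Notice that the fix is entirely in the names and descriptions; the parameters barely changed.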
Shift 3: From "does it work on my test" to "does it work on the eval set"
Prompt engineering often ended when the prompt worked on three or four examples you tried by hand. That was fine for generative tasks where the stakes were low and a human was in the loop anyway.
For agents, that threshold is nowhere near enough. An agent running 50 times a day on real inputs is going to hit edge cases you never imagined. The only way to catch them is an evaluation set: a labeled collection of inputs and expected behavior that you run against the system every time something changes. Every model upgrade. Every tool change. Every system prompt rewrite.
The eval set is boring to build. It's the single most important thing you'll make. A team with a mediocre agent and a great eval set will ship something better than a team with a great agent and no eval set. The first team can improve. The second team is guessing.
The trap: writing the eval set after you've built the agent. At that point, you'll unconsciously pick cases the agent already handles. Write the eval set first, before you've seen any outputs. That's the only way it tells you something honest.
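An eval harness doesn't need to be fancy. Here's a minimal sketch: it assumes you supply an `agent` callable and a `grade` function, and the exact-match grading shown in the usage example is only a starting point, not a recommendation.

```python
from collections import Counter

def run_eval(agent, eval_set, grade):
    """Run the agent over every labeled case and tally failures by category.

    agent:    callable taking an input and returning the system-level output
    eval_set: list of {"input": ..., "expected": ..., "category": ...}
    grade:    callable (output, expected) -> bool
    """
    failures = Counter()
    for case in eval_set:
        output = agent(case["input"])
        if not grade(output, case["expected"]):
            failures[case.get("category", "uncategorized")] += 1
    total = len(eval_set)
    passed = total - sum(failures.values())
    return {"passed": passed, "total": total,
            "failures_by_category": dict(failures)}

# Usage with a toy agent and a two-case eval set:
toy_set = [
    {"input": "2+2", "expected": "4", "category": "arithmetic"},
    {"input": "capital of France", "expected": "Paris", "category": "facts"},
]
report = run_eval(lambda x: "4" if x == "2+2" else "Lyon",
                  toy_set, grade=lambda out, exp: out == exp)
# report["passed"] == 1; the "facts" category has one failure.
```

The failure-by-category tally is what makes the eval set actionable: it tells you where to spend your next fix, not just a pass rate.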
Shift 4: From single-turn thinking to loop design
Agentic systems run in loops. Perceive, plan, act, verify, repeat. The loop is the core of the system. If the loop is wrong, no amount of prompt tuning will save it.
Questions you have to answer that didn't exist in a pure prompting world:
- When does the loop stop? Task complete? Budget exhausted? Maximum iterations hit? What's the default?
- What counts as "task complete" in a way the agent itself can verify without a human?
- When the agent gets stuck, does it retry, back off, escalate to a human, or give up?
- What happens if a tool call fails? Retry with the same input, different input, or different tool?
- How does the agent detect that it's stuck in a cycle, repeating the same action and getting the same result, rather than making progress through a planned sequence of steps?
Good loop design is less about the language model and more about boring engineering: state machines, exponential backoff, circuit breakers, timeouts. If you're coming from a prompting background, this is probably the skill gap you need to close most urgently. The model is 20% of the system. The loop is 40%. Everything else is 40%.
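Here's a sketch of that boring engineering in one place: an iteration cap, bounded tool retries with exponential backoff, and a named reason for every exit path. The callables (`plan_step`, `execute`, `is_done`) are placeholders for whatever your system provides, and the numbers are illustrative.

```python
import time

MAX_ITERATIONS = 12
MAX_TOOL_RETRIES = 2

def run_loop(plan_step, execute, is_done, goal):
    """Perceive -> plan -> act -> verify, with explicit termination reasons."""
    state = {"goal": goal, "history": []}
    for _ in range(MAX_ITERATIONS):
        step = plan_step(state)                      # plan: what to do next
        if step is None:
            return "gave_up", state                  # planner sees no way forward
        result, err = None, None
        for attempt in range(MAX_TOOL_RETRIES + 1):  # act, with bounded retries
            try:
                result = execute(step)
                err = None
                break
            except Exception as e:
                err = e
                time.sleep(2 ** attempt * 0.1)       # exponential backoff
        if err is not None:
            return "tool_failed", state              # escalate instead of spinning
        state["history"].append((step, result))
        if is_done(state):                           # verify: agent-checkable done test
            return "success", state
    return "budget_exhausted", state                 # hit the iteration cap
```

The point isn't these exact numbers. It's that every exit path has a name, so a trajectory log can tell you exactly why a run stopped.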
Shift 5: From asking to constraining
In prompting, you got better results by being clearer and more specific in what you asked the model to do. In agent design, you get better results by being clearer and more specific in what you don't allow the model to do.
Constraints are the load-bearing part. Examples:
- The agent can use any tool in the menu, but cannot call tool X without first calling tool Y.
- The agent can issue refunds up to $50 without human review. Anything above escalates.
- The agent can only touch systems on an allow-list. Any system not on the list is invisible to it.
- If tool Z has failed twice on the same input, the agent gives up and escalates.
Every constraint is a guardrail that keeps the system sane when the model makes a bad choice. Because it will. The whole point of agent design is assuming the model will sometimes be wrong and building a system that stays safe anyway.
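Constraints like these can live in plain code that wraps every tool call, entirely outside the model. A sketch, using the example rules above; the tool names, thresholds, and helper are all illustrative:

```python
ALLOWED_TOOLS = {"lookup_order", "issue_refund", "escalate_to_human"}
REFUND_LIMIT = 50.00
MAX_FAILURES_PER_INPUT = 2

failure_counts = {}  # (tool, input_key) -> consecutive failure count

def check_constraints(tool, args, history):
    """Return (allowed, reason). Runs before every tool call, outside the model.

    history is a list of (tool_name, args) tuples for prior calls in this run.
    """
    if tool not in ALLOWED_TOOLS:
        return False, f"{tool} is not on the allow-list"
    if tool == "issue_refund":
        if not any(t == "lookup_order" for t, _ in history):
            return False, "must call lookup_order before issue_refund"
        if args.get("amount", 0) > REFUND_LIMIT:
            return False, f"refund over ${REFUND_LIMIT:.2f} requires human review"
    key = (tool, str(sorted(args.items())))
    if failure_counts.get(key, 0) >= MAX_FAILURES_PER_INPUT:
        return False, "tool failed twice on this input; escalating"
    return True, "ok"
```

Because the check runs in ordinary code, it holds even when the model makes a bad choice. The guardrail doesn't depend on the model agreeing to follow it.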
The skill stack in 2026
Here's what the job actually looks like now, ordered roughly by how much time it takes.
| Skill | Share of time | What it used to be |
|---|---|---|
| Designing and writing eval sets | 20-25% | Testing prompts by hand |
| Tool design (naming, descriptions, parameters) | 15-20% | Didn't exist |
| Loop and state management | 15-20% | Didn't exist |
| System prompt writing | 10-15% | 80% of the job |
| Observability and logging | 10-15% | Reading the one output |
| Human-in-the-loop design | 10% | Didn't exist |
| Cost monitoring | 5-10% | Didn't exist |
Prompt writing is still in there. It's just 10-15% of the work, not 80%. If you are still spending most of your time wording and rewording the prompt, you're solving the wrong problem.
Where to start if you only have a week
If you're a good prompt engineer and you want to be a good agent designer, here's the short path. It assumes you've already built small things with a language model and know your way around one API.
Day 1-2: Build a tiny eval set. Pick one task. Write 25-50 input-output pairs by hand. Don't use the model to generate them; do it yourself. This is the hardest and most valuable thing you'll do. Everything downstream depends on having ground truth.
Day 3: Read someone else's agent code. Find an open-source agent project that actually runs in production, not a tutorial. Read the loop logic. Not the prompts. The loop. See how they handle tool failures, retries, and termination.
Day 4-5: Build a two-tool agent. Pick a task with exactly two tools. Make it work. Make it fail, then make it recover. Log every trajectory. Read the logs.
Day 6: Run your agent against your eval set. Count the failures. Categorize them. This is the first honest look you'll get at how your agent performs, because your intuition is useless and the three examples you tested by hand are a lie.
Day 7: Fix the top three failure categories. Not by rewriting the prompt. By changing a tool description, adding a constraint, or fixing the loop logic. Run the eval set again.
At the end of the week, you're a junior agent designer. The rest is volume. You need to have built five or six of these before the craft starts to feel natural, and you'll change your mental model at least twice along the way.
What not to do
A few traps that cost real time.
Don't skip the eval set because the task is "simple." Every simple task has edge cases. Without an eval set, you won't find them until a user does.
Don't give the agent too many tools. More tools means more confusion. Most good agents have 3-7 tools. If you're reaching for 15, you're probably trying to do too many jobs in one system. Split it.
Don't confuse "the model got it right" with "the system got it right." The model can pick the right tool and the wrong input and the system still fails. Always grade at the system level.
Don't skip observability. You cannot debug an agent by rerunning it. You debug it by reading what it did last time. If you don't log every decision and every tool result, you're flying blind.
Don't rewrite the prompt when the bug is in the loop. This is the most common wasted hour I see. The model keeps producing a weird output because the loop is feeding it stale memory from step two. Fix the loop. The prompt is fine.
The honest uncertainty
I'll be direct: nobody knows what "agent designer" looks like as a full-time role in five years. The skill stack is still forming. The boundary between engineer, product designer, and prompt writer is moving every six months. If you're reading this in 2027, some of this is probably already out of date, and the next shift is on its way.
What I'm confident about: the skills above will matter for the foreseeable future. Evaluation is forever. Loop design is forever. Tool design is forever. Constraint writing is forever. The specific frameworks, providers, and tool protocols will churn. The underlying craft doesn't.
Prompting was a doorway. Agent design is the room on the other side. Walk in.
Frequently Asked Questions
Is prompt engineering dead?
No. It's just compressed. Prompt writing is now roughly 10-15% of the agent-design job instead of 80%. You still need to write clean system prompts and tool descriptions. You just also need to do four other things that didn't exist in 2023.
Do I need to learn a specific framework to become an agent designer?
No. Frameworks churn every six months. Learn the underlying skills: evaluation set design, tool design, loop control, and observability. Those transfer across every framework and API. Pick whichever framework your team already uses.
What's the single most valuable skill to develop first?
Evaluation set design. Without a good eval set, you cannot measure whether the agent works, and you cannot improve it. Everything else is downstream of this one skill.
How long does it take to get good at agent design?
A few months of real project work, not tutorials. You need to have built five or six agents from scratch, shipped them, debugged them in production, and changed your mental model at least twice. The craft is built by reps, not reading.

Founder, Tech10
Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.


