Most Teams Don't Need Fine-Tuning. Here's What They Actually Need.

Most teams who tell me "we need to fine-tune" are wrong. Not a little wrong. About 90% wrong. In a decade of building AI systems for mid-market companies, I can count on one hand the projects where fine-tuning was the right first move. Everywhere else, the problem was something cheaper, faster, and more boring.
That's my view. Here's why.
The meeting that keeps repeating itself
The meeting usually opens the same way. Someone says, "We tried ChatGPT. It didn't understand our business. We need to fine-tune a model on our data."
The sentence has three load-bearing claims, and they're all worth interrogating.
"It didn't understand our business." Did you give it a system prompt describing your business? Did you include two worked examples? If not, the model isn't the problem. The prompt is.
"Our data." Which data? Documents, structured records, historical decisions, style examples? Each of these wants a different tool, and only one of them is fine-tuning territory.
"We need to fine-tune." Based on what? The team usually names fine-tuning because it sounds like the real answer. It's specific. It's technical. It's concrete in a way that "write a better prompt" isn't. That's the appeal, and it's also the trap.
I used to sit through these meetings politely. Now I stop them. "Before we talk about fine-tuning, walk me through the prompt you wrote and the eval set you ran it against." Most teams can't produce the eval set. Some can't produce either. That's where the real conversation starts.
The three questions that end most fine-tuning projects
Run these before you draft a statement of work. If you can't give a clean answer to all three, you don't have a fine-tuning project yet.
Question 1: What does "good" look like, measured?
Not described. Measured. Scored out of 10. Graded by a rubric. Compared against a baseline. Evaluated on a specific set of 50 cases by at least two people who agree.
Most teams can't produce this. They have a vibe. They know good when they see it. But when you ask for 50 graded examples, the conversation shifts. Sometimes they produce a good set in a week. Most of the time the set reveals that "good" means different things to different stakeholders, and now you have a specification problem, not an AI problem.
Fine-tuning without an eval set is $20,000 lit on fire. You'll ship the model, it'll feel different, somebody will say "no, not that different," and you won't know how to adjust.
Question 2: Did a well-written prompt fail?
A well-written prompt means: 300 to 800 words of system prompt that describes the task, the constraints, the format, and the voice; two to three complete worked examples inside the prompt; temperature and other sampling parameters tuned; a fallback for ambiguous input.
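That checklist is concrete enough to sketch in code. Here's a minimal, illustrative version; the section names and example content are placeholders, not a real client's prompt:

```python
# Sketch of assembling a "well-written prompt" from the parts above.
# All section names and example content are illustrative placeholders.

WORKED_EXAMPLES = [
    {"input": "Customer asks about the refund window.",
     "output": "Refunds are available within 30 days of purchase. ..."},
    {"input": "Customer asks about enterprise pricing.",
     "output": "Enterprise pricing is quoted per seat. ..."},
]

def build_system_prompt(task, constraints, output_format, voice, examples):
    """Combine task, constraints, format, voice, worked examples,
    and an ambiguous-input fallback into one system prompt string."""
    parts = [
        f"## Task\n{task}",
        f"## Constraints\n{constraints}",
        f"## Output format\n{output_format}",
        f"## Voice\n{voice}",
        "## Worked examples",
    ]
    for ex in examples:
        parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    parts.append("## Fallback\nIf the input is ambiguous, ask one "
                 "clarifying question instead of guessing.")
    return "\n\n".join(parts)
```

The helper isn't the point. The point is that all four parts plus the worked examples become explicit, versioned text you can diff between eval runs.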
Most teams haven't done this. They tried a 50-word prompt, saw weak output, and concluded the model can't do it. Which is like concluding your new hire can't do the job because you didn't describe the job.
If you haven't run the prompt experiment, there's nothing to fine-tune. Go do that first. Most times, you won't come back.
Question 3: Is the problem behavior or knowledge?
If the model gets the format right but uses the wrong facts, that's a knowledge problem. RAG.
If the model gets the facts right but the format wrong, that's a behavior problem. Fine-tuning, maybe.
If the model gets both wrong, you have a prompt problem first. Don't skip that step.
This question alone disqualifies maybe 60% of "fine-tuning" projects. Teams confuse knowledge problems with behavior problems constantly. They think the model doesn't know their company, when really the model doesn't have access to their documents.
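The rule of thumb above is mechanical enough to write down. A toy sketch, assuming you've already judged facts and format against your eval set:

```python
def triage(facts_correct: bool, format_correct: bool) -> str:
    """Map the behavior-vs-knowledge diagnosis to a next step.
    Mirrors the rule of thumb: wrong facts -> RAG, wrong format ->
    maybe fine-tuning, both wrong -> fix the prompt first."""
    if not facts_correct and not format_correct:
        return "prompt"               # both wrong: prompt problem first
    if not facts_correct:
        return "rag"                  # knowledge problem
    if not format_correct:
        return "fine-tuning (maybe)"  # behavior problem
    return "ship it"                  # nothing wrong
```

Two booleans and four branches. If your diagnosis can't fill in the two booleans, you're back at Question 1: you don't have an eval set.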
What people actually need when they say they need fine-tuning
In order of how often it comes up:
1. A better prompt (50% of cases). I've seen projects saved by 15 minutes of prompt rewriting. One client was about to spend $25,000 on a fine-tune because their chatbot "didn't understand their product." The problem was a 40-word prompt that didn't include a single product name. Once we expanded it to a 600-word prompt with three worked examples, accuracy went from 60% to 90% on the eval set they hadn't built yet.
2. RAG (30% of cases). The team thinks they need fine-tuning because the model doesn't know their internal docs. They actually need retrieval. RAG solves the knowledge problem in two weeks instead of the two months a fine-tune takes. We go through this in RAG vs fine-tuning: which one solves your problem.
3. A better eval set (10% of cases). The team has been iterating on prompts for weeks with no clear sense of whether they're getting better. An eval set of 50 cases across the task's diversity, with a clear rubric, would tell them in an afternoon whether version 7 beats version 6. Without it, they're guessing.
4. A different model (5% of cases). Sometimes the model really is the issue. The team was using GPT-4o-mini when the task needed Opus 4.6 or GPT-5. Upgrading solves it. This gets diagnosed in 20 minutes if you know to look for it.
5. Actual fine-tuning (5% of cases). The remaining cases where fine-tuning is genuinely right are the ones covered in when fine-tuning actually beats RAG. Structured output at scale, domain jargon, style at volume, high-throughput cost optimization, adversarial refusal behavior. Real cases, but rare.
That's not a theoretical distribution. That's what comes through the door.
Why fine-tuning keeps getting picked anyway
Four reasons.
It sounds like the serious answer. "We fine-tuned a model on our data" is a sentence you can say at a board meeting. "We wrote a better prompt" isn't, even if it's the better answer. Organizations reward complexity over simplicity, and fine-tuning is complex enough to look like real AI work.
Vendors sell it. Fine-tuning is a revenue line for API providers. Every vendor has a fine-tuning product. Every vendor has success stories. None of the success stories are framed as "actually you could have solved this with prompting, but we took your money."
It feels like control. Fine-tuning gives you a model that's "yours." Never mind that the underlying architecture is still the vendor's and they can change it on Tuesday. The feeling of ownership is valuable to teams that don't trust vendor behavior, even if the ownership is largely illusory.
Nobody budgets for prompt engineering. A budget line for "fine-tuning project: $30,000" gets approved. A budget line for "prompt engineering and eval set: 40 hours" gets questioned. The economics of approval push teams toward the bigger, more expensive option. Which is exactly backwards.
What I'd do in the first two weeks instead
Say I've landed a new AI project at a mid-market company, and the team thinks it needs fine-tuning. Here's how I'd spend the first two weeks.
Week 1: Define the task and build the eval set.
Map the workflow. What's the input, what's the output, what does good look like? Get 50 real examples with graded target outputs. If you can't get 50, get 20 and add more as you go. Run the eval against the stock model with a baseline prompt to establish where you're starting.
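A minimal sketch of what that week-1 harness might look like. The case fields, the toy grading rule, and passing the model as a plain callable are all illustrative assumptions, not a prescribed format:

```python
# Minimal week-1 eval harness sketch. The case fields, the toy grading
# rule, and the model as a plain callable are illustrative assumptions.

def grade(output: str, case: dict) -> bool:
    """Toy rubric: pass if the output mentions every required phrase.
    A real rubric is a scored checklist or a human grader."""
    return all(p.lower() in output.lower() for p in case["must_mention"])

def run_eval(model, cases) -> float:
    """Run every eval case through the model; return the pass rate."""
    results = [grade(model(case["input"]), case) for case in cases]
    return sum(results) / len(results)
```

Run it once against the stock model with the baseline prompt and you have the starting number that every later change gets compared to.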
Week 2: Prompt and RAG baseline.
Rewrite the prompt carefully. Add worked examples. Run the eval. Score it. If quality clears the bar, you're done. If it doesn't clear the bar but the failures are knowledge-based, build a minimal RAG stack over the relevant documents. Run the eval again.
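If the failures are knowledge-based, the minimal RAG stack can start as small as this sketch: word-overlap retrieval over the documents, stuffed into the prompt. A real stack uses embeddings and a vector store, but the shape is the same, and all names here are illustrative:

```python
# Minimal RAG sketch: word-overlap retrieval stuffed into the prompt.
# A real stack uses embeddings and a vector store; the shape is the same.

def retrieve(query: str, documents: list, k: int = 2) -> list:
    """Return the k documents sharing the most words with the query."""
    q_words = set(query.lower().split())
    def overlap(doc):
        return len(q_words & set(doc.lower().split()))
    return sorted(documents, key=overlap, reverse=True)[:k]

def build_rag_prompt(system_prompt: str, query: str, documents: list) -> str:
    """Prepend retrieved context so the model answers from the documents."""
    context = "\n\n".join(retrieve(query, documents))
    return (f"{system_prompt}\n\n## Retrieved context\n{context}"
            f"\n\n## Question\n{query}")
```

That's a few hours of work, not two months. If even this crude version moves the eval numbers, retrieval was the bottleneck, and a proper embedding-based stack will move them further.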
At the end of week 2, one of three things is true:
- Quality cleared the bar. Ship it. Fine-tuning never came up.
- Quality's close but not there. Look at the specific failure modes. Usually this is prompt refinement, not fine-tuning.
- Quality can't close the gap with prompting or RAG. Now we have a real fine-tuning case, and we have the eval set to validate it.
That third case happens maybe 1 in 10 projects. For the other 9, the team saved $20,000 to $40,000 and shipped in two weeks instead of two months.
Where fine-tuning actually earns its keep
I want to be careful not to argue that fine-tuning never works. It does. The cases where I'd pay for it in 2026:
- 40,000+ daily structured output calls where format accuracy has to be 99%+
- Domain vocabulary where the base model is consistently wrong in ways a prompt can't fix
- Rejection and refusal behavior in regulated environments
- High-volume classification at a cost where a smaller fine-tuned model beats a frontier model with a big prompt
That's it. A short list. The other ninety-something percent should be prompt engineering and RAG.
The honest version of AI work in 2026 is this: the model is 10 to 20% of the project. The workflow, the data, the eval set, and the prompt are 80 to 90%. Fine-tuning tries to replace some of the latter with more of the former, and it almost always costs more than the alternative.
I'd rather a team ship a prompt-engineered version in two weeks and learn the workflow doesn't actually deliver value than spend three months on a fine-tune for a process that should have been killed. That's the frustrating part of most AI projects. The thing people want to optimize last is the thing they should have defined first.
The short version
- You probably don't need fine-tuning
- You probably need a better prompt, an eval set, or RAG
- The three questions before you fine-tune: What does good look like, measured? Did a well-written prompt fail? Is the problem behavior or knowledge?
- If you can't answer all three, don't fine-tune yet
- If you can answer all three and the case still stands, fine-tune
Save the $20,000. Spend it on eval infrastructure. You'll use that forever.
Frequently Asked Questions
How do I know if my problem really needs fine-tuning?
Run the three-question test. If you don't have a scored eval set, if you haven't tried a careful prompt, or if the problem is missing knowledge rather than wrong behavior, you don't have a fine-tuning problem yet. You have something else.
What's the cheapest path to better AI output?
In order: a better prompt, an eval set, RAG, a different model, fine-tuning. Almost always in that order. Teams who reverse the order spend 5 to 10 times what they need to.
Is prompt engineering going away?
No. The name changes, the techniques evolve, but the practice of telling a model exactly what you want with carefully chosen examples will outlast every current model version. It's the cheapest skill in AI work with the biggest payoff.
What if my team already tried prompting and it didn't work?
Ask for the prompt. If it's under 200 words and doesn't include worked examples, it's not a serious attempt. A real prompt-engineered version looks like 500 to 1,000 words of system instructions plus two or three complete example exchanges. Most "we tried prompting" teams tried a small fraction of that.
Sources
- Microsoft Research on arXiv — Fine-Tuning or Retrieval? Comparing Knowledge Injection
- arXiv — Fine Tuning LLMs for Enterprise: Practical Guidelines
- Anthropic — Anthropic Claude API Pricing
- OpenAI — OpenAI API Pricing

Founder, Tech10
Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.


