Data Readiness in AI Consulting: The 60-70% Problem

The under-discussed reality of generative AI consulting is that 60-70% of the work is data preparation, not model building. Per Deloitte's analysis on data readiness, "data readiness helps organizations invest in data infrastructure and formulate AI strategies." EisnerAmper's foundational framing is even sharper: "AI success depends on data readiness." The model is the small visible part of the engagement; the data work is the iceberg below. Engagements that scope realistically build the data work into the timeline and budget; engagements that don't either overrun or ship AI on data that isn't ready, which produces fragile systems.
This article is the honest framing: why data work dominates, how to scope it, what a good audit looks like, and the contract terms that protect against scope shock.
Why data work dominates
Three structural reasons:
Most business data wasn't designed for AI. Customer records, transaction logs, document repositories, support tickets — these were designed for transactional systems and human reading. AI needs data that is structured, clean, embedded, indexed, and accessible. Bridging the gap is real engineering work.
Domain context is in the data. The AI's value comes from what your data contains specifically. Generic models with no data context produce generic output. Useful AI requires the model to access your data accurately, which requires the data to be in shape.
Data quality issues compound silently. Bad data inputs produce bad AI outputs in subtle ways. The AI confidently produces wrong answers because the underlying data was wrong. Catching this requires the kind of audit and cleanup that's invisible in proposals but load-bearing in deployments.
The 60-70% figure is consistent across enterprise generative AI engagements, per practitioner consensus and the figures surfaced in Google AI Overview summaries. The percentage shifts by use case, but the dominance of data work holds.
What "data work" actually includes
The work is more varied than buyers expect:
Data audit: what data exists, where it lives, what quality it's in, what gaps exist. Usually 1-2 weeks of focused work for a first pass; the full five-output version described later takes 2-3 weeks.
Data sourcing: identifying which systems hold the data needed, what access is required, what permissions are needed. Often slower than expected because of organizational coordination.
Data cleaning: removing duplicates, fixing format inconsistencies, handling missing values, normalizing entries. Usually the largest single time sink.
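As a concrete illustration, the deduplication and normalization pass might look like this minimal Python sketch. The record fields (`email`, `plan`) and the normalization rule are hypothetical examples, not a prescription:

```python
def clean_records(records):
    """Deduplicate by normalized email; route records missing the field to review."""
    seen = set()
    cleaned, incomplete = [], []
    for rec in records:
        # Normalize: strip whitespace, lowercase (handles format inconsistencies)
        email = (rec.get("email") or "").strip().lower()
        if not email:
            incomplete.append(rec)   # missing key field: flag for manual review
            continue
        if email in seen:            # exact duplicate after normalization
            continue
        seen.add(email)
        cleaned.append({**rec, "email": email})
    return cleaned, incomplete

raw = [
    {"email": "Ana@Example.com ", "plan": "pro"},
    {"email": "ana@example.com", "plan": "pro"},   # duplicate after normalization
    {"email": None, "plan": "free"},               # missing email
]
cleaned, incomplete = clean_records(raw)
```

Real cleanup is messier (fuzzy duplicates, conflicting field values, encoding issues), which is exactly why this step tends to be the largest single time sink.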
Data structuring: transforming data from its native format into something AI-usable. For RAG systems, this means chunking documents and creating embeddings. For fine-tuning, this means labeling and formatting training examples.
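The chunking step for RAG can be sketched as fixed-size windows with overlap. The chunk and overlap sizes below are illustrative; production systems typically chunk by tokens or semantic boundaries rather than raw characters:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows for embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap   # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break   # last window already covers the tail
    return chunks

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=50)
```

The overlap exists so that a sentence split across a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing answers that straddle boundaries.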
Data labeling: if the use case requires labeled data, the labeling effort can be significant. Domain experts' time, label quality assurance, and inter-annotator agreement work.
Data infrastructure: vector databases, embedding pipelines, retrieval logic, feedback capture systems. The infrastructure that makes data continuously usable rather than a one-time export.
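The retrieval logic at the core of that infrastructure reduces, in miniature, to similarity search over embedding vectors. A toy sketch follows; the document IDs and two-dimensional vectors are stand-ins for real embeddings, and a production system would use a vector database rather than a Python list:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """index: list of (doc_id, vector) pairs. Returns doc_ids ranked by similarity."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical pre-computed embeddings for three document chunks
index = [
    ("refund-policy", [0.9, 0.1]),
    ("pricing", [0.2, 0.8]),
    ("onboarding", [0.7, 0.3]),
]
```

Everything around this core (embedding pipelines, index refresh, feedback capture) is what turns a one-time export into continuously usable infrastructure.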
Data lineage and consent: documenting where data came from, what consent basis exists, how to handle deletion requests. Usually skipped at SMB scale, essential at enterprise.
Ongoing data refresh: as the business produces new data, getting it into the AI system. Not glamorous; required for any system that doesn't degrade over time.
Each component has its own timeline and complexity. Lumping them together as "data work" obscures the variety.
What a real data audit looks like
A useful data audit produces five outputs. Per Salesforce's data readiness framework, the foundational questions are: is data unified and harmonized, are identities resolved and information current, is there clear ownership and governance, is access provisioned correctly, and can data flow into AI systems without manual intervention. The five audit outputs map to those questions:
Output 1: Data inventory. Every data source relevant to the use case, with location, format, size, owner, access requirements.
Output 2: Quality assessment. For each source: completeness (% of records with required fields), consistency (% of fields with valid values), accuracy (sampling-based estimate), and timeliness (how often updated).
Output 3: Coverage gap analysis. Where does the data not cover the use case? Which customers, products, time periods, or scenarios are missing or thin? Coverage gaps are usually invisible until specific queries surface them.
Output 4: Compliance review. Which data has consent basis for AI use? Which data has restrictions? Which data needs deletion procedures supported?
Output 5: Effort estimate. Based on findings 1-4, what's the realistic data preparation effort? Time, cost, dependencies on internal resources.
This audit takes 2-3 weeks of focused work to produce well. Most AI engagements skip it or produce a 1-page version. The skipping is usually visible later as scope shock.
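The completeness and consistency metrics from Output 2 can be sketched in a few lines of Python. The required field names and the validity rule here are illustrative assumptions, not a standard:

```python
REQUIRED = ("customer_id", "email", "created_at")   # hypothetical required fields

def completeness(records):
    """Share of records in which every required field is present and non-empty."""
    ok = sum(1 for r in records if all(r.get(f) for f in REQUIRED))
    return ok / len(records) if records else 0.0

def consistency(records, field, is_valid):
    """Share of records whose `field` passes a validity predicate."""
    ok = sum(1 for r in records if is_valid(r.get(field)))
    return ok / len(records) if records else 0.0

records = [
    {"customer_id": "c1", "email": "a@x.com", "created_at": "2024-01-03"},
    {"customer_id": "c2", "email": "", "created_at": "2024-02-11"},
]
```

Accuracy and timeliness don't reduce to one-liners the same way: accuracy needs sampling against ground truth, and timeliness needs knowledge of each source's update cadence, which is part of why the audit takes weeks rather than days.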
Common data readiness failures
Failure 1: "We have lots of data." Volume is not readiness. A million customer support tickets in unstructured email format is not the same as a million tickets indexed by issue type with labeled outcomes.
Failure 2: "Our data is clean." Almost always false at any scale. The team that says this hasn't audited recently. Real data quality assessments typically find issues in 30-60% of records.
Failure 3: "We can clean it up later." "Later" rarely happens. Data cleanup deferred during AI development becomes the dominant pain point in production. Better to do it upfront than to manage around it indefinitely.
Failure 4: "The vendor will handle data." Vendors can do data work but the cost is real and often hidden. Ask for explicit data work in the SOW with named deliverables and effort allocation.
Failure 5: "We just need a chatbot." Use cases are connected to data. The chatbot is only as good as the data it can access. Underestimating data work for "simple" use cases is a common pattern.
How to scope data work in a consulting engagement
Three contract structures:
Structure 1: Time-and-materials with milestone reviews. The vendor estimates data work as best they can, charges T&M during execution, and milestones force checkpoints to assess scope drift. Best for engagements where data state is unknown.
Structure 2: Fixed-fee data audit + variable data preparation. Fixed price for the 2-3 week audit, then variable pricing for data preparation based on what the audit reveals. Best for buyers who want certainty on the audit phase but acknowledge that preparation scope is unknown.
Structure 3: Capped fixed-fee with explicit out-of-scope categories. Fixed price for typical data work with explicit list of conditions that trigger change orders (e.g., "if data quality issues exceed 30% of records," "if additional data sources need to be sourced beyond the initial 5"). Best for engagements where the data state is reasonably understood.
The structure to avoid: pure fixed-fee with no data discovery phase. This pattern produces under-delivered data work because the vendor protects margin by cutting cleanup.
Realistic data work timing
For typical generative AI engagements:
4-week pilot or proof-of-concept: 2-3 weeks of data work, 1-2 weeks of model and integration. Data dominates.
8-week focused build: 4-5 weeks of data work, 3-4 weeks of model and integration.
16-week production deployment: 9-11 weeks of data work, 5-7 weeks of model and integration plus governance.
The pattern: data work scales nearly linearly with engagement size. Engagements that show different ratios are either over-promising the model work or under-investing in data work.
When data work is truly minimal
A few cases where data work really is the smaller portion:
Pure prompting use cases. Drafting emails, summarizing documents (where the document is provided per request), generating ideas. The "data" is the prompt itself; the model handles everything else.
Off-the-shelf RAG with prepared content. When using a tool that has its own knowledge base built in (as some customer service AI tools do), your data work is integration only.
Single-source structured data. If your use case operates on data from one well-structured system (e.g., your CRM), data work is integration rather than cleanup.
These cases exist but are exceptions, not the norm. Most production AI work hits the 60-70% data figure.
What good vendors do differently
Strong AI consulting vendors:
- Build data audit into early-stage scoping
- Require access to a data sample before locking scope
- Show past engagements where data work was honestly measured
- Use contract structures that adapt to data discovery
- Document data preparation as a deliverable
Weak vendors:
- Skip data audit and produce vague scope
- Lock fixed-fee before understanding data state
- Refer to data work as "we'll figure that out"
- Cut data preparation when scope pressure builds
- Don't document what data work was actually done
The pattern is visible during scoping. Buyers who watch for it filter strong from weak vendors quickly.
The honest takeaway
60-70% of AI consulting work is data preparation. This is consistent across most production generative AI use cases. Buyers who plan for it scope realistically; buyers who don't hit scope shock 4-6 weeks into the engagement.
A real data audit produces inventory, quality assessment, coverage analysis, compliance review, and effort estimate. Two to three weeks of focused work. Worth doing before the build engagement signs.
Contract structures that work: T&M with milestones, fixed audit + variable preparation, capped fixed-fee with explicit out-of-scope categories. Pure fixed-fee with no discovery phase usually produces under-delivered data work.
Plan for the data work. Insist on the audit. The 60-70% is reality whether the engagement acknowledges it or not. Engagements that acknowledge it ship; engagements that don't usually ship something less than promised.
Frequently Asked Questions
Is the 60-70% data work figure accurate across AI use cases?
Yes for most production AI work. The percentage varies: RAG-heavy use cases trend toward 70%+ data work, while prompt-only use cases can run as low as 30%. The Google AI Overview surfaces 60-70% as the standard figure across enterprise generative AI work, and our experience tracks with it. Use cases where data isn't load-bearing are rare in production.
Can data readiness work be scoped accurately upfront?
Partially. Initial scoping based on a data audit gets you within 30%. Real scope emerges as data preparation begins and unknown problems surface. The honest contract structure assumes scope discovery during the first 2-3 weeks of data work and adjusts accordingly. Contracts that lock data scope upfront usually produce overruns or under-delivery.
Sources
- Deloitte — Transforming AI Outcomes with Effective Data Readiness
- EisnerAmper — AI Data Readiness: Why Data Foundations Matter
- Salesforce — 5 Ways to Measure Your Data Readiness for an AI Agent
- NIST — AI Risk Management Framework
- McKinsey QuantumBlack — The state of AI in 2026
- Stanford HAI — AI Index Report 2026
- Gartner — Generative AI Consulting and Implementation Services

Founder, Tech10
Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.


