
Scoring AI Partners on Technical, Governance, and Operating Dimensions

AI Consulting · May 25, 2026 · 7 min read · Doreid Haddad

The partner evaluation framework names five dimensions; this article goes deep on the three that are most load-bearing for typical engagements: technical capability, governance maturity, and operating model. Each has detailed sub-criteria, evidence requirements, and scoring rubrics that turn impression into defensible measurement.

Technical Capability: detailed scoring

Five sub-criteria, each scored 0-5.

Sub-criterion 1.1: AI-native architecture (weight 20% within dimension).

  • 5: Product architected around AI from day one. Eval pipelines and AI observability are core infrastructure. Demonstrated through architectural artifacts.
  • 4: Strong AI focus with most native architecture in place. Some legacy elements that don't yet reflect AI-native thinking.
  • 3: AI features well-integrated but not architecturally native. Eval and observability work but added rather than designed in.
  • 2: AI features bolted onto existing product. Eval and observability ad hoc.
  • 1: AI as marketing layer over fundamentally non-AI product.
  • 0: Cannot demonstrate any AI architectural thinking.

Evidence required: architectural diagrams, eval pipeline screenshots or documentation, observability stack walkthrough.

Sub-criterion 1.2: Eval discipline (weight 25% within dimension).

  • 5: Continuous CI-integrated eval pipeline. Detailed eval sets per category. Regression catching as standard practice. Artifacts available under NDA.
  • 4: Strong eval discipline with eval sets, regular runs, regression detection. Some manual elements.
  • 3: Eval sets exist; runs happen periodically; regression sometimes caught.
  • 2: Eval discussed in proposals but artifacts vague or missing.
  • 1: "We test internally" without specific methodology.
  • 0: No eval discipline visible.

Evidence required: eval set examples (redacted), regression catch examples, eval pipeline architecture.
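
For readers who want a concrete picture of what "regression catching as standard practice" can look like, here is a minimal, purely illustrative sketch of a CI regression gate in Python. The eval categories, file names, and tolerance are hypothetical; they are not drawn from any particular vendor's pipeline.

  # Illustrative CI regression gate: compare current eval scores against a
  # stored baseline and fail the build if any category has regressed.
  import json
  import sys

  TOLERANCE = 0.02  # hypothetical: allow up to a 0.02 drop per category on a 0-1 scale

  def check_regressions(baseline_path: str, current_path: str) -> list[str]:
      with open(baseline_path) as f:
          baseline = json.load(f)  # e.g. {"summarization": 0.91, "routing": 0.87}
      with open(current_path) as f:
          current = json.load(f)
      failures = []
      for category, baseline_score in baseline.items():
          score = current.get(category)
          if score is None:
              failures.append(f"{category}: missing from current eval run")
          elif score < baseline_score - TOLERANCE:
              failures.append(f"{category}: {score:.3f} vs baseline {baseline_score:.3f}")
      return failures

  if __name__ == "__main__":
      regressions = check_regressions("eval_baseline.json", "eval_current.json")
      if regressions:
          print("Eval regressions detected:")
          for line in regressions:
              print("  - " + line)
          sys.exit(1)  # non-zero exit fails the CI job
      print("No eval regressions.")

A vendor at the 5 level should be able to show something structurally similar running on every change, together with the eval sets it reads from; the exact tooling matters far less than the habit.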

Sub-criterion 1.3: Production track record (weight 20%).

  • 5: 5+ named production deployments running 12+ months at comparable scale. References speak openly.
  • 4: 3+ named production deployments. References responsive.
  • 3: 1-2 named production deployments. Many pilots.
  • 2: Mostly pilots with limited production proof.
  • 1: Demos and consulting reports rather than running systems.
  • 0: No verifiable production track record.

Evidence required: named references, ongoing production metrics, reference call notes.

Sub-criterion 1.4: Integration depth in your stack (weight 20%).

  • 5: 5+ deployments integrated with your specific CRM/data warehouse/etc. Code samples available.
  • 4: 2-3 deployments with named systems matching yours.
  • 3: General integration claims with one or two specific examples.
  • 2: Integration discussed generically without specific stack experience.
  • 1: First integration with your stack would be on this engagement.
  • 0: No integration capability demonstrated.

Evidence required: integration documentation, code samples, customer references with your stack.

Sub-criterion 1.5: MLOps maturity (weight 15%).

  • 5: All eight practices from the MLOps maturity guide demonstrably in place.
  • 4: 6-7 of eight practices in place. Few gaps.
  • 3: 4-5 practices in place. Significant gaps in some areas.
  • 2: 2-3 practices in place; the rest missing or vague.
  • 1: One or two practices ad hoc.
  • 0: No MLOps maturity visible.

Evidence required: MLOps documentation, monitoring screenshots, incident response examples.
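
To make the weighting concrete, here is a worked example with hypothetical scores: a vendor rated 4 on AI-native architecture, 3 on eval discipline, 5 on production track record, 3 on integration depth, and 2 on MLOps maturity earns a technical capability score of 0.20×4 + 0.25×3 + 0.20×5 + 0.20×3 + 0.15×2 = 3.45 out of 5. The same arithmetic applies to the governance and operating-model dimensions below, using their own weights.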

Governance Maturity: detailed scoring

Six sub-criteria.

Sub-criterion 2.1: Regulatory mapping for your sector (weight 25%).

  • 5: Vendor names specific regulations applicable to your sector unprompted. Has documented mapping for similar past engagements.
  • 4: Vendor knows regulations when prompted. Has framework for mapping.
  • 3: General awareness of regulations; mapping happens during engagement.
  • 2: Generic compliance language; specific regulations vague.
  • 1: Compliance treated as buyer's responsibility.
  • 0: Unaware of basic applicable regulations.

Evidence required: sample regulatory map from past engagement, list of regulations vendor has navigated.

Sub-criterion 2.2: Audit trail design (weight 20%).

  • 5: Defined audit trail architecture with all required fields (input, system prompt version, model version, retrieved context, output, tool calls, timestamps). Reconstruction tested.
  • 4: Strong audit trail with most fields. Some gaps.
  • 3: Basic audit trail with input/output logging.
  • 2: Audit trail discussed but design vague.
  • 1: No audit trail design.
  • 0: Audit trail not addressed.

Evidence required: audit trail architecture document, sample log records (redacted).
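
As a purely illustrative sketch of what a single record in such an audit trail might contain, the snippet below uses hypothetical field names; what matters is that each field from the level-5 description above is present and that a full interaction can be reconstructed from stored records.

  # Hypothetical shape of one audit trail record. Field names are illustrative;
  # each maps to a field named in the level-5 rubric above. The request_id is an
  # added correlation key that makes reconstruction practical.
  audit_record = {
      "request_id": "a1b2c3d4",
      "timestamp": "2026-05-25T14:03:11Z",
      "input": "Customer asked about the refund window",      # redacted as needed
      "system_prompt_version": "support-agent-v14",
      "model_version": "provider-model-2026-03",
      "retrieved_context": ["kb/refunds.md#section-2"],        # what retrieval supplied
      "tool_calls": [{"name": "lookup_order", "args": {"order_id": "REDACTED"}}],
      "output": "Refunds are available within 30 days of purchase...",
  }

The specific storage format is the vendor's choice; the test is whether they can replay a past interaction end to end from records like this.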

Sub-criterion 2.3: Bias and fairness testing (weight 15%).

  • 5: Standard methodology with documented bias benchmarks, demographic disparity testing, adversarial red-teaming.
  • 4: Methodology covers most concerns. Some gaps in scope.
  • 3: Bias testing conducted but methodology informal.
  • 2: Bias mentioned but not systematically tested.
  • 1: No bias testing methodology.
  • 0: Bias as concept not addressed.

Evidence required: sample bias test report, methodology document.

Sub-criterion 2.4: Model documentation (weight 15%).

  • 5: Model cards / system cards as standard deliverable. Sample available.
  • 4: Documentation produced for past engagements. Available on request.
  • 3: Documentation discussed; not standard practice.
  • 2: Documentation as ad hoc deliverable.
  • 1: No documentation practice.
  • 0: Model documentation not addressed.

Evidence required: sample model card, documentation template.

Sub-criterion 2.5: Incident response procedures (weight 15%).

  • 5: Documented incident response plan with criteria, roles, communication templates. Sample report available.
  • 4: Plan exists with most elements.
  • 3: Plan discussed; documentation light.
  • 2: Incidents handled ad hoc.
  • 1: No incident response thinking.
  • 0: Topic not addressed.

Evidence required: incident response plan template, sample incident report (redacted).

Sub-criterion 2.6: Data lineage and consent (weight 10%).

  • 5: Documented data lineage with consent basis, retention policies, deletion procedures.
  • 4: Lineage tracking with most elements.
  • 3: Basic lineage thinking.
  • 2: Topic discussed without specifics.
  • 1: No lineage practice.
  • 0: Topic not addressed.

Evidence required: sample data lineage record, consent documentation.

Operating Model: detailed scoring

Five sub-criteria.

Sub-criterion 3.1: Engagement structure (weight 25%).

  • 5: Detailed engagement structure with named phases, milestone-based deliverables, explicit out-of-scope items.
  • 4: Structure documented with most elements.
  • 3: Engagement structure described; specifics light.
  • 2: Engagement structure vague.
  • 1: No clear structure.
  • 0: Engagement description marketing-only.

Evidence required: sample engagement structure from past engagement, deliverable list.

Sub-criterion 3.2: Communication and escalation (weight 15%).

  • 5: Defined cadence (weekly steering committee, daily standup, etc.), named escalation path, response time SLAs.
  • 4: Most elements defined.
  • 3: General communication approach.
  • 2: Communication TBD during engagement.
  • 1: Light communication culture.
  • 0: Topic not addressed.

Evidence required: sample communication plan, response time commitments.

Sub-criterion 3.3: Pricing structure (weight 25%).

  • 5: Transparent milestone-based pricing with documented assumptions and explicit handling of scope changes.
  • 4: Mostly transparent with small gaps.
  • 3: Time-and-materials with milestone reviews.
  • 2: Fixed-fee on scope that's underspecified.
  • 1: Pricing structure punishes growth or has hidden cliffs.
  • 0: Pricing opaque or unstable.

Evidence required: sample pricing schedule, change order pricing process.

Sub-criterion 3.4: Post-deployment support (weight 20%).

  • 5: Defined transition with documented runbook, optional ongoing tier, clear separation of build and operate.
  • 4: Strong transition; some elements light.
  • 3: Transition addressed; ongoing support implied.
  • 2: Transition vague; ongoing dependency expected.
  • 1: No transition plan; ongoing engagement assumed.
  • 0: Topic not addressed.

Evidence required: sample transition plan, runbook template.

Sub-criterion 3.5: Vendor risk and sub-processors (weight 15%).

  • 5: Documented sub-processor list, contractual terms covering them, security review of each.
  • 4: List exists with most elements.
  • 3: Sub-processors discussed; documentation light.
  • 2: Sub-processors implicit; not explicitly managed.
  • 1: Sub-processors not addressed.
  • 0: Topic not addressed.

Evidence required: sub-processor list, security review documentation.

How to apply the scoring

Score each sub-criterion 0-5 with documented evidence reference. Compute dimension score as weighted average of sub-criterion scores. Compare across vendors.
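
If it helps to make the mechanics unambiguous, here is a minimal Python sketch of the computation. The weights follow the technical capability rubric above; the vendor names and scores are hypothetical.

  # Dimension score = weighted average of 0-5 sub-criterion scores,
  # using the weights defined in the rubrics above.
  TECHNICAL_WEIGHTS = {
      "ai_native_architecture": 0.20,
      "eval_discipline": 0.25,
      "production_track_record": 0.20,
      "integration_depth": 0.20,
      "mlops_maturity": 0.15,
  }

  def dimension_score(scores: dict[str, float], weights: dict[str, float]) -> float:
      """Weighted average of 0-5 sub-criterion scores for one dimension."""
      return sum(weights[name] * scores[name] for name in weights)

  # Hypothetical vendor scores for the technical capability dimension.
  vendors = {
      "Vendor A": {"ai_native_architecture": 4, "eval_discipline": 3,
                   "production_track_record": 5, "integration_depth": 3,
                   "mlops_maturity": 2},
      "Vendor B": {"ai_native_architecture": 3, "eval_discipline": 4,
                   "production_track_record": 2, "integration_depth": 4,
                   "mlops_maturity": 3},
  }

  for vendor, scores in vendors.items():
      print(vendor, round(dimension_score(scores, TECHNICAL_WEIGHTS), 2))
  # Vendor A -> 3.45, Vendor B -> 3.25

Repeat with the governance and operating-model weights, then compare both the totals and each vendor's lowest dimension score, since that is where the next paragraph argues the risk concentrates.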

For most engagements, the lowest-scoring dimension drives risk. A vendor strong on technical capability and governance but weak on operating model will deliver good work that is difficult to manage. A vendor strong on technical capability and operating model but weak on governance will deliver work that fails compliance review.

The scoring is most useful for surfacing which dimension is the weakness, not just for picking the highest total. Mitigate weak dimensions through contract terms or third-party support.

The honest takeaway

Detailed scoring across technical capability, governance maturity, and operating model produces defensible decisions. Each dimension has 5-6 sub-criteria with explicit rubrics and evidence requirements.

Most evaluations skip this rigor, and the result is decisions that look reasonable in the moment but don't predict delivery. Detailed scoring takes more time per vendor, but the time is well spent: the cost of choosing the wrong partner is dramatically higher than the cost of scoring carefully.

Apply the rubrics. Document evidence. Score consistently across vendors. Look at the lowest sub-criterion score for each vendor — that's where engagement risk concentrates. Mitigate with contract terms. Decide on the documented evidence rather than the impression.

Frequently Asked Questions

Which dimension is most predictive of engagement success?

Technical capability for build engagements, governance maturity for regulated-industry engagements, operating model for ongoing managed services. Most engagements are mixed, so all three matter. Score them all and look at the lowest score — that's usually where the engagement will struggle.

Should I share my scoring with the vendor?

Yes: share the dimensions and weights, but not the specific scores. Sharing dimensions and weights lets the vendor understand what you care about and tailor their evidence accordingly. Sharing specific scores gives them too much leverage to negotiate the scoring rather than the substance.

Written by Doreid Haddad

Founder, Tech10

Doreid Haddad is the founder of Tech10. He has spent over a decade designing AI systems, marketing automation, and digital transformation strategies for global enterprise companies. His work focuses on building systems that actually work in production, not just in demos. Based in Rome.

