Case Study: Quality Assurance (B2B SaaS, Cybersecurity)

A 5-dimension QA framework for AI and humans, held to the same standard.

A walkthrough of how I designed a weekly scoring framework that keeps support quality consistently high, surfaces training and coaching needs proactively, and treats AI and human responses as part of the same operating system rather than two separate things.

5 quality dimensions, scored 1 to 5
Weekly sample reviewed and scored
Both AI and human responses, same rubric

Why I built this

Most support quality frameworks are built to assess humans. Once AI enters the conversation, those frameworks tend to fall over, and most teams either invent something new for the AI or stop measuring quality entirely on that channel.

I wanted to avoid both. AI quality drifts the same way human quality drifts: slowly, invisibly, and with real consequences for trust if no one's looking. An AI that's allowed to roam freely without monitoring will erode customer trust just as effectively as a human can, and arguably faster, because there's no individual conscience pulling it back into line.

I built this on my own initiative, not in response to a complaint or a leadership request. The goal was to hold AI and human responses to the same standard, keep support quality consistently high, and surface training and coaching needs before they became problems.

The five dimensions

Each conversation in the weekly sample is scored 1 to 5 across five dimensions. The dimensions were chosen specifically so they apply equally to AI and human responses.

01
Accuracy

Is the content of the response factually correct and current? No making up answers, no assuming context, no pulling from outdated sources. This is the single most important dimension, because everything downstream relies on it being right.

02
Completeness

Did the response actually enable the user to move forward? A factually accurate but partial answer can leave a user with more questions than they started with. Completeness asks whether the customer feels empowered, not just answered.

03
Tone

Does the response match brand voice and audience expectations? Is it friendly without being casual, professional without being cold, using the right terminology for the user's role and context? Tone is what makes support feel like part of the product rather than an interruption to it.

04
Efficiency

Is the response asking the right questions, troubleshooting in the right order, and not leading the user astray? Efficiency isn't speed for its own sake. It's whether the path to resolution is the shortest defensible one.

05
Documentation

This dimension is two-sided. Internally, it asks whether documentation exists for how the AI operates, how to disable or escalate from it, and how the team should handle each scenario. Externally, it asks whether the AI has the documentation it needs to succeed: are the help articles in place, is the self-serve coverage adequate, did we give it the right access to do the job.

If documentation is the gap, the failure isn't with the AI. It's with what we gave it.

How the scoring actually works

The framework runs on a weekly review cadence. Conversations are sampled, reviewed, and scored against the five dimensions. The scoring stays straightforward: 1 to 5 on each dimension, with notes attached where context matters.
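To make the mechanics concrete, here is a minimal sketch of how a week's sample could be represented and rolled up. It is illustrative only: the Dimension enum, ConversationScore class, and weekly_rollup helper are hypothetical names for the sake of the example, not the actual internal tooling, and real reviews carry far more context than a number and a note.

from dataclasses import dataclass, field
from enum import Enum
from statistics import mean


class Dimension(Enum):
    """The five rubric dimensions, applied to AI and human responses alike."""
    ACCURACY = "accuracy"
    COMPLETENESS = "completeness"
    TONE = "tone"
    EFFICIENCY = "efficiency"
    DOCUMENTATION = "documentation"


@dataclass
class ConversationScore:
    """One reviewed conversation: a 1-to-5 score per dimension, plus optional notes."""
    conversation_id: str
    responder: str  # "ai" or "human"; the rubric is identical either way
    scores: dict[Dimension, int]
    notes: dict[Dimension, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every dimension must be scored, and every score must sit on the 1-to-5 scale.
        missing = set(Dimension) - set(self.scores)
        if missing:
            raise ValueError(f"unscored dimensions: {sorted(d.value for d in missing)}")
        for dim, value in self.scores.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{dim.value} scored {value}, outside the 1-to-5 scale")


def weekly_rollup(sample: list[ConversationScore]) -> dict[Dimension, float]:
    """Average each dimension across the week's sample to see where quality sits."""
    return {dim: round(mean(c.scores[dim] for c in sample), 2) for dim in Dimension}

The design choice worth noting is that the responder field is just a label: AI and human conversations flow through the same structure and the same rollup, which is the whole point of the framework.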

What makes the framework actually useful, rather than another report nobody reads, is who's brought into the conversation each week.

This rhythm means strengths and weaknesses surface early enough to be addressed proactively, and patterns become visible across weeks rather than buried in individual tickets.

The hardest parts

Three things were genuinely difficult to get right.

Recalibration is the work, not a side effect

A scoring rubric that doesn't get recalibrated is a broken rubric. Customer expectations change. Product complexity changes. AI capability changes. I treat ongoing recalibration as part of the framework itself, not an annual exercise. A model that doesn't improve is a model that's actively decaying.

Initial pushback

There was some initial pushback when I introduced the framework, which is fair. Being scored is uncomfortable. The framing that worked was honest: this isn't about catching anyone out, it's about making sure the team gets stronger onboarding, sharper training, and more specific coaching. Once that landed, the pushback shifted into engagement.

Making subjective dimensions objective

Tone and Efficiency are the dimensions where reasonable people can disagree. The fix wasn't to remove the subjectivity, it was to discuss it openly. When a score on Tone gets challenged, that conversation itself sharpens what "good" actually looks like in our context, and the rubric gets clearer for everyone.

Operating principle

Score for trends, not for punishment. The point of the rubric is to make the function better, not to keep a record of who got what wrong.
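As a rough illustration of what "score for trends" can mean in practice, here is a sketch that compares the latest week's per-dimension averages against the weeks before it. The dimension names, the 0.5 drift threshold, and the dimension_trends function are assumptions made for the example, not the framework's actual tooling.

from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "tone", "efficiency", "documentation")


def dimension_trends(
    weekly_averages: list[dict[str, float]],
    drift_threshold: float = 0.5,
) -> dict[str, str]:
    """Compare the latest week against the baseline of the weeks before it.

    The output is a trend signal per dimension, feeding coaching and training
    conversations rather than a record of who got what wrong.
    """
    if len(weekly_averages) < 2:
        return {dim: "not enough history yet" for dim in DIMENSIONS}

    latest, history = weekly_averages[-1], weekly_averages[:-1]
    trends = {}
    for dim in DIMENSIONS:
        baseline = mean(week[dim] for week in history)
        delta = latest[dim] - baseline
        if delta <= -drift_threshold:
            trends[dim] = f"slipping ({delta:+.2f} vs. a {baseline:.2f} baseline)"
        elif delta >= drift_threshold:
            trends[dim] = f"improving ({delta:+.2f} vs. a {baseline:.2f} baseline)"
        else:
            trends[dim] = "steady"
    return trends

The output is deliberately per dimension, not per person: a "slipping" flag on Tone prompts a coaching conversation about Tone, not a ledger entry against whoever handled the ticket.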

What it's actually delivered

The framework has driven real, structural change in how the function operates.

The principle I'd carry to any company

If I were rebuilding this somewhere else, the most important thing wouldn't be the rubric itself. It would be how the rubric gets created.

QA frameworks fail when they're imposed top-down and treated as a compliance exercise. They work when they're built collaboratively, with the team involved in defining what good looks like, what each dimension actually means in context, and where the bar should sit. A scorecard the team helped design is one the team will defend. A scorecard handed to them is one they'll quietly resent.

The more knowledge in the room when the rubric is being built, the stronger the rubric becomes. And the stronger the team becomes alongside it.


Why this matters more than it looks

A QA framework is, on the surface, an internal ops tool. In practice, it's the document that makes a support function defensible at the leadership level. When someone asks "how do you know your AI is working?" or "how do you maintain quality at scale?" or "how do you onboard people consistently?", the answer isn't an opinion or an instinct. It's the rubric, the scoring history, and the patterns it's surfaced.

That's the difference between a support function that's trusted to operate at scale, and one that's questioned every quarter.

Want to talk about scaling support?

Whether you're building a QA framework from scratch, scaling AI safely, or thinking through how to make support quality defensible to leadership, I'd love to hear from you.

hello@kianpace.me