Essay AI Deployment 8 minute read

AI in support is a containment problem, not a deflection one.

Why the question that determines whether an AI support deployment works isn't what the AI should handle, but what it must never touch.

Most AI support projects start with the wrong question

"What can the AI handle?" gets asked first, in every kick-off meeting, by every stakeholder. It's the question that shapes the project plan, the metrics, and the slide deck the COO will show the board six months later. It is also the question that produces the worst deployments.

The right question, the one most teams avoid asking, is: "What must the AI never touch?"

I deployed Intercom's Fin AI at a B2B SaaS company. The autonomous resolution rate climbed to a number worth being proud of. The CSAT held at 100% from launch. Out-of-hours coverage opened up without out-of-hours staff. Recruiters and founders read those numbers and assume the project was about teaching the AI to handle more. It wasn't. It was about teaching it to handle less, and to recognise when "less" meant "nothing at all."

This is what I mean when I say AI in support is a containment problem, not a deflection one.

The default approach optimises for the wrong variable

The dominant discourse on AI in support is deflection-led. Higher autonomous resolution. Lower contact volume. Less time per agent. These metrics get reported in board decks. They drive vendor procurement. They are also, in the first six months of a serious deployment, the worst variables to optimise.

Optimising for deflection rewards a specific bad behaviour: pushing the AI into conversations it shouldn't be in. You can grow the resolution rate by lowering the bar for what counts as resolved. You can grow it further by removing the option for the customer to escalate. You can grow it further still by allowing the AI to attempt every question rather than handing off the ones it can't be confident about. None of these moves serve the customer. All of them flatter the report.

The teams who succeed with AI in support invert the question. They define the categories of conversation the AI is forbidden from entering before they define the ones it's permitted to handle. The deflection rate emerges from what's left over. It is a consequence, not a goal.

This sounds obvious when you write it down. It is not how most deployments are run.

What containment looked like in practice

Before the AI handled a single live conversation, I had written four things, none of which would make it into a launch slide.

A comprehensive escalation policy

Defining exactly what triggers a human handoff. Not "if it can't answer", which is vague and rewards confidence over accuracy. Concrete triggers: any mention of cancellation, any reference to a specific commercial term, any sentiment signal above a threshold, any question that touches the legal or security review process.
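As an illustration only, concrete triggers of this kind can be written as explicit checks rather than left to the model's judgement. The sketch below is not the configuration used in the deployment: the phrase lists, the `negative_sentiment` score, and the 0.6 threshold are all hypothetical, and substring matching is far cruder than anything you would ship.

```python
from dataclasses import dataclass

# Hypothetical trigger phrases; a real policy would be tuned on historical conversations.
CANCELLATION_TERMS = ("cancel", "cancellation", "terminate my account")
COMMERCIAL_TERMS = ("pricing", "discount", "contract", "renewal", "invoice")
REVIEW_TERMS = ("legal review", "security review", "penetration test")
SENTIMENT_THRESHOLD = 0.6  # assumed scale: 0.0 calm .. 1.0 highly negative


@dataclass
class Message:
    text: str
    negative_sentiment: float  # assumed to come from some upstream classifier


def escalation_reasons(msg: Message) -> list[str]:
    """Return every trigger the message hits; an empty list means the AI may answer."""
    text = msg.text.lower()
    reasons = []
    if any(term in text for term in CANCELLATION_TERMS):
        reasons.append("cancellation")
    if any(term in text for term in COMMERCIAL_TERMS):
        reasons.append("commercial_term")
    if msg.negative_sentiment > SENTIMENT_THRESHOLD:
        reasons.append("sentiment")
    if any(term in text for term in REVIEW_TERMS):
        reasons.append("legal_or_security_review")
    return reasons
```

The function returns the reasons rather than a bare yes or no, so a later audit can see which rule fired and how often.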

A tone and length policy

Short answers for short questions. No filler. No pretending to feelings the AI doesn't have. Match the customer's register rather than defaulting to a synthetic cheerfulness.

A scoped audience rollout

The first cohort of users who would see the AI weren't a random slice of traffic. They were a single audience segment whose questions were the most contained and the most answerable. Other audience types didn't see the AI at all for the first phase.

A knowledge base rewrite

Between 60 and 100 internal snippets and articles were rewritten before launch. Not because they were wrong for humans, but because they were ambiguous for a model. A sentence a colleague reads correctly because they have context can be misread by an AI that doesn't. The rewrite was the unglamorous, time-consuming part. It was also the part that determined whether the rest of the project worked.

None of these are exciting. They don't appear in the case study slide. They are the actual work.

The dual-side users problem

The clearest example of why containment matters is a problem specific to the product I was supporting, but the principle applies anywhere AI handles customer queries in a B2B context.

Many users existed on both sides of the platform. Suppliers were completing assessments for their clients to review. Clients were reviewing the assessments their suppliers had completed. The terminology overlapped almost completely. A user asking "how do I review this?" could be a supplier reviewing their own draft, or a client reviewing the supplier's submission. The two questions sounded identical and required entirely different answers.

The AI couldn't reliably tell them apart, because the natural language couldn't reliably tell them apart. So the AI didn't get to answer questions in that category. They went to a human, every time, regardless of how confident the model claimed to be.

This was not a flaw in the AI. It was a feature of having defined the boundary correctly. A deployment optimised for deflection rate would have handed those questions to the AI and accepted some percentage of wrong answers as the cost of higher numbers. A deployment optimised for containment routes them away from the AI entirely, accepts the lower resolution rate, and protects the trust that everyone notices later.

The trade is real. You give up some deflection. You keep customer trust intact. Trust compounds. Deflection rate doesn't.
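The rule for the dual-side category can be sketched as a routing check in which a blocked category wins before model confidence is even considered. This is a minimal illustration, not the actual setup: the category label, the confidence value, and the 0.8 threshold are hypothetical names and numbers.

```python
# Hypothetical category labels; "dual_side_review" stands in for the ambiguous
# supplier-versus-client questions described above.
BLOCKED_CATEGORIES = {"dual_side_review"}
MIN_CONFIDENCE = 0.8  # assumed threshold for categories the AI is allowed to enter


def route(category: str, model_confidence: float) -> str:
    """Decide who answers. Blocked categories go to a human regardless of confidence."""
    if category in BLOCKED_CATEGORIES:
        return "human"
    if model_confidence < MIN_CONFIDENCE:
        return "human"
    return "ai"
```

The order of the checks is the point: `route("dual_side_review", 0.99)` still returns `"human"`, because the boundary is defined by category, not by how sure the model claims to be.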

Before any customer saw it

The other unglamorous part is that I spent the better part of two months batch-testing the AI against historical conversations before any real user saw a single response. Batch testing isn't sexy. It doesn't generate launch metrics. It generated something more useful: a clear map of where the AI was confident and wrong, where it was unconfident and right, and where it was confident and right. The first category is the dangerous one. That's where containment rules get written.
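One way to build that map, assuming each replayed answer carries some confidence signal and a reviewer's verdict on whether it was right, is to bucket every replay and count the confident-and-wrong ones per topic. The field names and the 0.8 cutoff below are hypothetical; the essay's own process may have looked quite different.

```python
from collections import Counter
from dataclasses import dataclass

CONFIDENCE_CUTOFF = 0.8  # assumed; how confidence was judged is not stated above


@dataclass
class Replay:
    category: str      # hypothetical topic label for the historical conversation
    confidence: float  # model confidence signal, assumed to be in 0..1
    correct: bool      # reviewer's judgement of the AI's answer


def bucket(r: Replay) -> str:
    """Label a replay as confident/unconfident and right/wrong."""
    sure = "confident" if r.confidence >= CONFIDENCE_CUTOFF else "unconfident"
    right = "right" if r.correct else "wrong"
    return f"{sure}_{right}"


def containment_candidates(replays: list[Replay]) -> Counter:
    """Count confident-and-wrong answers per category: where rules get written."""
    return Counter(r.category for r in replays if bucket(r) == "confident_wrong")
```

Categories that top this count are the first candidates for an escalation rule.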

By the time the AI saw a live conversation, I had a strong enough sense of its failure modes to know what to route around it. The launch wasn't a gamble. It was the controlled version of a question I'd already answered.

This is the part most teams skip. They want to ship and measure. They get a deflection rate quickly and a quality problem slowly. The quality problem is harder to clean up than the deflection rate is to build.

After launch, the audit

Containment isn't a one-off project. It's an ongoing discipline. After launch, I audited the AI's conversations continuously, looking for the three failure modes that matter: false confidence, drift from the tone policy, and edge cases the original scoping had missed.

When an edge case appeared, the response wasn't to expand what the AI could do. It was to write a new escalation rule. The default move was always to route more away from the AI when in doubt, never to route more toward it. The autonomous resolution rate would have been higher if I'd done the opposite. The CSAT wouldn't have held.

This is the discipline that gets dropped first when an engineering team takes over the deployment from the function that originally scoped it. The metrics start getting optimised upward. Customer trust starts eroding. The two trends are linked, and the team responsible for the second number is usually not the team adjusting the first.

"The instinct in AI support deployment is to chase what the AI can do. The discipline is to define what it can't."

Where AI actually earns its place

With the boundaries set, the AI was free to be useful where it was actually good. The categories where it earns its keep:

Single-context queries with unambiguous answers. Onboarding questions. Documentation questions where the answer exists, fully, in one place. Status questions. Anything where the right response is the same regardless of which user is asking.
Out-of-hours coverage for these same categories. The alternative isn't a human responding at 2am. The alternative is the customer waiting until morning. The AI doesn't beat a great human agent. It beats no agent at all, which is the actual comparison most of the time.
Knowledge surfacing where the alternative is the customer hunting through documentation. The AI is faster than search, and faster than asking a colleague.

The categories where it isn't good, regardless of how the vendor markets it:

Anything emotionally weighted. Frustration, complaints, churn risk, any sentiment that needs a human to read.
Anything where the same words mean different things in different contexts. Dual-side users are one example. Multi-tenant products, role-based permissions, region-specific terminology, all the same problem.
Anything where the cost of being wrong outweighs the benefit of being fast. Commercial terms, contractual commitments, anything compliance-adjacent. Speed isn't worth much if the wrong answer creates a legal problem.

The metric I'd actually track

Resolution rate is the metric most reported because it's the easiest to define, the easiest to chart, and the easiest to brag about. It is also the most gameable.

The metric I'd track instead is harder to define cleanly, which is part of why nobody tracks it: what proportion of conversations the AI entered ended with a confident customer, regardless of who finished the conversation?

If the AI handles 70% of conversations and 95% of customers come away with what they actually needed, that's a strong deployment. If the AI handles 95% of conversations but 30% of customers come away confused, frustrated, or with a half-answer they then escalate elsewhere, the deflection rate is a vanity number. You've offloaded the work but pushed the cost onto the customer.
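The two numbers can be computed side by side to show how they diverge. This is a sketch under stated assumptions: the `customer_confident` flag is hypothetical and would have to come from a survey or a manual review, which is exactly the expensive part.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    ai_entered: bool          # the AI took at least one turn
    resolved_by_ai: bool      # closed without a human (what resolution rate counts)
    customer_confident: bool  # hypothetical flag from a survey or manual review


def resolution_rate(convs: list[Conversation]) -> float:
    """Share of all conversations the AI closed on its own."""
    if not convs:
        return 0.0
    return sum(c.resolved_by_ai for c in convs) / len(convs)


def confident_outcome_rate(convs: list[Conversation]) -> float:
    """Share of AI-entered conversations that ended with a confident customer,
    regardless of whether the AI or a human finished them."""
    entered = [c for c in convs if c.ai_entered]
    if not entered:
        return 0.0
    return sum(c.customer_confident for c in entered) / len(entered)


# Two made-up deployments of 100 conversations each, mirroring the example above.
contained = [Conversation(True, i < 70, i < 95) for i in range(100)]
deflection_led = [Conversation(True, i < 95, i < 70) for i in range(100)]
```

On this made-up data the contained deployment scores lower on `resolution_rate` and higher on `confident_outcome_rate`, and the deflection-led one does the reverse.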

The right metric is the one most expensive to measure honestly. That's usually a sign it's the right metric.

What this means for the humans

A well-contained AI deployment makes the rest of the support function more valuable, not less. The humans handle the conversations the AI was never going to handle well. Those conversations are the ones that build commercial trust, rescue accounts at risk of churn, and convert escalations into advocacy.

If you've used AI to remove humans from the work that matters, you've optimised the wrong variable. The point of deploying AI well isn't to need fewer humans. It's to free the humans you have for the work where they actually compound your retention. I've seen teams take the opposite path and end up with a deflection number their leadership applauds and a customer base their commercial team is quietly losing.

The test

Here's the question I'd ask any team about to deploy AI in support, or any team partway through a deployment that isn't going as well as the slide deck suggests:

Before you measured what the AI could handle, did you define what it couldn't?

If the answer is no, what you've built isn't a support function. It's a deflection machine. Those are not the same thing, even when the dashboard makes them look similar.

Deploying AI in support, or worried about one that isn't landing?

I've scoped a deployment that held 100% CSAT from launch, and I've seen what goes wrong when containment gets traded for deflection. If you'd like to talk through what good looks like for your specific stage, I'd love to hear from you.

hello@kianpace.me
