How an AI Cost-Optimization Routing Layer Quietly Broke Our Product

We cut our AI inference bill by more than half last quarter. Eight weeks of clean engineering work. It was the win the engineering team had been chasing all year. It was also the wrong optimization. Three months later, customer satisfaction was dropping, churn was ticking up, and the cost savings were structurally tied to the quality loss. We had not won. We had just moved the cost somewhere we were not measuring.

This is the pattern I expect to see across production AI deployments over the next six months. The current conversation around AI economics has produced a consensus playbook: route simple queries to cheap models, keep expensive queries on capable models, cut the bill, maintain the quality. Every CFO has seen the math. Every engineering team has built it or is building it.

The math is real. The Pareto trap is also real.

What follows is what I told the team after we ran the post-mortem. It describes the architecture they built, the failure mode they walked into, the detection methodology that would have caught it earlier, and the architectural pattern they should have built instead. It also covers two other deployments I audited after this one, in which the same pattern appeared across different industries. The combined evidence is that cost-optimization routing layers, in the shape the consensus playbook prescribes, are structurally fragile in production.

What We Built

The team operated a customer support AI agent for a SaaS product with roughly 4 million monthly active users. The agent ran on a single capable model — the highest-tier reasoning model in their stack at the time of the build. Inference volume was high enough that the monthly bill from their model provider had grown into six figures and was tracking upward as adoption scaled.

The routing layer was conceptually clean. A small classifier model, custom-trained on roughly 200,000 historical customer-support queries with quality labels, sat in front of the main agent and labeled each incoming query as either “simple” or “complex.” Simple queries were routed to a cheaper model in the same provider family. Complex queries continued to route to the capable model. The classifier itself was a fine-tuned encoder, light enough to run in under 30 milliseconds with negligible cost overhead.
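The routing decision described above can be sketched in a few lines. This is a minimal illustration, not the team's code: the tier names, the `RoutingDecision` type, and the toy `keyword_classifier` are hypothetical, and sending classifier failures to the capable model is an assumed fallback policy.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tier names; the article does not name the actual models.
CHEAP_TIER = "cheap"
CAPABLE_TIER = "capable"


@dataclass(frozen=True)
class RoutingDecision:
    query: str
    label: str  # "simple", "complex", or "unknown" when the classifier failed
    tier: str   # which model tier will serve the query


def route(query: str, classify: Callable[[str], str]) -> RoutingDecision:
    """Label a query and pick a model tier.

    `classify` stands in for the fine-tuned encoder. Anything other than a
    clean "simple" label, including a classifier error, goes to the capable
    tier (an assumed fallback policy).
    """
    try:
        label = classify(query)
    except Exception:
        label = "unknown"
    if label not in ("simple", "complex"):
        label = "unknown"  # reject malformed classifier output
    tier = CHEAP_TIER if label == "simple" else CAPABLE_TIER
    return RoutingDecision(query=query, label=label, tier=tier)


def keyword_classifier(query: str) -> str:
    # Toy stand-in for the trained classifier, for demonstration only.
    simple_markers = ("password", "order", "hours", "charge")
    return "simple" if any(m in query.lower() for m in simple_markers) else "complex"


decision = route("Where is my charge from?", keyword_classifier)
print(decision.tier)  # cheap
```

Logging the full `RoutingDecision` alongside the response identifier is what makes any later per-tier analysis possible.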

The classification taxonomy was built from production observation. Simple queries were what the team had repeatedly seen: account lookups, billing status questions, password resets, order tracking, and hours-of-operation questions. Complex queries were the ones that had historically required nuanced, multi-step reasoning: refund disputes, plan-change trade-offs, integration troubleshooting, and billing-cycle anomalies. The split looked like about 65 percent simple and 35 percent complex across a representative week of production traffic.

The cheaper model the team selected was about a quarter of the per-token cost of the capable model. For the simple queries the classifier sent to it, side-by-side evaluation against the capable model showed equivalent answer quality across 94 percent of a 5,000-query holdout set. The 6 percent gap was visible, but the team judged it acceptable given the cost reduction. They monitored the cheaper model’s quality through their existing evaluation pipeline, which sampled production responses for human review at roughly half a percent of traffic.

The build took eight weeks. Three engineers, one ML practitioner, partial allocation. They added schema validation between the classifier and the downstream models, instrumentation on the routing decision, and a fallback path in case the classifier itself failed. The deployment was gradual: five percent of traffic for the first week, then ten, then twenty-five, then fifty, then full rollout over six weeks. Each rollout step held quality metrics in the green range. Latency stayed within their existing target. Cost decreased in line with the routing share.

By the end of week eight, the monthly inference bill had dropped to roughly 40 percent of its previous level. The engineering team presented the work at the company’s all-hands. The CFO sent a thank-you note to the AI team. Adoption metrics inside the agent stayed flat to slightly positive. The team moved on to the next quarterly priority.
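The link between routing share and the bill can be written as a blended per-token cost. The sketch below is a simplification under stated assumptions: the function name is hypothetical, and it assumes the cheaper model's share of token volume equals the 65 percent query share, which the article does not confirm.

```python
def blended_cost_ratio(cheap_token_share: float, cheap_price_ratio: float) -> float:
    """Bill after routing, as a fraction of the single-model bill.

    cheap_token_share: fraction of token volume served by the cheaper model.
    cheap_price_ratio: cheaper model's per-token price divided by the
        capable model's per-token price.
    """
    if not 0.0 <= cheap_token_share <= 1.0:
        raise ValueError("cheap_token_share must be between 0 and 1")
    if cheap_price_ratio < 0.0:
        raise ValueError("cheap_price_ratio must be non-negative")
    return (1.0 - cheap_token_share) + cheap_token_share * cheap_price_ratio


# Assumption: token share equals the 65 percent query share reported above.
ratio = blended_cost_ratio(cheap_token_share=0.65, cheap_price_ratio=0.25)
print(f"{ratio:.2%} of the previous bill")
```

Query share and token share can differ, so the output of this sketch need not match the roughly 40 percent figure reported above.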

The work was solid. The architecture was reasonable. The monitoring was in place. The team had done what every recent piece on AI cost optimization had recommended. Each individual decision was defensible. The combined system, however, had created a quality gap that the existing measurement architecture could not see.

That gap took three months to surface in business metrics and another month to be correctly attributed. By the time they understood what was happening, four months had elapsed, and the customer impact was already in the room.

What We Measured (and What We Did Not)

The team’s evaluation architecture before the routing layer was built on the assumption that they were running a single model. The quality signal came from three sources: a daily human-review sample of about 200 responses scored for accuracy and helpfulness; an offline regression suite of approximately 12,000 labeled queries run weekly against the production model; and a satisfaction signal from the agent’s in-product feedback widget, where users could rate responses with a thumbs-up or thumbs-down.

When the routing layer went live, the team kept the human-review sample at the same total of about 200 daily reviews, now drawn from both models, but did not stratify it by routing tier. They added the cheaper model to the offline regression suite, where it scored within their acceptance threshold. They left the in-product feedback widget unchanged because it had no way to determine which model had served the response.
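For contrast, a tier-aware review sample might look like the sketch below. It is one possible approach, not something the team ran: the response schema (a dict with a `tier` key), the function name, and the 100/100 quota split are all assumptions.

```python
import random
from collections import defaultdict


def stratified_review_sample(responses, per_tier_quota, seed=0):
    """Pick responses for human review with a fixed quota per routing tier.

    responses: iterable of dicts with at least a "tier" key (hypothetical schema).
    per_tier_quota: dict mapping tier name -> number of reviews to draw.
    """
    rng = random.Random(seed)
    by_tier = defaultdict(list)
    for response in responses:
        by_tier[response["tier"]].append(response)
    sample = {}
    for tier, quota in per_tier_quota.items():
        pool = by_tier.get(tier, [])
        # Draw without replacement; take the whole pool if it is smaller than the quota.
        sample[tier] = rng.sample(pool, min(quota, len(pool)))
    return sample


# Synthetic day of traffic: 65 percent cheap tier, 35 percent capable tier.
responses = [
    {"id": i, "tier": "cheap" if i % 20 < 13 else "capable"} for i in range(1000)
]
sample = stratified_review_sample(responses, {"cheap": 100, "capable": 100})
print({tier: len(items) for tier, items in sample.items()})
```

Keeping the per-tier scores separate, rather than averaging them back together, is the point of the exercise.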

In retrospect, those three measurement choices were the seed of the problem. The aggregate human-review sample showed quality holding at roughly the pre-routing baseline. The offline regression suite showed the cheaper model passing on its sub-tier. The feedback widget aggregate stayed within historical variance. Everything they could see was green.

What they were not seeing showed up at three different layers.

The human-review sample, taken without tier-aware sampling, was effectively a weighted average — with 65 percent of the reviews on the cheap model and 35 percent on the capable model. Because the cheap model was equivalent in the easy cases (the high-volume center of the simple-query distribution), it pulled the aggregate up. Quality issues on the harder edge of the simple-query distribution were diluted to the point of invisibility in the aggregate.
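A small worked example shows how the dilution operates. The slice sizes and quality scores below are purely illustrative and do not come from the deployment; only the 65/35 split between tiers is taken from the text.

```python
def aggregate_quality(slices):
    """Traffic-weighted mean quality.

    slices maps a slice name to (traffic_share, quality_score).
    """
    total_share = sum(share for share, _ in slices.values())
    if total_share <= 0:
        raise ValueError("traffic shares must sum to a positive number")
    return sum(share * quality for share, quality in slices.values()) / total_share


# Illustrative numbers only; the quality scores are not from the deployment.
before = {"capable model, all queries": (1.00, 0.95)}
after = {
    "cheap model, easy simple queries": (0.52, 0.95),
    "cheap model, hard-edge simple queries": (0.13, 0.70),
    "capable model, complex queries": (0.35, 0.95),
}

print(f"aggregate before: {aggregate_quality(before):.4f}")
print(f"aggregate after:  {aggregate_quality(after):.4f}")
print(f"hard-edge slice:  {after['cheap model, hard-edge simple queries'][1]:.4f}")
```

With these made-up numbers, a 25-point drop on the hard-edge slice moves the aggregate by only about three points, which is easy to lose in the day-to-day noise of a 200-review sample.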

The offline regression suite tested both models against curated query sets, but the curation was static. It had been built six months before deployment, when the team had no notion of routing. The suite reflected an idealized distribution rather than the actual production distribution that the cheap model now had to handle. The cheap model passed the static suite but degraded on the live edge.
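One coarse way to notice that a static suite has diverged from live traffic is to compare category frequencies between the two. The sketch below uses total variation distance over intent labels; the label names and the function are hypothetical, and a check like this would only catch drift that shows up in the labels themselves.

```python
from collections import Counter


def total_variation_distance(suite_labels, production_labels):
    """Total variation distance between two empirical label distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    suite = Counter(suite_labels)
    production = Counter(production_labels)
    n_suite = sum(suite.values())
    n_production = sum(production.values())
    if n_suite == 0 or n_production == 0:
        raise ValueError("both label collections must be non-empty")
    categories = set(suite) | set(production)
    # Counter returns 0 for categories missing on one side.
    return 0.5 * sum(
        abs(suite[c] / n_suite - production[c] / n_production) for c in categories
    )


# Hypothetical intent labels for illustration.
suite_labels = ["billing_status"] * 50 + ["password_reset"] * 50
production_labels = ["billing_status"] * 70 + ["password_reset"] * 20 + ["charge_dispute"] * 10
print(total_variation_distance(suite_labels, production_labels))
```

A rising distance over successive weeks would be a prompt to refresh the suite from production samples rather than proof of a quality problem.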

The in-product feedback widget had a structural problem that the team had known about for over a year but had not prioritized fixing. Customer feedback was sparse. A typical session generated zero ratings. Customers thumbed down responses about 3 times per 1,000 interactions, and those thumbs-down votes were skewed toward customers who were already frustrated about something else entirely. The signal-to-noise ratio on the widget was too low to detect any change smaller than a major regression.
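The scale of the signal problem can be estimated with a standard sample-size approximation for comparing two proportions. This is a rough sketch: it assumes independent, unbiased ratings, which the skew described above already violates, and the shift from 3 to 4 thumbs-down per 1,000 is a hypothetical effect size.

```python
import math
from statistics import NormalDist


def interactions_needed(base_rate, new_rate, alpha=0.05, power=0.8):
    """Approximate interactions per comparison period needed to detect a
    change in thumbs-down rate with a two-sided two-proportion z-test."""
    if base_rate == new_rate:
        raise ValueError("rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pooled = (base_rate + new_rate) / 2
    numerator = (
        z_alpha * math.sqrt(2 * pooled * (1 - pooled))
        + z_beta * math.sqrt(base_rate * (1 - base_rate) + new_rate * (1 - new_rate))
    ) ** 2
    return math.ceil(numerator / (new_rate - base_rate) ** 2)


# Hypothetical effect: thumbs-down rising from 3 to 4 per 1,000 interactions.
print(interactions_needed(0.003, 0.004))
```

Under these assumptions the answer is on the order of tens of thousands of interactions in each comparison period, and that is before accounting for the fact that many dissatisfied customers never rate at all.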

None of these failures was specific to the routing layer. They were latent in the measurement architecture, and the routing layer simply exposed them. As long as the system ran on a single model, the measurement gaps did not produce falsely reassuring readings, because there was only one quality distribution to measure. The routing layer introduced two quality distributions, and the existing architecture could not observe them separately.

The quality drift on the cheap-model tier began in week three after the full rollout. By week six, the drift was measurable in the regression suite, but the team interpreted the small regression as model-version drift from their provider rather than routing-related, because they were not segmenting their analysis by tier. By week ten, the cumulative impact on customer satisfaction was evident in product metrics. By week thirteen, churn was tracking measurably above the prior baseline.

That was the point at which the team called me.

What Broke and How We Found It

The diagnosis took two weeks. We reconstructed the routing decisions from the instrumentation log, joined them with the in-product feedback events, and built a per-tier quality view that the team had not previously seen.
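The join behind that per-tier view can be sketched as follows. The record schemas (`response_id`, `tier`, `rating`) and the function name are assumptions for illustration; the actual instrumentation log format is not described in the article.

```python
from collections import defaultdict


def per_tier_feedback(routing_log, feedback_events):
    """Join routing decisions to feedback by response id and summarize per tier.

    routing_log: iterable of {"response_id", "tier"} (hypothetical schema).
    feedback_events: iterable of {"response_id", "rating"}, rating "up" or "down".
    """
    tier_of = {row["response_id"]: row["tier"] for row in routing_log}
    stats = defaultdict(lambda: {"responses": 0, "up": 0, "down": 0})
    for tier in tier_of.values():
        stats[tier]["responses"] += 1
    for event in feedback_events:
        tier = tier_of.get(event["response_id"])
        if tier is None:
            continue  # feedback for a response missing from the routing log
        if event["rating"] in ("up", "down"):
            stats[tier][event["rating"]] += 1
    return {
        tier: {**s, "down_per_1000": 1000 * s["down"] / s["responses"]}
        for tier, s in stats.items()
    }


# Tiny synthetic example.
routing_log = [
    {"response_id": 1, "tier": "cheap"},
    {"response_id": 2, "tier": "cheap"},
    {"response_id": 3, "tier": "capable"},
]
feedback_events = [{"response_id": 2, "rating": "down"}]
print(per_tier_feedback(routing_log, feedback_events))
```

The same join key can carry other outcome signals, such as human-review scores, once the routing decision is attached to every response.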

The pattern surfaced immediately on the cheap-model tier. The cheap model was performing well on roughly 80 percent of the queries the classifier sent to it, and on those queries its quality matched the equivalence finding from the original 5,000-query holdout. But the other 20 percent, a far larger share than the 6 percent gap the holdout had shown, were structurally different from the holdout in ways the classifier could not detect at decision time.

The clearest example was billing queries. The classifier had been trained to recognize patterns such as “where is my charge from” or “I got billed twice” as simple queries, on the assumption that account lookup plus invoice retrieval was a reliable downstream pattern. In holdout testing, this was true. In production, a nontrivial portion of those billing queries hid more complex intents. A user asking “where is my charge from” was sometimes asking about an actual fraudulent charge, sometimes about a delayed reconciliation between two systems, and sometimes about a billing-cycle change they had not been notified about. The capable model had been quietly handling these nested intents correctly because it had the headroom to follow the conversation into the complexity. The cheap model treated each of them as the surface-level intent and answered a question the customer was not actually asking.

The customers who got those wrong answers did not always thumb down. Many of them simply disengaged from the agent and called the support line instead. The thumbs-down signal therefore underrepresented the failure. The cost of the failure was shifted to the human support team, who handled the same query a second time, with the human cost paid out of a different budget. The aggregate effect was that the AI agent’s measured deflection rate remained steady while the actual human-handled support volume began to climb.

The team had not connected the rise in human-handled volume to the routing layer because the two teams operated in different cost centers, and the connection was not visible in any single dashboard.
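A cross-team metric that would have made the connection visible is the share of agent sessions followed shortly by a human-support contact from the same customer, broken out by routing tier. The sketch below is one way to compute it; the record schemas, the function name, and the 48-hour window are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recontact_rate_by_tier(agent_sessions, support_contacts, window=timedelta(hours=48)):
    """Share of agent sessions followed by a human-support contact from the
    same customer within `window`, per routing tier.

    agent_sessions: iterable of {"customer_id", "tier", "ended_at"} (hypothetical).
    support_contacts: iterable of {"customer_id", "opened_at"} (hypothetical).
    """
    contacts = defaultdict(list)
    for contact in support_contacts:
        contacts[contact["customer_id"]].append(contact["opened_at"])
    totals = defaultdict(int)
    recontacts = defaultdict(int)
    for session in agent_sessions:
        totals[session["tier"]] += 1
        end = session["ended_at"]
        followups = contacts.get(session["customer_id"], [])
        if any(end <= opened <= end + window for opened in followups):
            recontacts[session["tier"]] += 1
    return {tier: recontacts[tier] / totals[tier] for tier in totals}


# Tiny synthetic example.
t0 = datetime(2024, 1, 1, 12, 0)
agent_sessions = [
    {"customer_id": "a", "tier": "cheap", "ended_at": t0},
    {"customer_id": "b", "tier": "cheap", "ended_at": t0},
    {"customer_id": "c", "tier": "capable", "ended_at": t0},
]
support_contacts = [{"customer_id": "a", "opened_at": t0 + timedelta(hours=3)}]
print(recontact_rate_by_tier(agent_sessions, support_contacts))
```

A deflection rate that counts a session as resolved when the customer simply leaves will stay flat while this number climbs.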

The cumulative impact on customer satisfaction was harder to measure cleanly, but it eventually showed up in two ways. First, the cohort of customers who interacted with the agent during the routing-layer rollout period showed measurably lower satisfaction scores at the 90-day post-interaction follow-up survey, compared to a baseline cohort from before the rollout. Second, customer retention at the six-month mark trended downward against the prior baseline, with the steepest drop in segments most exposed to the failing routing patterns.

When we ran the numbers together, the inferred cost impact of the quality loss was conservatively four to five times the cost savings from the routing layer. The team had cut inference costs by about $100,000 per month and incurred customer retention and support costs of between $400,000 and $500,000 per month. The routing layer had not reduced costs — it had relocated them into parts of the business that were not instrumented to detect the connection, and by the time the full picture was visible, months of customer trust had already eroded. Cost optimization in AI systems is only as sound as the measurement perimeter it operates within.