Case · B2B SaaS · 2022

A noisy-neighbor outage became a capacity dashboard.

Shape of the problem

A B2B analytics platform was suffering weekly outages whose root cause was always the same in shape and never the same in specifics: a single large tenant running an expensive query at the wrong moment for everyone else. The team's existing approach, per-endpoint rate limits, metered the wrong unit: the cost of a request was dominated by the query plan, not the URL.

What we did

We introduced a cost model at the query-planner layer: before execution, each query was assigned an estimated cost in a single unit (effectively CPU-milliseconds), and each tenant had per-minute and per-hour budgets in that unit. We paired this with a shadow-traffic harness so the platform team could test changes against a replay of real production load, including the pathological tenants. The first week of production use uncovered four previously unknown unbounded query shapes; the second week uncovered none.
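To make the budget idea concrete, here is a minimal sketch of admission control in cost units. It is illustrative only, not the client's implementation: the names `Budget`, `TenantBudgets`, `admit`, and `headroom` are hypothetical, the estimated cost is assumed to arrive from the planner as a plain number, and fixed (not sliding) windows are assumed for simplicity.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Hypothetical per-tenant limits, in cost units (roughly CPU-ms)."""
    per_minute: float
    per_hour: float


class TenantBudgets:
    """Fixed-window cost accounting per tenant (illustrative sketch)."""

    WINDOWS = {"minute": 60, "hour": 3600}  # window length in seconds

    def __init__(self, budgets, clock=time.time):
        self._budgets = budgets  # tenant name -> Budget
        self._clock = clock      # injectable so tests can control time
        # (tenant, window name, window index) -> cost spent so far.
        # Old windows are never pruned here; a real system would expire them.
        self._spent = {}

    def _key(self, tenant, name):
        return (tenant, name, int(self._clock() // self.WINDOWS[name]))

    def _limit(self, tenant, name):
        budget = self._budgets[tenant]
        return budget.per_minute if name == "minute" else budget.per_hour

    def headroom(self, tenant):
        """Remaining cost units in the current minute and hour windows."""
        return {
            name: self._limit(tenant, name)
            - self._spent.get(self._key(tenant, name), 0.0)
            for name in self.WINDOWS
        }

    def admit(self, tenant, estimated_cost):
        """Charge the query to every window, or reject it if any window
        lacks room. Nothing is charged on rejection."""
        remaining = self.headroom(tenant)
        if any(estimated_cost > room for room in remaining.values()):
            return False
        for name in self.WINDOWS:
            key = self._key(tenant, name)
            self._spent[key] = self._spent.get(key, 0.0) + estimated_cost
        return True
```

The same `headroom` figure that gates a query is the kind of number a per-tenant capacity view could display. Fixed windows permit a burst across a window boundary; a sliding window or token bucket would smooth that out at the cost of slightly more bookkeeping.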

Outcome

Large-tenant incidents: ~1 per week → none observed across the following six months.
Dashboard produced for the customer-success team showing per-tenant headroom. This turned out to be unexpectedly useful for renewals.
Shadow-traffic harness kept and extended by the client team.
