Case · B2B SaaS · 2022
A noisy-neighbor outage became a capacity dashboard.
Shape of the problem
A B2B analytics platform was suffering weekly outages whose root cause was always the same in shape and never the same in specifics: a single large tenant running an expensive query at the wrong moment for everyone else. The team's existing approach, per-endpoint rate limits, metered the wrong unit: the cost of a request was dominated by the query plan, not the URL.
What we did
We introduced a cost model at the query-planner layer: before execution, each query was assigned an estimated cost in a single unit (effectively CPU-milliseconds), and each tenant had per-minute and per-hour budgets in that unit. We paired this with a shadow-traffic harness so the platform team could test changes against a replay of real production load, including the pathological tenants. The first week of production use uncovered four previously unknown unbounded query shapes; the second week uncovered none.
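The budget check can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that shipped: `TenantBudget`, `admit`, and `headroom` are hypothetical names, the estimated cost is assumed to arrive from the planner as a number in CPU-milliseconds, and the rolling windows are one plausible reading of "per-minute and per-hour".

```python
from collections import deque
from dataclasses import dataclass, field

MINUTE = 60.0
HOUR = 3600.0


@dataclass
class TenantBudget:
    """Hypothetical rolling cost budgets for one tenant, in CPU-milliseconds."""

    per_minute: float
    per_hour: float
    # (timestamp_seconds, cost) pairs for admitted queries, oldest first.
    _spend: deque = field(default_factory=deque, init=False, repr=False)

    def _used(self, now: float, window: float) -> float:
        # Total cost admitted within the last `window` seconds.
        return sum(cost for ts, cost in self._spend if now - ts < window)

    def headroom(self, now: float) -> float:
        # Drop entries that have aged out of the longest window.
        while self._spend and now - self._spend[0][0] >= HOUR:
            self._spend.popleft()
        # The tighter of the two budgets decides how much cost is left.
        return min(
            self.per_minute - self._used(now, MINUTE),
            self.per_hour - self._used(now, HOUR),
        )

    def admit(self, estimated_cost: float, now: float) -> bool:
        # Reject before execution if the planner's estimate exceeds headroom.
        if estimated_cost > self.headroom(now):
            return False
        self._spend.append((now, estimated_cost))
        return True


budget = TenantBudget(per_minute=1000, per_hour=5000)
print(budget.admit(600, now=0.0))  # fits within the minute budget
print(budget.admit(600, now=1.0))  # would exceed the minute budget
print(budget.headroom(now=1.0))    # cost still available to this tenant
```

The point of the design is that the limit is expressed in the same unit as the damage: a cheap query and an expensive one hitting the same URL draw down the budget by different amounts. The same `headroom` figure, computed per tenant, is the kind of number a capacity view could display.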
Outcome
- Large-tenant incidents: ~1 per week → none observed across the following six months.
- Dashboard produced for the customer-success team showing per-tenant headroom. This turned out to be unexpectedly useful for renewals.
- Shadow-traffic harness kept and extended by the client team.