A friend recently asked how to set better capacity testing goals for their tech team. We agreed we needed to get more specific about what “capacity testing” means.
This is a plausible traffic volume based on prior data and forecasts. I encourage this number to come from cross-functional consensus from Eng, Data, Marketing, Product, Sales, etc. Express it as a rate like reqs/sec or orders/min.
If the expected traffic is too low, the service could crash from insufficient capacity. But if it’s too high, you’ve wasted scarce engineering effort building capacity you don’t need (analogous to excess inventory in a supply chain). Site downtime is usually more costly than wasted engineering effort. The virtue of highlighting wasted engineering effort is to recognize the marginal effort and opportunity cost for engineers to support greater capacity.
This is the multiplier or fudge factor to give yourself breathing room. I’d suggest 20-50x for early-stage startups that don’t know when they’ll go viral, 5-20x for growth stage businesses, and <5x for mature businesses with robust prior data and tightly-managed sales/marketing plans. At Glossier, we currently use a 10x safety factor. We were bitten in 2018 with a 5x safety factor and insufficiently detailed traffic forecast.
This is what you’re aiming for.
capacity target = expected traffic * safety factor
So assuming expected traffic of 500 req/sec, and a safety factor of 10x, your capacity target is 5,000 req/sec.
This is the load that your engineers have proven your system can handle during their load tests. Keep scaling and removing bottlenecks until your demonstrated capacity exceeds the capacity target.
Pro tip: run your load tests for several minutes (we do an hour) to “soak” your infrastructure (kudos to Rajas Patil for introducing this idea). This can reveal bottlenecks that don’t show up in quick tests. For example, at Glossier, the data replication from our customer-facing DB to our BI data warehouse was significantly delayed during a 60-minute soak tests, but we wouldn’t have noticed during a quick 5-minute test. By detecting the replication delay early, we had time to mitigate it.