February 14, 2022 — 12 minutes to read — Tags: observability, programming, systems, featured

A recent theme of my work has been elevating metrics that focus and align small teams towards healthy business outcomes.

I view this through the lens of systems thinking, with stocks, flows, and feedback loops as the primary mental model (see Thinking In Systems by Donella Meadows and Will Larson’s intro). Examples of such systems abound in organizations. How you hire and onboard employees, develop new products and features, do financial planning and accounting, and interact with customers can all be modeled as systems. In software engineering, our products comprise distributed systems of services and components.

This post is aimed at software engineers, managers, and adjacent teams like marketing and finance who seek to rigorously understand, optimize, and scale the systems they work in. The teams I’ve worked in sometimes struggle to orient themselves in a dizzying volume of system data. I’ve found six distinct metrics that give a thorough understanding of the state of a system (observability), and illuminate effective paths towards operational excellence.

The six metrics are:

  1. Throughput
  2. Waste
  3. Lead time (aka latency)
  4. Utilization
  5. Quality
  6. Queue depth

Note that the first four metrics correspond to Google’s Four Golden Signals for monitoring distributed systems. While those are a good start, adding quality and queue depth makes the list more robust. And to show how each metric is broadly useful, I’ll apply it to three example systems: a hiring funnel, an e-commerce site, and a hospital.

Let’s dive in.

1. Throughput

Throughput is the rate of completed units coming out the end of the flow in a particular time window, such as e-comm orders per day or new hires per month.

When managing systems as a constraint satisfaction problem, you’ll generally seek to maximize throughput while keeping other system metrics within tolerable limits.
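Here’s a minimal sketch of measuring throughput, assuming you have a list of completion timestamps (say, when each order was delivered):

```python
from collections import Counter
from datetime import datetime

# Hypothetical completion timestamps, e.g. when each order was delivered.
completed_at = [
    datetime(2022, 2, 1, 9, 30),
    datetime(2022, 2, 1, 14, 5),
    datetime(2022, 2, 2, 11, 45),
]

# Throughput: completed units per day.
throughput_per_day = Counter(ts.date() for ts in completed_at)
for day, count in sorted(throughput_per_day.items()):
    print(f"{day}: {count} orders delivered")
```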

Final versus proximate throughput

It’s useful to distinguish the final output of a system from the intermediate or proximate outputs at each step along the way.

For example, if you’re just starting to build a hiring process, it’s difficult to know how many new hires you can expect the existing team to make—hires are a lagging indicator, and you don’t have good priors. You have more control over the early steps in a hiring process, like how many candidates you source, how many pitches you try, or channels you use to find candidates.

After you source candidates, you may find that throughput is slow at another point in your hiring funnel, like passing an on-site interview.

Improving the final throughput of a system is the result of effective changes to proximate throughput at each step in your system.

Example metrics

Hiring team: Final output is the count of new hires making productive contributions. Proximate outputs may be the rate at which you source candidates for review, how many hiring manager screens you’ve completed, how many candidates passed an on-site round, or how many offers you’ve extended.

E-comm site: Final output is the count of orders successfully delivered to customers. Proximate outputs are the number of visitors viewing your products, visitors making a purchase, and the number of order payments successfully charged and ready to ship.

Hospital: Final output is the count of treated patients discharged from the hospital. Proximate outputs are the count of patients seen by medical staff, the count of diagnostic tests performed, and the count of treatments administered.

If improving the proximate throughput does not improve the final throughput, you’ve actually created…

2. Waste

Every system has waste. This is a measure of the undesirable byproducts created by the system.

It’s impossible to completely eliminate all waste. Rather, it’s practical to keep it below a tolerable threshold. Reducing a particular source of waste often becomes exponentially more difficult as you approach zero. If a web app has a lot of bugs, eliminating the first 80% of bugs likely takes a similar level of effort as eliminating the next 15%, and then the next 3%.

As Bill Gates said of eradicating polio:

Fighting polio today is much harder – and different – than fighting it [in] the 80’s and 90’s. The last 40 cases are far more difficult than the first 400,000.

Catastrophic waste

Some waste is generated in a steady, predictable way, like CO2 emissions from a car or household trash. And sometimes it comes as a large, sudden shock that’s hard to predict, like a nuclear meltdown, class action lawsuit, a site outage (like Facebook’s multi-hour outage), or security breach.

When quantifying the wasteful byproducts of your system, consider both the familiar, frequent waste, and the rare potential catastrophes.

Example metrics

Hiring team: Predictable: Rejected candidates and declined offers. Catastrophic: a hiring discrimination lawsuit.

E-comm site: Predictable: Occasional HTTP errors, oversold inventory, returned orders, dissatisfied customers. Catastrophic: complete site outage, security breach, product recalls.

Hospital: Predictable: Ineffective treatments, inconclusive tests. Catastrophic: Larry Nassar’s malfeasance, Thalidomide birth defects scandal.

3. Lead time (or latency)

Lead time is how long it takes for work to go through your system. For a particular unit of work, subtract the time it entered your system from the time it left your system to get the lead time.

For example, if I sign a contract to buy a new car today, and it takes 3 months for the car to be manufactured and delivered, the lead time is 3 months.

Lead time is longer than the cycle time, which is how long a resource in your system is actively working on the unit. The car manufacturer may only spend a few hours assembling any particular car, but the 3 month lead time includes getting each component to the plant and ready for assembly, transit time, quality inspections, etc.

Lead time = cycle time + wait time

Another example: say our system is modeling how a team of developers perform code reviews. If a developer submits new code for review at 9am, then at 11am a teammate spends 15 minutes reviewing the code and approves it, the wait time is 2 hours, the cycle time is 15 minutes, and the lead time is 2h15m.

The lead time is measured at the very edges of the system.
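Here’s a minimal sketch of those three measurements, using the illustrative timestamps from the code review example:

```python
from datetime import datetime

submitted_at = datetime(2022, 2, 14, 9, 0)    # developer submits code for review
review_start = datetime(2022, 2, 14, 11, 0)   # teammate starts the review
review_end = datetime(2022, 2, 14, 11, 15)    # teammate approves

wait_time = review_start - submitted_at    # 2:00:00 sitting in the queue
cycle_time = review_end - review_start     # 0:15:00 of active review work
lead_time = review_end - submitted_at      # 2:15:00, measured at the edges

assert lead_time == wait_time + cycle_time
```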

Lead time is often called latency in systems where the consumer’s attention is engaged the whole time, such as when users wait for a UI to respond to input (e.g. in Dan Luu’s post about keyboard latency).

Lead time has a statistical distribution; you’ll probably want to graph a histogram of your lead times to best understand your system. And when setting lead time targets, you’ll focus on particular statistics like the median, 90th percentile, or mean. In network systems, power law distributions are common. That is, the majority of requests are fast, and there’s a long tail of a few very slow requests. Or you may have a bimodal distribution, such as a read-through cache system where warm-cache requests are fast, but cache-miss requests are slow.
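Here’s a minimal sketch of summarizing such a distribution with Python’s standard library (the lead times are made up to show a long tail):

```python
import statistics

# Hypothetical lead times in seconds; most are fast, a few are very slow.
lead_times = [0.12, 0.15, 0.14, 0.13, 0.18, 0.16, 2.4, 0.14, 0.15, 3.1]

median = statistics.median(lead_times)
p90 = statistics.quantiles(lead_times, n=10)[-1]  # roughly the 90th percentile
mean = statistics.mean(lead_times)

print(f"median={median:.2f}s  p90={p90:.2f}s  mean={mean:.2f}s")
# The slow tail (2.4s, 3.1s) drags the mean and p90 well above the median.
```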

Reducing the variance of lead time is helpful to more easily understand how your system will behave under increasing throughput. Timeouts are a common technique for reducing lead time variance by adding a strict upper limit.

Reducing cycle times specifically means your system operates more efficiently. It means you’ve eliminated waste, and can therefore increase throughput, reduce utilization, or reduce capacity.

Like waste, reducing cycle times becomes exponentially more difficult.

From a customer’s perspective, latency is a quality measure: customers are more satisfied the faster the web site loads, and the quicker the package arrives.

Example metrics

Hiring team: Time from opening a job requisition to having the new hire start. You can also track the lead time of each step through the hiring funnel: time from a candidate submitting a job application to getting a response from the company; time from a candidate completing all the interviews to getting an offer.

E-comm site: Page load time, time from placing an order until it’s confirmed, and time from placing an order to having it delivered.

Hospital: Time from a patient seeking treatment to being discharged.

4. Utilization

Utilization is the percentage of time that a particular component in your system is busy. For example, if a cashier in a grocery store takes 2 minutes to check out a customer, and 15 customers check out in an hour, the cashier is 50% utilized during that hour. Or if a computer’s CPU has 4 seconds worth of computations to perform in a 10 second period, we’d say the CPU is 40% utilized, and 60% idle.
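As a minimal sketch, utilization is just busy time divided by the observation window; here are the cashier and CPU examples in code:

```python
def utilization(busy_seconds: float, window_seconds: float) -> float:
    """Fraction of the window a resource spends doing work."""
    return busy_seconds / window_seconds

# Cashier: 15 checkouts at 120 seconds each, within a one-hour window.
print(utilization(15 * 120, 3600))  # 0.5 -> 50% utilized

# CPU: 4 seconds of computation in a 10-second window.
print(utilization(4, 10))           # 0.4 -> 40% utilized, 60% idle
```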

Utilization has subtle and counterintuitive implications for your system. I used to think that 100% utilization represents a perfectly optimized and efficient system, and that less than 100% utilization implies idle waste.

But no! Any component that’s 100% utilized is the bottleneck that constrains the throughput of your system. Work becomes backlogged behind the constrained resource. A bit of algebra shows some harmful implications of an over-utilized system.

Let’s reconsider the example of a grocery store where the sole cashier takes 2 minutes (120 seconds) to check out each customer. Suppose new customers arrive faster than the cashier can check them out, say every 90 seconds. The line behind the register grows longer (that’s queue depth). The wait time for each customer to check out grows longer and longer.

You’ll eventually run out of space in the store with shoppers waiting to check out, assuming your frustrated customers don’t leave on their own. You’ll have to either increase capacity by adding another cashier, or throttle arrivals by turning away customers until the store is less crowded.
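Here’s a back-of-the-envelope sketch of that grocery store, using the arrival and service rates above:

```python
ARRIVAL_INTERVAL = 90   # a new customer arrives every 90 seconds
SERVICE_TIME = 120      # the sole cashier takes 120 seconds per checkout
WINDOW = 3600           # observe for one hour

arrivals = WINDOW // ARRIVAL_INTERVAL    # 40 customers arrive
max_checkouts = WINDOW // SERVICE_TIME   # the cashier can serve at most 30

print(f"the line grows by ~{arrivals - max_checkouts} customers per hour")
# The cashier is pinned at 100% utilization and the queue grows without bound.
```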

Increasing capacity and throttling the arrival rate are the only ways to decrease utilization without re-engineering your system.

Kingman’s formula (specifically the ρ / (1 − ρ) term) from queuing theory shows that as utilization approaches 100%, wait time approaches infinity!

The Phoenix Project explains it well:

The wait time for a given resource is the percentage that resource is busy, divided by the percentage that resource is idle. So, if a resource is fifty percent utilized, the wait time is 50/50, or 1 unit. If the resource is ninety percent utilized, the wait time is 90/10, or nine times longer.

This explains why some product dev backlogs are depressingly long.
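Here’s a minimal sketch of that rule of thumb, computing the ρ / (1 − ρ) wait-time multiplier at a few utilization levels:

```python
# Wait-time multiplier from the rho / (1 - rho) term: as utilization
# approaches 100%, waiting explodes.
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    multiplier = rho / (1 - rho)
    print(f"{rho:.0%} utilized -> wait multiplier {multiplier:g}x")
# 50% -> 1x, 80% -> 4x, 90% -> 9x, 95% -> 19x, 99% -> 99x
```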

This lens for understanding utilization also explains why a few slow database queries can crash a web app. If the DB CPU is 100% utilized, new queries queue up as they wait for CPU, causing each response to take ever longer to generate. Eventually, clients time out (like our grocery shoppers leaving in frustration), and visitors get an error page.

So if 100% utilization is bad, what’s good? My general rule is to aim for 80% utilization for most critical resources (see this post about the 80% rule). This doesn’t necessarily mean that the resource is idle the rest of the time—rather it’s doing work that can easily be preempted and temporarily delayed, like a chef wiping down their workspace or a development team working on code cleanup tasks. This flex time ensures your system can gracefully adapt to disruptions. You can increase or decrease the target utilization based on how steady or variable the inputs are.

See Goldratt’s Five Focusing Steps from his Theory of Constraints to make the best use of your system’s bottlenecks.

Example metrics

Hiring team: % of employee time in interviews, or number of interviews per week per interviewer. % of time your conference rooms are occupied

E-comm site: Compute: % of CPU, network IO, disk IO, and memory that each app tier uses. Fulfillment: % of inventory space currently occupied, % of time that pickers and packing stations are in use

Hospital: % of hours per week an operating room is in use, % of beds that are in use, % of medical devices in use, # of hours per week that doctors are with patients.

5. Quality

Quality is the measure of desirable and undesirable traits in the output of your system.

For example, a high quality car may have great fuel efficiency, high resale value, and run for many miles with routine maintenance. Or it may have luxury features that signal the owner’s status, and fast acceleration and handling that are enjoyable to drive. It may have brand associations that resonate with the owner’s identity.

From a manufacturer’s perspective, a high quality car uses parts that are reliably sourced, is easy to assemble, and has a large and predictable market demand.

Some quality metrics are easily quantified, like fuel efficiency and resale value; others are fuzzy and qualitative, like brand associations and status signaling.

Another example: a high-quality e-commerce homepage has a low bounce rate, and increases visitors’ interest in buying your products. It makes cold leads warmer, and converts warm leads. It’s fast and reliable.

Note that latency is a quality measure. Consumers would prefer getting things sooner rather than later. But latency has important implications for the rest of your system around capacity and throughput, so it’s worth observing separately.

So how should you measure quality? If you’re generally satisfied with the quality of your system’s outputs, I recommend monitoring a handful of quality metrics to ensure you maintain your expected level of quality, occasionally ratcheting up your quality bar, and intervening quickly when quality slips.

For systems producing low-quality outputs, I recommend picking a single quantitative quality metric you expect to unlock the most throughput, improve it, and then iterate to focus on another quality metric until you reach a healthy level of quality.

Example metrics

Hiring team: The degree to which the new hire increases the team’s output, morale, and adaptive capacity

E-comm site: Degree to which the customer is satisfied by the purchase. CSAT and NPS are popular measures. You can also consider customer lifetime value (LTV), and the conversion rate of each step in the customer acquisition funnel.

Hospital: The degree to which the treatment mitigates symptoms, adds life-years, and does not cause harmful side effects.

6. Queue depth

Queue depth (also called stock level) is a count of how many units are in a particular state of your system.

Queues form behind the bottleneck in your system, so observing queues is an easy way to pinpoint utilization issues. In the grocery store analogy, the queue of customers with full shopping carts waiting to check out is an early and obvious sign that the cashiers are the bottleneck.

Sudden shocks to a system also manifest as rapid changes in queue depth, such as unassembled parts queuing up behind a broken component in an assembly line, or a long line of customers at a bakery when a busload of tourists arrives.

Just as a queue that’s too large indicates over-utilization, a queue that’s too small indicates under-utilization and risk of supply shocks.

A software development team with nothing in its backlog is likely under-utilized, and keeping a strategic stockpile of resources makes your system more resilient to unreliable supply.

Therefore, aim to keep queue depths within a healthy range that prevents under-utilization and fragility to supply shocks, while also avoiding the high carrying costs and long wait times of excessive queue depth.
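Here’s a minimal sketch of watching queue depths against a healthy band (the queue names and thresholds are hypothetical):

```python
# Hypothetical healthy ranges: (too shallow, too deep) thresholds per queue.
HEALTHY_RANGE = {
    "applied_not_screened": (5, 50),
    "orders_awaiting_shipment": (10, 500),
}

def check_queue(name: str, depth: int) -> str:
    low, high = HEALTHY_RANGE[name]
    if depth < low:
        return f"{name}={depth}: under-utilized, fragile to supply shocks"
    if depth > high:
        return f"{name}={depth}: backlog building behind a bottleneck"
    return f"{name}={depth}: healthy"

print(check_queue("applied_not_screened", 2))
print(check_queue("orders_awaiting_shipment", 740))
```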

Example metrics

Hiring team: The number of candidates between each step in the hiring funnel. E.g. applied but not screened, screened but waiting for an on-site interview, interviewed on-site but awaiting an offer, offer extended but not yet accepted

E-comm site: Inventory ready to be sold in the warehouse, inventory in transit from suppliers, and the number of orders in each step of your fulfillment process: waiting for payment, waiting to be picked, waiting to be shipped

Hospital: Number of patients in the waiting room, waiting for test results, or waiting for an operation.

Conclusion

With those six metrics, you have a robust toolkit to understand, optimize, and scale the systems you work with.

These metrics encourage practitioners to keep their system models simple. You may argue that it’s an oversimplification. That’s certainly true for some systems—these metrics are not sufficient for understanding complex adaptive systems with emergent behaviors and nonlinear dynamics, like markets and ecosystems.

Donella Meadows neatly highlights the limits of my approach in her article about leverage points to intervene in a system. This post focuses on the simple steps: changing parameters and feedback loops. The higher steps involve changing the rules, power structures, and paradigms around a system—a much larger task!

But all models are incomplete, and I find this one useful. Progress towards observability comes from reframing big hairy challenges as a sequence of simple steps.

