April 26, 2022 · 1 minute to read — Tags: documentation, elsewhere

I have a new interview on the DX Fieldnotes blog: How to Champion Investments to Improve Documentation. Brook Perry at DX did a great job structuring the interview around practical advice.

Some highlights:

  • documentation is a means to an end. Be clear about the goals (or ‘job to be done’) that you want from your docs, e.g., onboard team members faster, reduce downtime, establish more consistent practices, and build cross-functional awareness and alignment.
  • make the authoring+maintaining experience incredibly easy.
  • have a ‘one-stop-shop’ for the discovery experience.
  • when choosing what to document first, use two techniques: ‘follow the pain’ (write the most-sought missing docs), and ‘stop the bleeding’ (ensure new projects don’t exacerbate knowledge silos).

April 18, 2022 · 6 minutes to read — Tags: leadership, mental models, systems

Humans constantly construct narratives to make sense of a messy reality. If a new product launch went well, we could say it was because of the team’s good planning and coordinated execution, or one person’s heroic last-minute effort. Did another team miss their targets? We could say that’s because the higher-ups keep changing their minds, or the team lacked the necessary skills. We can construct wildly different stories to understand the same situation.

Effective leaders construct compelling narratives for their team to understand the present reality, and inspire them to achieve a shared vision. It’s hard to say any particular narrative is right or wrong. Rather, let’s ask: which narrative is useful and evocative? Let’s break it down.

In this post, I’ll describe 3 distinct perspectives or “lenses” to use when building a narrative and vision: 1️⃣ rational decisions, 2️⃣ organizational process, and 3️⃣ people and politics. Many leaders tend to lean heavily on one particular perspective. I’ve found myself becoming a more resilient leader by being mindful of all 3 perspectives, and incorporating them into the narratives I use to lead.

This post draws from Graham Allison’s Essence of Decision, the Core Strengths SDI leadership assessment, and Ron Westrum’s 3 types of organizational culture (original research paper). See the appendix for more details.

Perspective 1: Rational decisions

The rational decisions lens assumes that all the appropriate information was gathered and evaluated. It uses a dispassionate, objective analysis focused on unambiguous, quantified results and expected outcomes. We presume people behave like Homo economicus.

The advantage of this perspective is that its conclusions seem logically justified—the optimal outcome may be deduced from the present facts. The disadvantage is that it’s a woefully incomplete way of understanding how humans actually behave.

Perspective 2: Organizational process

This perspective focuses on the small set of things that a particular team, organization, or institution is capable of doing well.

Unlike the open-ended possibilities of the ‘Rational decisions’ perspective, the organizational process lens strongly prefers well-known options.

Any initiative that fits into an organization’s established processes and leverages existing competencies has a greater chance of success than one that requires novel processes and decisions.

For example, a manager may have an easy time promoting an employee during their organization’s semiannual performance review period, but have difficulty doing an off-cycle ad hoc promotion. When the employee gets a promotion may have more to do with the quirks of a bureaucratic process than the particular moment they merit it.

An engineering team composed of, say, Frontend and Backend engineers may struggle to hire and onboard a new specialty, like a Site Reliability or Security Engineer. The novelty of the role increases the risks and the amount of effort required.

The advantage of this perspective is that it focuses on leveraging familiar solutions. The disadvantage is that those familiar solutions may become increasingly outdated and insufficient for new challenges. There’s a thin line between “strategic laziness” and “narrow-mindedness”.

Perspective 3: People and politics

This perspective focuses on the zero-sum game of political power and influence. The only way to gain power or influence is to take it from someone else. If you don’t please the people who are the source of your influence (your employees, peers, manager, customers, shareholders, etc.), you’ll quickly lose it.

A ‘people and politics’ narrative focuses on individuals or groups with rising or waning influence or recognition. Who gets their way, and who doesn’t? Whose life gets easier, and whose life gets harder?

Many people, including myself, are uncomfortable dwelling in this lens. The zero-sum logic can breed unhealthy competition, even combativeness and Machiavellian manipulation rather than collaborative, constructive dialogue.

But humans are social, competitive creatures. Our brains inevitably focus on whether the people we like are rewarded, and the people we don’t like get what they deserve. People around you are constantly evaluating your reputation, and weighing whether they should hitch their fate to yours. Leaders neglect this perspective at their peril.

An advantage of this perspective over Rational decisions is that it better captures how people actually behave. But the disadvantage is that it does not, by itself, reveal effective strategies. A leader who baldly declares, “Our strategy is do whatever most improves my reputation” would often be abandoned as a sociopath. Leaders must incorporate other perspectives.


Okay, let’s see how these 3 perspectives can apply to the same hypothetical story.

Say your software team is building a new service, and an early decision is what programming language to use for the project. After some discussion, TypeScript emerges as the language of choice. Why?

The rational decisions narrative could be that the leaders of the team carefully considered the suitability of various programming language attributes for the project, e.g. typed vs. dynamic languages, security and operability concerns, the ecosystem of tooling and shared libraries, the current expertise of the team, the learning curve to train new developers, and the size of the community. The leaders wrote up a decision doc weighing all these factors and ultimately chose TypeScript because, e.g., type safety would benefit the project and ease maintenance, and the language had a large and growing community and a rich ecosystem of developer tooling.

The organizational process narrative says that the tech team only had the infrastructure tooling and expertise to support, say, Ruby and TypeScript. Choosing anything besides Ruby or TypeScript would have required a lot of persuading and training that the project leaders didn’t think was worthwhile. The platform/infra team recently announced a lot of new tooling to support TypeScript services. The team chose TypeScript because they perceived organizational momentum towards it.

And finally, the people and politics narrative is that the tech lead of the project had spent years maintaining a poorly architected Ruby service. They were eager to try something different, and had recently cajoled the platform/infra team to better support their new favorite language: TypeScript. When their manager asked them to lead the new project, they accepted on the condition that the manager would support their choice of language. The manager agreed since they did not have a good alternative for tech lead.

Personally, the rational decision narrative is what I want to believe, but the last 2 feel more realistic.

One final point: I think Camille Fournier really nailed these perspectives in her post about build vs buy decisions. The ‘rational decisions’ lens says teams should ‘buy’ more often than they actually do. To understand the discrepancy between theory and practice, she highlights some institutional and political factors that encourage building:

our whole style of teaching computer science is first-principles based, which encourages the default to build ourselves. … Companies reward people who create new things … which creates a pressure to build in order to grow your career.

The lesson for me is that none of these perspectives are right or wrong—they’re all incomplete. And that I’m well-served by bearing in mind all 3 when understanding or explaining decisions.


Appendix

In my recent research about organizational effectiveness, I was struck by the congruence of 3 models for understanding how leaders make decisions:

  1. Ron Westrum’s 3 types of organizational culture: Pathological, Bureaucratic, and Generative
  2. Graham Allison’s Essence of Decision examines the Cuban Missile Crisis through the lenses of Governmental Politics, Organizational Behavior, and a Rational Actor.
  3. The Core Strengths leadership assessment charts personal styles on 3 axes: People, Process, and Performance

Here’s how I organized them into analogous lenses:

  • Lens 1 (Rational decisions): Graham Allison’s Rational Actor; Ron Westrum’s Generative culture; Core Strengths SDI’s Performance
  • Lens 2 (Organizational process): Graham Allison’s Organizational Process; Ron Westrum’s Bureaucratic culture; Core Strengths SDI’s Process
  • Lens 3 (People and politics): Graham Allison’s Bureaucratic Politics; Ron Westrum’s Pathological culture; Core Strengths SDI’s People

March 7, 2022 · 3 minutes to read — Tags: productivity

I love a good low-code productivity hack. Here are some of my favorite iOS shortcuts that make my life slightly easier.

Easy after-school pickups

At my daughter’s elementary after-school program, the pickup protocol is for the parent to send a text message to the program director, who then brings the kid to the front door.

I noticed a regular routine: I arrive in the school parking lot, take out my phone, find the program director in the Messages app, and copy+paste my previous message along the lines of “Hello, I’m here to pick up [my child’s name].”

Let’s automate that!

I created a location-based iOS Shortcut: when I arrive at the school between 3-6pm on weekdays, send that text message to the program director. Now when I park at the school, I pull out my phone, tap a shortcut notification on the lock screen, and the text is on its way!

This is inspired by Shawn Blanc’s gym pass shortcut.

Office music

I frequently listen to music on AirPlay speakers in my home office. But I don’t like the temptation of unlocking my phone and tapping around to find my playlist. It’s too tempting to start checking notifications and social media.

So to avoid unlocking my phone, I got these NFC stickers and stuck one on my desk.

Then I made an iOS shortcut: when I tap my phone to the NFC tag, set the playback destination to my office speakers, and play my favorite playlist or app. I change the playlist or app in the shortcut when the mood strikes. Recently Lofi Girl and Endel soundscapes are in heavy rotation.

Set Slack status messages the hard way

I have several shortcuts that change my Slack status. When I start a walking workout, a shortcut automatically updates my Slack status to “AFK for a walk”.

When I enable “Do Not Disturb” focus mode in iOS/macOS, another shortcut automatically sets my Slack status to “DND - writing”.

It took a lot of trial and error (and some yak shaving) to get the Slack API permissions configured correctly, so I’ll put the steps here.

  1. Create a new Slack app.
  2. In your app, set a required redirect URL. Note you can use any URL, say google.com. You’re the only one who will use it, and it doesn’t affect the functionality.
  3. In the “User Token Scopes” section, add the users.profile:write scope so the app can update your profile.
  4. Install the app to your workspace (or ask your workspace admin to do it for you).
  5. Now you should see a section in your app called “OAuth Tokens for Your Workspace”. Copy the User OAuth token that starts with xoxp- for use in the shortcut.

With the token in hand, you can now use the Get contents of URL shortcut action to make the Slack API request to users.profile.set.

Here’s a screenshot of what a successful shortcut looks like. Note that the first 3 steps are to compute the expiration time as a unix timestamp (seconds since 1970)—i.e. ‘30 minutes from now’. And the Authorization header is Bearer xoxp-XXXX.

Slack status shortcut
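If you’d like to test the request outside of Shortcuts, here’s a minimal TypeScript sketch of the same users.profile.set call (the token, status text, and emoji are placeholders):

```typescript
// Minimal sketch of the Slack users.profile.set request the shortcut makes.
// SLACK_TOKEN is a placeholder for your xoxp- user token.
const SLACK_TOKEN = "xoxp-XXXX";

async function setSlackStatus(text: string, emoji: string, minutes: number): Promise<void> {
  // status_expiration is a unix timestamp (seconds since 1970), i.e. 'N minutes from now'.
  const expiration = Math.floor(Date.now() / 1000) + minutes * 60;
  const res = await fetch("https://slack.com/api/users.profile.set", {
    method: "POST",
    headers: {
      "Content-Type": "application/json; charset=utf-8",
      Authorization: `Bearer ${SLACK_TOKEN}`,
    },
    body: JSON.stringify({
      profile: { status_text: text, status_emoji: emoji, status_expiration: expiration },
    }),
  });
  const data = await res.json();
  if (!data.ok) throw new Error(`Slack API error: ${data.error}`);
}

setSlackStatus("AFK for a walk", ":walking:", 30).catch(console.error);
```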

Happy hacking! And please let me know on Twitter if you enjoy these shortcuts.

February 19, 2022 · 1 minute to read — Tags: mental models, leadership

A useful mental model I recently learned is enumerating the ‘left ditch’ and ‘right ditch’ solutions.

Imagine we’re steering a vehicle along a winding road and each turn is a new challenge. We want to avoid under-correcting to end up in the left ditch, and over-correcting to end up in the right ditch.

For example, say a software component has recently become unreliable, with frequent bugs and performance issues. A ‘left ditch’ under-correction may be to repeatedly add one-off “band-aid” bug fixes for each issue. A ‘right ditch’ over-correction may be a complete rewrite of the component. A ‘just right’ solution may be introducing a new design pattern to the component that improves correctness, scalability, maintainability, etc., and gradually migrating the entire component to that new pattern.

At Glossier, we called this finding the ‘stage appropriate’ solution, and it’s a generalization of the spectrum of involvement. I’ve found it helpful to explicitly discuss what the ‘left ditch’ and ‘right ditch’ solutions would be, to open up the middle path for a ‘just right’ solution.

February 14, 2022 · 12 minutes to read — Tags: observability, programming, systems

A recent theme of my work has been elevating metrics that focus and align small teams towards healthy business outcomes.

I view this through the lens of systems thinking, with stocks, flows, and feedback loops as the primary mental model (see Thinking In Systems by Donella Meadows and Will Larson’s intro). Examples of such systems abound in organizations. How you hire and onboard employees, develop new products and features, do financial planning and accounting, and interact with customers can all be modeled as systems. In software engineering, our products comprise distributed systems of services and components.

This post is aimed at software engineers, managers, and adjacent teams like marketing and finance who seek to rigorously understand, optimize, and scale the systems they work in. The teams I’ve worked in sometimes struggle to orient themselves in a dizzying volume of system data. I’ve found 6 distinct metrics that give a thorough understanding of the state of a system (observability), and illuminate effective paths towards operational excellence.

The six metrics are:

  1. Throughput
  2. Waste
  3. Lead time (aka latency)
  4. Utilization
  5. Quality
  6. Queue depth

Note that the first four metrics correspond to Google’s Four Golden Signals for monitoring distributed systems. While those are a good start, adding quality and queue depth makes the list more robust. And to show how each metric is broadly useful, I’ll apply each one to 3 example systems: a hiring funnel, an e-commerce site, and a hospital.

Let’s dive in.

1. Throughput

Throughput is the rate of completed units coming out the end of the flow in a particular time window, such as e-comm orders per day or new hires per month.

When managing systems as a constraint satisfaction problem, you’ll generally seek to maximize throughput while keeping the other system metrics within tolerable limits.

Final versus proximate throughput

It’s useful to distinguish the final output of a system versus the intermediate or proximate outputs each step along the way.

For example, if you’re just starting to build a hiring process, it’s difficult to know how many new hires you can expect the existing team to make—hires are a lagging indicator, and you don’t have good priors. You have more control over the early steps in a hiring process, like how many candidates you source, how many pitches you try, or channels you use to find candidates.

After you source candidates, you may find that throughput is slow at another point in your hiring funnel, like passing an on-site interview.

Improving the final throughput of a system is the result of effective changes to proximate throughput at each step in your system.

Example metrics

Hiring team: Final output is the count of new hires making productive contributions. Proximate outputs may be the rate at which you source candidates for review, or how many hiring manager screens you completed, how many candidates passed an on-site round, or how many offers you’ve extended.

E-comm site: Final output is the count of orders successfully delivered to customers. Proximate outputs are the number of visitors viewing your products, visitors making a purchase, and the number of order payments successfully charged and ready to ship.

Hospital: Final output is the count of treated patients discharged from the hospital. Proximate outputs are the count of patients seen by medical staff, the count of diagnostic tests performed, and the count of treatments administered.
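To make proximate versus final throughput concrete, here’s a tiny sketch computing the stage-to-stage rates for a hiring funnel (the stages and counts are made up for illustration):

```typescript
// Sketch: proximate vs. final throughput in a hiring funnel. Counts are illustrative.
const funnel = [
  { stage: "candidates sourced", count: 200 },
  { stage: "screens completed", count: 60 },
  { stage: "on-sites passed", count: 15 },
  { stage: "offers extended", count: 8 },
  { stage: "hires", count: 5 }, // the final throughput
];

for (let i = 1; i < funnel.length; i++) {
  const prev = funnel[i - 1];
  const curr = funnel[i];
  const rate = ((curr.count / prev.count) * 100).toFixed(0);
  console.log(`${prev.stage} -> ${curr.stage}: ${curr.count} (${rate}% of prior stage)`);
}
```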

If improving the proximate throughput does not improve the final throughput, you’ve actually created…

2. Waste

Every system has waste. This is a measure of the undesirable byproducts created by the system.

It’s impossible to completely eliminate all waste. Rather, it’s practical to keep it below a tolerable threshold. Reducing a particular source of waste often becomes exponentially more difficult. If a web app has a lot of bugs, eliminating the first 80% of bugs likely takes a similar level of effort as eliminating the next 15%, and then the next 3%.

As Bill Gates said of eradicating polio:

Fighting polio today is much harder – and different – than fighting it [in] the 80’s and 90’s. The last 40 cases are far more difficult than the first 400,000.

Catastrophic waste

Some waste is generated in a steady, predictable way, like CO2 emissions from a car or household trash. And sometimes it comes as a large, sudden shock that’s hard to predict, like a nuclear meltdown, a class action lawsuit, a site outage (like Facebook’s multi-hour outage), or a security breach.

When quantifying the wasteful byproducts of your system, consider both the familiar, frequent waste, and the rare potential catastrophes.

Example metrics

Hiring team: Predictable: Rejected candidates and declined offers. Catastrophic: a hiring discrimination lawsuit.

E-comm site: Predictable: Occasional HTTP errors, oversold inventory, returned orders, dissatisfied customers. Catastrophic: complete site outage, security breach, product recalls.

Hospital: Predictable: Ineffective treatments, inconclusive tests. Catastrophic: Larry Nassar’s malfeasance, Thalidomide birth defects scandal.

3. Lead time (or latency)

Lead time is how long it takes for work to go through your system. For a particular unit of work, subtract the time it entered your system from the time it left your system to get the lead time.

For example, if I sign a contract to buy a new car today, and it takes 3 months for the car to be manufactured and delivered, the lead time is 3 months.

Lead time is longer than the cycle time, which is how long a resource in your system is actively working on the unit. The car manufacturer may only spend a few hours assembling any particular car, but the 3 month lead time includes getting each component to the plant and ready for assembly, transit time, quality inspections, etc.

Lead time = cycle time + wait time

Another example: say our system is modeling how a team of developers perform code reviews. If a developer submits new code for review at 9am, then at 11am a teammate spends 15 minutes reviewing the code and approves it, the wait time is 2 hours, the cycle time is 15 minutes, and the lead time is 2h15m.

The lead time is measured at the very edges of the system.

Lead time is often called latency in systems where the consumer’s attention is engaged the whole time, such as when users wait for a UI to respond to input (e.g. in Dan Luu’s post about keyboard latency).

Lead time has a statistical distribution; you’ll probably want to graph a histogram of your lead time distribution to best understand your system. And when setting lead time targets, you’ll focus on particular statistics like the median, 90th percentile, or mean. In network systems, power law distributions are common. That is, the majority of requests are fast, and there’s a long tail of a few very slow requests. Or you may have a bimodal distribution, such as a read-through cache system where warm-cache requests are fast, but cache-miss requests are slow.
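As a sketch of that analysis, here’s one way to compute the median and 90th-percentile lead times from per-unit entry/exit timestamps (nearest-rank is one of several reasonable percentile definitions):

```typescript
// Sketch: lead time statistics from per-unit entry/exit timestamps.
interface UnitTimestamps {
  enteredAt: number; // ms since epoch
  exitedAt: number;  // ms since epoch
}

// Nearest-rank percentile over an ascending sorted array.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

function leadTimeStats(units: UnitTimestamps[]) {
  const leadTimes = units
    .map((u) => u.exitedAt - u.enteredAt) // lead time: exit minus entry
    .sort((a, b) => a - b);
  return {
    median: percentile(leadTimes, 50),
    p90: percentile(leadTimes, 90),
  };
}
```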

Reducing the variance of lead time is helpful to more easily understand how your system will behave under increasing throughput. Timeouts are a common technique for reducing lead time variance by adding a strict upper limit.

Reducing cycle times specifically means your system operates more efficiently. It means you’ve eliminated waste, and can therefore increase throughput, reduce utilization, or reduce capacity.

Like waste, reducing cycle times becomes exponentially more difficult.

From a customer’s perspective, latency is a quality measure: customers are more satisfied the faster the web site loads, and the quicker the package arrives.

Example metrics

Hiring team: Time from opening a job requisition to having the new hire start. You can also track the lead time of each step through the hiring funnel: time from a candidate submitting a job application to getting a response from the company; time from a candidate completing all the interviews to getting an offer.

E-comm site: Page load time, time from placing an order until it’s confirmed, and time from placing an order to having the order delivered.

Hospital: Time from a patient seeking treatment to being discharged.

4. Utilization

Utilization is the percentage of time that a particular component in your system is busy. For example, if a cashier in a grocery store takes 2 minutes to check out a customer, and 15 customers check out in an hour, the cashier is 50% utilized during that hour. Or if a computer’s CPU has 4 seconds worth of computations to perform in a 10 second period, we’d say the CPU is 40% utilized, and 60% idle.

Utilization has subtle and counterintuitive implications for your system. I used to think that 100% utilization represents a perfectly optimized and efficient system, and that less than 100% utilization implies idle waste.

But no! Any component that’s 100% utilized is the bottleneck that constrains the throughput of your system. Work becomes backlogged behind the constrained resource. A bit of algebra shows some harmful implications of an over-utilized system.

Let’s reconsider the example of a grocery store where the sole cashier takes 2 minutes (120 seconds) to check out each customer. Suppose new customers arrive faster than the cashier can check them out, say every 90 seconds. The line behind the register grows longer (that’s queue depth). The wait time for each customer to check out grows longer and longer.

You’ll eventually run out of space in the store with shoppers waiting to check out, assuming your frustrated customers don’t leave on their own. You’ll have to either increase capacity by adding another cashier, or throttle arrivals by turning away customers until the store is less crowded.

Increasing capacity and throttling the arrival rate are the only ways to decrease utilization without re-engineering your system.

Kingman’s formula (specifically the ρ/(1-ρ) term) from queuing theory shows that as utilization approaches 100%, wait time approaches infinity!

The Phoenix Project explains it well:

The wait time for a given resource is the percentage that resource is busy, divided by the percentage that resource is idle. So, if a resource is fifty percent utilized, the wait time is 50/50, or 1 unit. If the resource is ninety percent utilized, the wait time is 90/10, or nine times longer.

This explains why some product dev backlogs are depressingly long.

This lens for understanding utilization also explains why a few slow database queries can crash a web app. If the DB CPU is 100% utilized, new queries queue up as they wait for CPU, causing each response to take ever longer to generate. Eventually, clients time out (like our grocery shoppers leaving in frustration), and visitors get an error page.

So if 100% utilization is bad, what’s good? My general rule is to aim for 80% utilization for most critical resources (see this post about the 80% rule). This doesn’t necessarily mean that the resource is idle the rest of the time—rather it’s doing work that can easily be preempted and temporarily delayed, like a chef wiping down their workspace or a development team working on code cleanup tasks. This flex time ensures your system can gracefully adapt to disruptions. You can increase or decrease the target utilization based on how steady or variable the inputs are.
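To make the ρ/(1-ρ) intuition concrete, here’s a tiny sketch that tabulates the wait-time multiplier at different utilization levels:

```typescript
// Wait-time multiplier: percentage busy divided by percentage idle, i.e. rho / (1 - rho).
function waitMultiplier(utilization: number): number {
  if (utilization >= 1) return Infinity; // at 100% utilization, the queue grows without bound
  return utilization / (1 - utilization);
}

for (const rho of [0.5, 0.8, 0.9, 0.95, 0.99]) {
  console.log(`${(rho * 100).toFixed(0)}% utilized -> ${waitMultiplier(rho).toFixed(1)}x wait`);
}
// 50% -> 1.0x, 80% -> 4.0x, 90% -> 9.0x, 95% -> 19.0x, 99% -> 99.0x
```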

See Goldratt’s Five Focusing Steps from his Theory of Constraints to make the best use of your system’s bottlenecks.

Example metrics

Hiring team: % of employee time in interviews, or number of interviews per week per interviewer. % of time your conference rooms are occupied

E-comm site: Compute: % of CPU, network IO, disk IO, and memory that each app tier uses. Fulfillment: % of inventory space currently occupied, % of time that pickers and packing stations are in use

Hospital: % of hours per week an operating room is in use, % of beds that are in use, % of medical devices in use, # of hours per week that doctors are with patients.

5. Quality

Quality is the measure of desirable and undesirable traits in the output of your system.

For example, a high quality car may have great fuel efficiency, high resale value, and run for many miles with routine maintenance. Or it may have luxury features that signal the owner’s status, and fast acceleration and handling that are enjoyable to drive. It may have brand associations that resonate with the owner’s identity.

From a manufacturer’s perspective, a high quality car uses parts that are reliably sourced, is easy to assemble, and has a large and predictable market demand.

Some quality metrics are easily quantified, like fuel efficiency and resale value, or they may be fuzzy and qualitative, like brand associations and status signaling.

Another example: a high-quality e-commerce homepage has a low bounce rate, and increases visitors’ interest in buying your products. It makes cold leads warmer, and converts warm leads. It’s fast and reliable.

Note that latency is a quality measure. Consumers would prefer getting things sooner rather than later. But latency has important implications for the rest of your system around capacity and throughput, so it’s worth observing separately.

So how should you measure quality? If you’re generally satisfied with the quality of your system’s outputs, I recommend monitoring a handful of quality metrics to ensure you maintain your expected level of quality. Occasionally ratchet up your quality bar, and intervene quickly when quality slips.

For systems producing low-quality outputs, I recommend picking a single quantitative quality metric you expect to unlock the most throughput, improve it, and then iterate to focus on another quality metric until you reach a healthy level of quality.

Example metrics

Hiring team: The degree to which the new hire increases the team’s output, morale, and adaptive capacity

E-comm site: Degree to which the customer is satisfied by the purchase. CSAT and NPS are popular measures. You can also consider customer lifetime value (LTV), and the conversion rate of each step in the customer acquisition funnel.

Hospital: The degree to which the treatment mitigates symptoms, adds life-years, and does not cause harmful side effects.

6. Queue depth

Queue depth (also called stock level) is a count of how many units are in a particular state of your system.

Queues form behind the bottleneck in your system, so observing queues is an easy way to pinpoint utilization issues. In the grocery store analogy, the queue of customers with full shopping carts waiting to check out is an early and obvious sign that the cashiers are the bottleneck.

Sudden shocks to a system also manifest as rapid changes in queue depth, such as unassembled parts queuing up behind a broken component in an assembly line, or a long line of customers at a bakery when a busload of tourists arrives.

Just as a queue that’s too large indicates over-utilization, a queue that’s too small indicates under-utilization and risk of supply shocks.

A software development team with nothing in its backlog is likely under-utilized. And keeping a strategic stockpile of resources makes your system more resilient to unreliable supply.

Therefore, aim to keep queue depths within a healthy range that prevents under-utilization and fragility to supply shocks, while also avoiding the high carrying costs and long wait times of excessive queue depth.
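As a sketch of what that looks like in practice, you could flag any queue whose depth leaves a healthy band (the queue names and thresholds below are illustrative):

```typescript
// Sketch: flag queue depths outside a healthy band. Names and thresholds are illustrative.
interface QueueGauge {
  name: string;
  depth: number;
  min: number; // below this: under-utilization or fragility to supply shocks
  max: number; // above this: long wait times behind a likely bottleneck
}

function checkQueueDepth(q: QueueGauge): string {
  if (q.depth < q.min) return `${q.name}: depth ${q.depth} < ${q.min}, possible under-utilization`;
  if (q.depth > q.max) return `${q.name}: depth ${q.depth} > ${q.max}, likely bottleneck at the next stage`;
  return `${q.name}: depth ${q.depth} is healthy`;
}

console.log(checkQueueDepth({ name: "orders waiting to be picked", depth: 142, min: 5, max: 100 }));
```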

Example metrics

Hiring team: The number of candidates between each step in the hiring funnel. E.g. applied but not screened, screened but waiting for an on-site interview, interviewed on-site but awaiting an offer, offer extended but not yet accepted

E-comm site: Inventory ready to be sold in the warehouse, inventory in transit from suppliers, and the number of orders in each step of your fulfillment process: waiting for payment, waiting to be picked, waiting to be shipped.

Hospital: number of patients in the waiting room, waiting for test results, or waiting for an operation.

Conclusion

With those six metrics, you have a robust toolkit to understand, optimize, and scale the systems you work with.

These metrics encourage practitioners to keep their system models simple. You may argue that it’s an oversimplification. That’s certainly true for some systems—these metrics are not sufficient for understanding complex adaptive systems with emergent behaviors and nonlinear dynamics, like markets and ecosystems.

Donella Meadows neatly highlights the limits of my approach in her article about leverage points to intervene in a system. This post focuses on the lower leverage points: changing parameters and feedback loops. The higher leverage points involve changing the rules, power structures, and paradigms around a system—a much larger task!

But all models are incomplete, and I find this one useful. Progress towards observability comes from reframing big hairy challenges as a sequence of simple steps.

November 24, 2021 · 2 minutes to read — Tags: personal

This is the eve of Thanksgiving in the US, an occasion to reflect on what we’re grateful for. This year, I’d like to share my gratitude publicly.

First and foremost, I’m thankful that my close family and loved ones are healthy, and we generally feel safe and secure. This has been the case virtually all my life. But it wasn’t the case last year, so I appreciate it all the more this year.


I’m thankful for good advice to view life’s challenges through a useful lens. Early in the year, I shared with an older friend that I felt disoriented about where my career and personal relationships were going. “Of course, you’re at the age to feel that way,” said my friend.

They went on to explain that I’d established a career, and I’d gotten through the early years of parenting. Now my spouse and I were finding ourselves with more time and energy than we knew what to do with. They said, “Now is the time to rewire your relationships for the next few decades.”

I love that metaphor: rewire relationships for the next decades. It unlocked fantastic questions. How do I want to treat my spouse, and how do I want to be treated, from now into retirement? What’s the relationship I want with our kids? How do I want to interact with my community? What’s my professional identity and relationship to work?

This year, I’m grateful for those focusing questions, and the effort I’ve put towards answering them.


I’m thankful for learning, both about myself and others. My personal therapy has revealed that I’m often unable to acknowledge my own negative emotions. My beginning to do so feels revelatory. It feels like uncovering a straight path through what seemed like a labyrinth.

I began reading a book each week (and updating my StoryGraph profile). It feels like compound interest for developing interesting ideas.

I’m thankful to my colleagues at Glossier, from whom I’ve learned so much: hiring and interviewing, planning and coordinating large initiatives, building trust and alignment, and knowing when to be scrappy and pragmatic. Today is the first Black Friday weekend in four years that I’m not primary on-call. It’s a great team, and I’m humbled to be part of it.

I’m grateful for new opportunities. In December, I’m starting a new job that feels like a profound blessing. I think it’s the flow state I’ve been craving, challenging me to the brink of my abilities. But that’s a story for another day.

Happy Thanksgiving.

November 7, 2021 · 4 minutes to read — Tags: epistemology, mental models, programming

One of the more powerful concepts I’ve found is considering which type of analysis to apply to the challenge at hand. It’s been especially useful when coaching a software team to put in more and different effort to find the root cause of a failure, or occasionally to save effort when over-analyzing a situation.

So here’s my survey of ‘types of analysis’, inspired by (and expanding on) a section of Accelerate, chapter 12.

  • Forensic: Confidence of claims is low; new evidence could drastically change conclusions. General uses: paleontology, criminal investigations. Software development uses: security incident investigations.
  • Predictive: Confidence of claims is modest; based on suppositions about the past and present contexts. General uses: meteorology, government policies. Software development uses: roadmap planning, project selection, scoping, and sequencing.
  • Causal: Confidence of claims is higher; this was definitely true in the past. General uses: clinical trials. Software development uses: A/B tests of user behavior.
  • Mechanistic: Confidence of claims is very high; claims will be true forever. General uses: rocket science, material science. Software development uses: unit testing, debugging.

Forensic analysis

Forensic analysis seeks to understand an incident when it’s impossible to fully recreate the incident.

Forensic analysis gathers evidence, then uses abductive reasoning to find the most reasonable explanation that fits the evidence. When criminal detectives and attorneys argue a theory of a crime, or paleontologists describe how dinosaurs behaved based on fossil records, they’re using abductive reasoning. Any claims based on forensic analysis and abductive reasoning are contingent on the available evidence. It’s always possible that new evidence comes to light that dramatically changes our understanding of an incident.

In software development, we use forensic analysis to investigate security incidents, or when we search through logs to figure out when or how something occurred.

Forensic analysis is often the least effort to perform, and the least certain or generalizable in its conclusions.

Predictive analysis

Predictive analysis uses historical data and inductive reasoning to make claims about the future in an uncontrolled environment.

When a meteorologist makes a weather forecast, or when the CBO forecasts what a new tax policy will cost over the next decade, they’re using predictive analysis. When lawmakers consider the pros and cons of passing a law, they (ideally) use predictive analysis.

Because these analyses are uncontrolled, it’s impossible to repeat an experiment. That is, if you make a decision based on forecasts, you can’t know precisely what would have happened if you’d made a different decision. Like, what would society be like if the government had passed a different law years ago? It’s a counterfactual. The best we can do is speculate.

In software development, we use predictive analysis when creating roadmaps, scoping, and sequencing our projects. We can only speculate about how things would turn out if we chose different projects.

Causal analysis

Causal analysis uses controlled experiments to see both sides of a decision (i.e., a treatment and control group). This lets us make much stronger claims about the efficacy of a treatment than just predictive analysis.

Causal analysis is most well known in medicine for clinical drug trials. By comparing the outcomes of subjects in the control and treatment groups, we can be fairly certain that differences in outcomes were caused by the treatment.

In software development, we use causal analysis when we do A/B tests. Is the user more likely to click the button if we present information this way, or that way?

Causal analysis is not generalizable. Are the subjects in an experiment representative of the broader population? You need to re-do the experiment to find out. Say an e-commerce company finds that adding a big flashing “NEW” icon next to products increases sales. How long does that effect last? You’d have to do another A/B test to find out. Would it work as well for a different brand? Gotta do another A/B test.

Causal analysis is clever in that the mechanism by which the treatment works is irrelevant to the experiment. It doesn’t let us say why a new drug works, or why users click the button more; just that we have some degree of confidence that the treatment causes some desirable outcome.

Mechanistic analysis

Mechanistic analysis supports the strongest claims about why and how a system works.

It requires a thorough understanding of the mechanisms that make the system work. You can repeatedly apply the well-modeled system to new situations. It relies on well-known scientific theories that are taken as axioms, and uses deductive reasoning to derive useful applications.

Some examples: When 19th century inventors sent electrical current through a carbon filament and discovered that the filament glowed, they discovered a physical system for producing artificial light. You can re-use that system: every carbon filament glows in predictable ways when you run electricity through it.

When rocket scientists calculate how much fuel they’ll need to launch a payload into orbit, or when civil engineers analyze whether a bridge design will support a given load, they use mechanistic analysis. The answer is the solution to a mathematical equation.

They don’t build many bridges, then drive cars over them to see if they stay up. That kind of predictive analysis is insufficient for the challenge.

In software engineering, mechanistic analysis is best used when testing and debugging. If we can reliably reproduce a bug, or reliably verify the correctness of a system, it’s strong evidence that we’ve isolated a property of our system, and we’re well on our way to knowing how to modify the system to fix the bug.

Mechanistic analysis requires the most effort, but yields the highest confidence in our understanding of a system.

Conclusion

Enumerating these 4 types of analysis from low effort / weak claims to high effort / strong claims has helped me coach teams on a pragmatic depth of understanding. To decide if we should invest effort in a product change, we’ll use predictive analysis. To learn exactly how much the product change improved the user experience, we’ll use causal analysis. And to ensure we thoroughly fix any bugs along the way, we’ll use mechanistic analysis.

October 31, 2021 · 2 minutes to read — Tags: mental models, management, leadership

Recently, I’ve found 2 leadership techniques particularly helpful. They’re both highly leveraged: they empower those around me to work effectively with less direct input and coordination from me.

Completed Staff Work

Completed staff work is a rigorous “definition of done” when one needs to clarify and recommend a decision to someone else.

It’s useful when the responsible person is different from the accountable person in a RACI matrix.

How I’ve applied it

When my colleagues present a decision that’s incomplete, my instinct is to do the additional work myself and model what I think is missing. Completed Staff Work helped me recognize that I overplay that Do-It-Myself technique (often at the expense of other priorities), and that I can instead give clear feedback about what’s missing.

Commander’s Intent

Commander’s intent is about explaining “why” a particular task or instruction is important. Then if the particular task becomes unrealistic, knowing the intent allows others to creatively and independently solve the problem with less input from the “commander.” I think of it as the “spirit of the law” rather than the “letter of the law”.

Knowing the intent empowers operators to better improvise and improve on rote instructions.

How I’ve applied it

I didn’t realize it at the time, but I used commander’s intent to evolve Glossier’s incident response process. We wanted to better detect and mitigate incidents. And we wanted to prioritize remediation work that would eliminate several types of failures. When this work inevitably ran up against product feature tradeoffs, the team was able to navigate those tradeoffs well by referring back to the explicit intent of our incident response process, namely to continuously improve product quality and team productivity.

I encourage other leaders to use ‘completed staff work’ to teach their team to make clear decisions, and use commander’s intent to allow individual autonomy while staying aligned with the group.

October 20, 2021 · 2 minutes to read — Tags: incidents, site reliability

I appreciate reading stories of how complex software systems fail and the hard-earned lessons to make them more resilient.

Here is another of my personal favorite incident anecdotes. It sticks out because it helped broaden my thinking about appropriate uses for A/B tests.

Key lesson

A/B tests can be useful to verify that a software system works correctly—they’re not just for testing the user experience.

Setting the scene

My team had spent about 8 months rewriting our e-comm frontend from dynamic, backend-rendered HTML to a statically rendered frontend (called SSG for static site generation). The main goal of the project was to make our site more scalable (by reducing the need for backend rendering and DB queries), and reduce latency.

We began QA’ing the new SSG version of Glossier behind a feature flag with fancy Cloudflare routing config.

In order to quantify the revenue impact of the project, leadership requested we do an A/B test on the conversion rate.

The team and I were initially reluctant, since an A/B test for this particular infra migration required one-off support in our Cloudflare Workers. We hadn’t planned to A/B test SSG because it wasn’t an optional feature — we needed SSG for our Black Friday traffic.

But it’s fair to ask us to back up our aspirational claims with data. And boy were we surprised when the early A/B results showed SSG had a worse conversion rate than our slow, dynamically-generated control.

We dug in, and realized that almost no customers from the UK converted in our SSG treatment. That helped us pinpoint a typo in our localization code (en-UK instead of en-GB). This caused customers with a UK IP address to default to the US store. Confused, they’d bounce rather than change their locale in a footer widget.
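As a purely hypothetical sketch of that class of bug (the real code lived in our Cloudflare Workers and looked different), a geo-IP-to-locale lookup with the bad tag might look like this:

```typescript
// Hypothetical reconstruction of the locale bug, for illustration only.
// The ISO country code for the United Kingdom is "GB"; "en-UK" is not a valid locale tag.
const localeByCountry: Record<string, string> = {
  US: "en-US",
  CA: "en-CA",
  GB: "en-UK", // BUG: should be "en-GB", so no store ever matches UK visitors
};

// Placeholder for the real store-configuration check.
function isConfiguredStore(locale: string): boolean {
  return ["en-US", "en-CA", "en-GB"].includes(locale);
}

function storeLocale(geoIpCountry: string): string {
  const locale = localeByCountry[geoIpCountry] ?? "en-US";
  // "en-UK" matches no configured store, so UK visitors silently fall back to the US store.
  return isConfiguredStore(locale) ? locale : "en-US";
}
```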

Note that we’d certainly tested many international locales, but we’d tested them by manually changing our locale (which worked) rather than via the geo-IP lookup that’s the default for most users.

We fixed the typo, re-ran the A/B test, and sighed with relief at a modest lift in the conversion rate.

The A/B test was useful for QA! It would have been more difficult and costly to find that typo had we launched without an A/B test.

September 26, 2021 · 3 minutes to read — Tags: productivity, management

Two colleagues recently asked me for personal productivity advice. I suspect one of them even gave me the Disco compliment:

Aaron Suggs always gets projects over the finish line.

Unfortunately, there must be some misunderstanding because I don’t feel like I get an unusual amount done, nor am I particularly strategic about it. My motivating principle is to avoid constant anxiety.

So here’s some free advice from an unqualified amateur.

First, I’ll point to better resources:

  • Getting Things Done (GTD) — I first read David Allen’s GTD book in 2005. It has a lot of durable, influential ideas, though the notepad-based implementation can be updated for the smartphone era.
  • Atomic Habits by James Clear. I really liked the emphasis on mindset and environment in changing habits.
  • The Rise and Fall of Getting Things Done by Cal Newport. This is a modern, wide-lens perspective on GTD and the personal productivity domain.

Here are some productivity techniques that I’ve found useful:

  1. Touch-it-once: Once a task has your attention, try to see it through to completion so you don’t need to ‘touch’ it or think about it again. For example, when I check my mail, if there’s a bill, I open and pay it right away (or better yet, set up auto-pay). Then I can recycle the bill. I never set it down or have to remember to pay it later. It means checking the mail sometimes takes a few minutes, but it doesn’t generate future work or accumulate in piles.

  2. Ubiquitous capture: Make it easy to leave notes to your future self, whether by your bedside table late at night or first thing in the morning, at your computer, in the car, or anywhere. I use the Reminders app on Apple Watch (usually via Siri), iOS, and macOS. And I use Things app to organize complicated projects. I organize my reminders to notify me when and where I can act on them. E.g. say in an 11am meeting we make a decision I need to communicate to my team. And say I’m busy until 3pm. I’ll make a reminder to share the decision with the team at 3pm. I can relax knowing that my system will notify me when I’m able to act on it.

  3. Write down the next action. If you need to interrupt a task (see #1 for why this should be rare), leave notes to your future self to make it easy to pick up where you left off. What were you about to do? On a project, you’re usually doing one of 4 things:

    1. Researching - understanding the problem
    2. Brainstorming - generating ways to solve the problem
    3. Communicating - getting approval/alignment, informing or training stakeholders about how it affects them
    4. Implementing - executing the work you brainstormed + said you’d do.

    If you’re stuck, ask yourself which one of those 4 things you should do to make progress.

  4. Be easy to follow: Write down your work process so others can imitate it. Put it in the first place you’d look for it in the future (code review comments, Jira ticket, wiki, etc). Share the checklist, notes, thought process that you went through. This feels like extra work in the moment, but pays off in the long-run.

  5. Know yourself. When do you focus best? What type of work is a chore that saps energy? Get the chores out of the way, and then treat yourself to the more enjoyable tasks. And don’t force yourself to be productive if you’re really not in the headspace for it. Focus on the work you’re able to do.

  6. Consider satisficing vs. maximizing: Ask yourself if this project benefits from a quick, low-effort satisfying and sufficient solution (i.e. satisficing), or a high-effort maximizing solution. Most of the time, the answer is satisficing.

Those are six strategies that help me remember details and stay focused. Please let me know on Twitter if you have any to share.

September 7, 2021 · 1 minute to read — Tags: site reliability, programming

Below is the mental model I use when designing or reviewing web services to fail gracefully. It’s useful to enumerate common failure types, and ensure engineers intentionally plan for them.

For each network call the service makes (to a data store or API), consider how it would behave if:

  • the call immediately returns an error. Do you have appropriate error handling, instrumentation, and retry logic?
  • the call never returns. Do you have an appropriate timeout? Can you determine if the network call should be retried? Ruby’s Unicorn web server has a concise introduction of application timeout considerations.
  • the responses are temporarily incorrect. Do you have logging and instrumentation to figure out which data are affected?

By addressing these 3 questions, you’ve built a solid foundation for a reliable, maintainable web service.
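Here’s a minimal TypeScript sketch of handling the first two failure modes for a single network call (the URL, timeout, and retry policy are illustrative placeholders):

```typescript
// Sketch: one network call with an explicit timeout and simple retry logic.
async function fetchWithTimeout(url: string, timeoutMs: number): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // If the call never returns, the abort signal enforces an upper bound on waiting.
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

async function getWithRetries(url: string, retries = 2): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetchWithTimeout(url, 5_000);
      if (res.ok) return res;
      // The call returned an error: instrument it, then retry (safe for idempotent GETs).
      console.error(`GET ${url} returned ${res.status} (attempt ${attempt + 1})`);
    } catch (err) {
      console.error(`GET ${url} failed or timed out: ${err} (attempt ${attempt + 1})`);
    }
    if (attempt >= retries) throw new Error(`GET ${url} failed after ${retries + 1} attempts`);
  }
}
```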

August 5, 2021 · 1 minute to read — Tags: values, management

Every team has values that guide their work. Of course, I like to write down some aspirational values, preferably in a charter.

Here are some of my favorites over the years:

Be exemplary: Our work, process, and demeanor should serve as an example to our coworkers, and to the industry in general. We aim to move the organization forward.

Blameless self-reflection: No one naturally likes talking about their failures, delays, or lack of understanding. Yet doing so is safe, healthy, and necessary to do better tomorrow. Strive to only make new mistakes.

Be approachable: Work collaboratively with other teams. Share responsibility and model a generative culture. Embrace others scratching their own itch, and guide them to do so in a way that benefits the whole team.

Explain why, and ensure it’s safe to ask why: Be transparent and explicit about how we prioritize, scope, and implement projects. Documenting the reasons and context for these decisions lets us easily adapt when we encounter new circumstances, and quickly onboard new team members.

Embrace experimentation with limits to avoid tech debt: Balance the virtues of consistent, familiar tools with innovative experiments. Experiments are the engine of innovation and continuous improvement.

And especially for platform/SRE teams serving internal customers:

Pave the common path: Common tasks should be easy (or automated!) and well-supported.

Empower our colleagues: Build simple, reliable processes and tools. Avoid being a bottleneck with self-service tools so others may help themselves.

July 28, 2021 · 4 minutes to read — Tags: incidents, site reliability

I appreciate reading stories of how complex software systems fail and the hard-earned lessons to make them more resilient. In fact, one of my favorite software interview questions is “tell me about a time you were involved in a production incident.”

Here is one of my personal favorite incident anecdotes. It sticks out because of the cognitive bias that slowed our diagnosis, and how thoroughly we were able to prevent similar incidents in the future.

Key lessons

  1. It’s useful to reconfirm basic assumptions if Subject Matter Experts are stumped.
  2. Listen to all the voices in the room.
  3. Thorough remediation means mitigating future failures in multiple independent ways.

Setting the scene

It was early 2015 at Kickstarter. Our Rails app used 3 memcached servers running on EC2 as a read-through cache. We were expecting a high-visibility project to launch in the coming days, so per our standard practice, we scaled up our Unicorn app processes by 50%. In this case, that meant going from 800 to 1200 Unicorn workers.

In prior months, we’d been battling DDoS attacks, so I was primed to expect unusual app behavior to be a new type of abusive traffic.

The incident

Out of the blue, our team was paged that the site was mostly unresponsive. A few clients could get a page to load within our ~60 second timeout, but more often clients got a 504 gateway timeout error. Several engineers, including myself, joined our incident slack channel to triage.

Digging into our APM dashboards, we saw that the public stats page was saturating our database CPU with slow queries, which meant our Unicorn web workers hung while waiting on DB queries to render pages.

That was strange because, while the stats queries are slow, we kept the cache warm with a read-through and periodic write-through strategy. If the results fell out of cache, the page should hang for just a few seconds, not cause site-wide impact for several minutes.

“It’s as if memcached isn’t running,” said one prescient engineer. I ignored the comment, too deep in my own investigation. Memcached doesn’t crash, I thought. It must be our app bug, or some clever new denial-of-service vector to generate DB load.

After roughly 40 minutes of fruitless head scratching, the prescient engineer piped in, “I ssh’ed into one of the cache servers, and memcached isn’t running.”

If we’d had an Incident Manager role, we’d likely have checked memcached sooner.

Biggest. Facepalm. Ever.

The fix

Moments after we confirmed memcached wasn’t running, we restarted it with /etc/init.d/memcached restart, and the site recovered within a few seconds.

With the incident mitigated, our investigation continued. Why wasn’t memcached running? Our cache cluster had been healthy for years. The EC2 hosts were healthy. Yet each memcached process had crashed in the past few hours. Only in retrospect did we observe that the site was slightly slower as the first 2 crashed. We certainly noticed the near-complete outage when the final process crashed.

Digging through our app logs, I noticed sporadic connection errors to memcached. Apparently, we still had the default ulimit of 1024. So when we scaled to 1200 app workers, only 1024 could connect, and the remaining 176 would get errors. The Ruby memcached client would automatically attempt to reconnect every few seconds.

I was still puzzled why memcached had crashed, so I searched through the code commits for anything mentioning “crash.” And eureka! This commit mentions exactly our issue: as clients connect and disconnect when memcached is at the ulimit’s max connections, a race condition can crash the server. The default version of memcached that came with our Ubuntu version happened to predate the fix. I was able to reliably recreate the crash in a test env.

With all this in hand, the team implemented several fixes:

  1. I ported the default init.d script to runit, our preferred tool at the time, to automatically start processes if they crash. This would make the impact of the crash negligible.
  2. We increased the ulimit to accommodate more workers. This improved latency because ~15% of our workers were effectively without cache.
  3. We upgraded memcached to patch the ulimit issue.
  4. We added an alert for when memcached isn’t running on a cache server, to reduce our time-to-detect.

Items 1-3 are each sufficient to prevent this particular memcached crash from having a significant impact on our customers.

This was the first and only incident with memcached crashing in my 7 years at Kickstarter.

Wrapping up

This incident taught me to be a better listener to all the voices in the room, even if it means questioning assumptions that have served me well before.

And it taught me to be tenacious in tracking down causes for failures, rather than stopping at the first sufficient mitigation. Reading the source code can be fun and rewarding!

June 18, 2021 · 1 minute to read — Tags: vendors, management

I published Guiding principles for build vs. buy decisions on LeadDev as part of their Technical Decision Making series.

Here’s the conclusion:

I encourage you to consider ‘build vs. buy’ primarily from the lens of whether the opportunity merits a long-term strategic investment of your team’s attention, and less from the lens of short-term financial cost. Build if there’s an opportunity to make a significant improvement on the state of the art and create a competitive advantage for your organization. Buy it otherwise. And be ready to discard your competitive advantages of yesteryear as better alternatives emerge.

Together with Choosing software vendors well, it feels like a coherent strategy.

June 2, 2021 · 2 minutes to read — Tags: books

Here’s a log of audiobooks I’ve listened to recently, with some notes.

Nonfiction

Sapiens: A Brief History of Humankind by Yuval Noah Harari

This was a wonderful book. Some of my favorite points:

  • Corporations and nations are a collective fiction like religions.
  • Money is a uniquely valuable technology because it transcends culture.
  • Agriculture and the Neolithic revolution changed humanity in harmful ways, increasing the likelihood of violent conflict and poverty (reminiscent of Ishmael by Daniel Quinn).
  • The chapter on happiness and Zen Buddhism gave me galaxy brain.

The Essential Drucker by Peter F Drucker

A wide-ranging collection of insights on business management, many from the early-mid 1900s.

Good to Great by Jim Collins

I read this because it’s popular among Glossier leaders, with frequent references to a ‘flywheel’ and ‘getting the right people on the bus’. I found it quite valuable.

Some notes:

  • Humility and egolessness are critical leadership skills.
  • You are more likely to get revolutionary results from an evolutionary process than a revolutionary process. I.e. evolving a process is like compound interest.
  • Opportunity selection is more critical than opportunity creation.

Inspired by Marty Cagan

I read this because an Eng/PM friend recommended it when I confessed to a lot of role confusion among PMs, Eng Managers, and tech leads. It’s a good primer on what Product Management should be. I particularly appreciated the emphasis on finding reference customers as a symbiotic partnership.

Doughnut Economics by Kate Raworth

A thought-provoking exploration of an economics that doesn’t assume indefinite growth. She argues that systems thinking (stocks and flows) is much more helpful to economics than trying to discover physics-like natural laws and constants.

Algorithms to Live By by Brian Christian and Tom Griffiths

I especially liked applying the multi-armed bandit approach to explore/exploit trade-offs in everyday life (like whether to try a new restaurant).

Structure of Scientific Revolutions by Thomas S Kuhn

I re-read this for the first time since college. One point that really stuck out was that work on novel paradigms is often accessible to a non-academic audience. Examples were Newton’s Principia and Darwin’s Origin of Species. In contrast, once a paradigm is well established, academic work becomes deeply niche and inscrutable without decades of training.

Turns out, the hard sciences are more subjective than we realize.

Fiction

  • His Dark Materials trilogy by Philip Pullman
  • The Broken Earth trilogy by N. K. Jemisin
  • The Yiddish Policemen’s Union by Michael Chabon
  • Project Hail Mary by Andy Weir. Rocky! I liked The Martian and Artemis. This is my favorite of the three. Weir has really found his groove.
  • Death’s End (Remembrance of Earth’s Past trilogy) by Cixin Liu

May 23, 2021 — 1 minute to read — Tags: management, release engineering, productivity

There’s a crucial moment in platform engineering projects when you decide it’s ready to ship. For large projects (say, more than 1 year of engineering effort), it can be a difficult decision for the team. More cautious contributors want to delay for further testing and polishing. Other teammates inevitably begin to shift their attention to their next project, and are eager to move on.

I’ve found a simple criterion for navigating the risk/reward trade-off when launching a complex project:

Ship when the project is an improvement on the status quo.

If current engineering risks make it uncertain that it’s an improvement, continue testing and fixing defects until it’s clearly an improvement.

And all those extra features on the backlog: you can still build them, but it’s not worth withholding the value you’ve already created while you do so.

I’ve found this to be particularly useful for architecture migrations or component rewrites where achieving ‘feature parity’ with a deprecated implementation or ‘feature completeness’ of our ideal product is difficult or not worth the opportunity cost. Agreeing to ship a feature when it’s a net improvement over the existing solution ensures our team delivers value as quickly as possible, and helps us focus our effort on the most impactful work.

May 16, 2021 — 3 minutes to read — Tags: programming, release engineering

My team has been discussing the role of various test and development environments. We’d like to provide guidance for what developers should test locally on their laptop, on an ad hoc deployed environment, on a pre-prod environment, and on production.

I’d like to share some criteria that help me organize the value and purpose of various environments.

Let’s start with 3 key features of an ideal dev env:

  • Fast feedback. It should be as quick as possible to change some code in your editor and see the effect in your dev environment.
  • Similar to production. The environment should be maximally similar to the full production system. There are several aspects in which it may be similar or different, such as the infrastructure it runs on, configuration, data, and traffic load.
  • Isolated. Any side effects and failures should be isolated from impacting actual users, or even your colleagues (e.g. minimize the time that one person breaking the build blocks teammates).

In practice, fast, similar, and isolated aren’t so much features as continuous dimensions that we try to maximize. We can carve out roles for various dev envs by considering the relative importance of each dimension.

Local development

For local development environments (i.e. code running on your laptop), I’d rank the importance as follows:

  1. Isolated
  2. Fast
  3. Similar

In other words, it’s most important that local envs are isolated from breaking anything on production or anyone else’s environments. The 2nd priority is fast developer feedback as long as it doesn’t compromise isolation. And the 3rd priority is being production-like, as long as it doesn’t compromise isolation or fast feedback.

Features like Webpack’s Hot Module Replacement and React Hot Reloading improve feedback time, but detract from being production-like. That’s a win for local development, since ‘Fast’ is more important than ‘Similar’.

By similar reasoning, local development is a good place to run uncommitted code, or to dynamically generate assets that would be immutable deploy artifacts on production.

Testing on production

What about practices that let you more safely test on production, like feature flags and blue-green deployments? I see the ranking as:

  1. Similar
  2. Isolated
  3. Fast

‘Similar’ is de facto top priority since it is production. Next up, our goal is to isolate failures and unintended side effects as much as possible. And finally, we want fast feedback as long as it doesn’t compromise isolation.

Other deployed environments

Where does that leave environments like staging, QA, and other quasi-production environments? For decades, they’ve been a middle ground between local development and production.

As release engineering and local development tooling improves, I’m finding fewer reasons to maintain them. More likely, I’m going to invest in ways to gain confidence in my code locally, or build ways to safely test it on production.

Let’s recall the aspects in which an environment can be production-like: infrastructure (as in the CPU and memory resources, operating system, and system libraries), configuration, data, and traffic.

Years ago, infrastructure and configuration were frequent sources of bugs. Code might work on a developer’s macOS laptop, but not on a Linux server. Or we forgot to set all the environment variables we expected. Staging environments were a critical place to suss out those bugs. Lately, infra-as-code tooling and better configuration patterns like Terraform, CloudFormation, and Docker have made these issues rare.

Most bugs I see on production today are related to data (i.e. unexpected states) or traffic (unexpected resource contention or race conditions). Those are particularly difficult to suss out in non-production environments.

Sometimes creating these non-production integration environments means adding and maintaining new code paths. For example, for Stripe’s sandbox environment, Stripe maintains different payment cards that always succeed or return predictable errors. That’s behavior unique to the sandbox, not found on production. In order to be useful for isolated testing, they had to compromise on being similar to production. As I think about a constellation of microservices that could make up a complete test environment, the support cost of these alternate code paths can add up quickly.
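To illustrate what I mean by alternate code paths, here’s a hypothetical Python sketch (my invention, not Stripe’s actual code). Every branch like this has to be maintained and kept from leaking into production behavior:

    # Hypothetical sandbox-only code path: magic card numbers trigger
    # canned outcomes that have no equivalent on production.
    SANDBOX_CARDS = {
        "4242424242424242": "success",
        "4000000000000002": "card_declined",
    }

    def charge(card_number: str, amount_cents: int, sandbox: bool = False) -> str:
        if sandbox and card_number in SANDBOX_CARDS:
            return SANDBOX_CARDS[card_number]  # canned result, no real charge
        # On production, this would call the real payment gateway.
        raise NotImplementedError("real gateway call goes here")

    print(charge("4242424242424242", 1_000, sandbox=True))  # -> success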

For SRE / Platform / Release Engineering teams tasked with supporting developers across the entire delivery lifecycle, we must choose where our attention can have the most impact for the organization. I’m finding that, ever more often, the focus is on fast local development and safe production releases, and there are fewer reasons to maintain non-production deployed environments.

Check out “The value of reliable developer tooling” for some of my prior work on dev envs.

March 27, 2021 — 2 minutes to read — Tags: capacity testing

A friend recently asked how to set better capacity testing goals for their tech team. We agreed we needed to get more specific about what “capacity testing” means.

Below are the key terms I’ve found helpful. I go into more detail about these in my talk Surviving Black Friday (particularly at the 8:00 mark).

Expected traffic

This is a plausible traffic volume based on prior data and forecasts. I encourage this number to come from cross-functional consensus from Eng, Data, Marketing, Product, Sales, etc. Express it as a rate like reqs/sec or orders/min.

If your estimate is too low, the service could crash from insufficient capacity. But if it’s too high, you’ve wasted scarce engineering effort building capacity you don’t need (analogous to excess inventory in a supply chain). Site downtime is usually more costly than wasted engineering effort; the virtue of highlighting wasted effort is to recognize the marginal work and opportunity cost for engineers to support greater capacity.

Safety factor

This is the multiplier or fudge factor to give yourself breathing room. I’d suggest 20-50x for early-stage startups that don’t know when they’ll go viral, 5-20x for growth-stage businesses, and <5x for mature businesses with robust prior data and tightly-managed sales/marketing plans. At Glossier, we currently use a 10x safety factor. We were bitten in 2018 with a 5x safety factor and an insufficiently detailed traffic forecast.

Capacity target

This is what you’re aiming for:

capacity target = expected traffic * safety factor

So assuming expected traffic of 500 req/sec, and a safety factor of 10x, your capacity target is 5,000 req/sec.
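The same arithmetic as a trivial code sketch (the function name and figures are mine, purely illustrative):

    # capacity target = expected traffic * safety factor
    def capacity_target(expected_traffic: float, safety_factor: float) -> float:
        return expected_traffic * safety_factor

    assert capacity_target(500, 10) == 5_000  # req/sec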

Demonstrated capacity

This is the load that your engineers have proven your system can handle during their load tests. Keep scaling and removing bottlenecks until your demonstrated capacity exceeds the capacity target.

Pro tip: run your load tests for several minutes (we do an hour) to “soak” your infrastructure (kudos to Rajas Patil for introducing this idea). This can reveal bottlenecks that don’t show up in quick tests. For example, at Glossier, the data replication from our customer-facing DB to our BI data warehouse was significantly delayed during a 60-minute soak test, but we wouldn’t have noticed during a quick 5-minute test. By detecting the replication delay early, we had time to mitigate it.

March 14, 2021 — 3 minutes to read — Tags: management

The Glossier Tech team wrapped up our annual roadmap exercise earlier this year. It takes a lot of time and attention from the team, especially managers.

I wanted to share some tips I’ve gleaned to make reviews easy and productive. They’re organized into ‘filters’, or questions to ask about each project in a roadmap.

If product and engineering managers can speak to each of these filters, they’ll likely have a smooth review with no surprises.

Virtually all the questions and feedback that came up in our roadmap reviews fall into one of the filters below; and each one is a hard-learned lesson from watching my projects or teams stumble.

1. Sufficiently detailed

The appropriate level of detail increases during the roadmap process. In general, sufficient detail means that project outcomes and requirements are defined, and that key decisions and risks are highlighted and investigated. A 6-month project may not have a clear approach at the beginning of the roadmapping process, but by the end it would likely have specific, realistic outcomes for each 2-week sprint.

Having documented examples of project plans with the appropriate detail is helpful here (see Will Larson’s Discouraging Perfection). Some people take roadmapping too seriously, going into so much detail that the precision of their plan exceeds its accuracy. They get frustrated when they need to adapt to the unexpected. Others can be too casual, or hedge so much that it’s difficult for others to depend on them. The key is the psychological safety to acknowledge that plans are imperfect and will inevitably change. The point is sufficient detail to reduce risks, not complete and rigid precision.

2. Aligned with business goals

Are these projects sufficient to meet the team’s mission and biz goals? If not, change up the projects, or set more realistic goals. For example, if a goal is to increase a conversion rate by X% this year, but the projects to improve conversion ship at the end of the year, they likely won’t have a significant impact on the conversion rate and there’s little time to respond.

3. Comprehensive of all work

Does this roadmap account for all the work the team will have to do? If a team spends 20% of their time responding to bugs filed by the customer support team, that should be accounted for in the resource planning. We call this Keep The Lights On (KTLO) work.

4. Sequenced effectively

Which projects have strict deadlines? Do Team A’s projects depend on one of Team B’s projects? Does Project X become easier or more valuable if we do Project Y first? A group roadmap review is one of the more obvious places to suss this out.

5. Resourced for success

Does the team have appropriate people and skills to deliver each project? What skill gaps or “bus factors” are there? What’s the plan to get those skills (hire, train, or borrow)?

6. Iterative milestones

Can you frontload more business value? I.e. be more agile and less waterfall. Are there narrow customer segments or journeys that you could support early on while you develop the rest of the project? Are there milestones that de-risk the project and enable real feedback as soon as possible?


Having presented and reviewed several roadmaps, I’ve found these filters to be helpful linting tools for making more useful roadmaps. Or at least they allow me to learn new ways to fail rather than repeat my previous mistakes.

March 10, 2021 — 1 minute to read — Tags: management, productivity

While reviewing how I’ve spent my time recently, I stumbled into a practice to better ensure I have sufficient flexible time for serendipitous projects. I’ll aim to schedule a max of ~80% of my time for inflexible work like group meetings.

The practice was inspired by “hara hachi bun me”, the Confucian practice of eating until you’re 80% full.

Dysfunction of the over-booked calendar

What’s the harm in scheduling every minute of your day? If appointments are hard to move, it adds friction to saying ‘yes’ to unexpected opportunities. I found myself disinclined to make time when doing so carried administrative overhead like rescheduling meetings and delaying project timelines.

Applying some lean production theory, as your schedule becomes 100% utilized, the wait-time for any new task approaches infinity.
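To make that concrete, here’s a toy illustration using an M/M/1 queueing model (my analogy; the lean literature makes the same point with factory machines). Mean time in the system is 1 / (service rate − arrival rate), which blows up as utilization approaches 100%:

    # Mean time in system for an M/M/1 queue: W = 1 / (mu - lam).
    mu = 1.0  # service rate: tasks you can handle per hour
    for rho in (0.5, 0.8, 0.9, 0.99):  # utilization
        lam = rho * mu  # arrival rate implied by that utilization
        print(f"{rho:.0%} utilized -> {1 / (mu - lam):.0f}h avg time in system")

At 50% utilization, a new task spends about 2 hours in the system; at 99%, about 100.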

Remove friction to optimize your schedule

The solution: as I notice my schedule filling up, I more aggressively block off flex time on my calendar. To be sure, I find it very useful to be intentional with every minute of my schedule. Adrian Cruz describes this well in The Power of Quiet Time. So while I may have an hour or two blocked off for writing docs or making a prototype, I consider that flex time because there’s low friction to reschedule it to help debug a complex issue or have impromptu discussions.

Some flex time in your calendar makes it easy to say ‘yes’ when opportunity knocks.

February 18, 2021 — 3 minutes to read — Tags: books

In mid-2020, I got an Audible subscription as a substitute for doomscrolling through social media.

It turns out I enjoy listening to books far more than reading them. Here are some books I enjoyed in the last 6 months (follow my Goodreads profile for more):

Nonfiction

  • The Machine That Changed the World by James P Womack. One-sentence book review: rigorous insights into Japanese lean manufacturing and keiretsu, emphasizing that the success is due not to a national or cultural identity, but to a set of practices thoughtfully applied.
  • Accelerate: The Science of Lean Software and DevOps by Nicole Forsgren: software teams should focus on deploy frequency, lead time, TTR, and change fail rate.
  • An Elegant Puzzle by Will Larson: bring expansive and systematizing mindset to every technical and management challenge; then work the process (not the exceptions).
  • The Goal: A Process of Ongoing Improvement by Eliyahu M. Goldratt: identify and eliminate bottlenecks to improve throughput. Bottlenecks can be subtle or unintuitive.
  • Don’t Think of an Elephant by George Lakoff: controlling subtle and implicit metaphors has huge leverage to frame political debates. Personal values can often be grouped into a “dominant father” or “nurturant mother” mindset. (With Sapiens, I’m realizing this parallels chimpanzee and bonobo social hierarchies as well.)
  • Reimagining Capitalism in a World on Fire by Rebecca Henderson: current corporate rules and norms have undermined the long-term health of society, so leaders should advocate to change the rules for healthier incentives.
  • This Could Be our Future by Yancey Strickler: having an expansive and long-term notion of value (beyond, say, money) clarifies purpose.
  • Lives of the Stoics: The Art of Living from Zeno to Marcus Aurelius by Ryan Holiday. Stoics are surprisingly varied and relatable. I particularly appreciated the portrayal of Seneca as a flawed moderating influence on a corrupt leader.

Fiction

February 14, 2021 — 2 minutes to read — Tags: bento, values

In January, I joined the Bento Society as a weekly practice in long-term thinking.

The society is born of Yancey Strickler’s book This Could Be Our Future. Bento stands for Beyond Near Term Orientation, and is a play on the neatly separated Japanese lunch tray.

In its simplest form, the Bento is a square divided into quadrants, with the x-axis being time (now and the future) and the y-axis our self-interest (me and us).

[Image: a blank Bento grid]

One powerful application is to use the quadrants to tap into important parts of your identity. Write a question like “what should I do today?”, and envision how each quadrant would answer.

[Image: a Bento filled in for “what should I do today?”]

“Now me” (your short-term self-interest) might want to binge watch Netflix, or knock out a work project that’s been on your mind.

“Now us” (your short-term group-minded self) might want to talk with a family member going through a hard time, or reconnect with an old friend.

“Future me” (your long-term self interest) might want to work on a passion project, or practice a new skill.

“Future us” (your long-term group-minded self) might want to apply a new skill in a way that benefits your community.

All too often, I find that the “now me” gets to drive my life. Thinking through the Bento quadrants helps me balance near- and long-term interests; and balance self-care and service to others. It’s not about judging certain quadrants as good/bad or right/wrong; simply that no one quadrant is the complete picture of what matters.

After doing several Bentos, I’ve found the exercise highlights the values and guiding principles that I want to practice more thoroughly, like curiosity and compassion.

Participating in the Bento Society has been a helpful way to ground and orient my values and daily habits.

February 4, 2021 — 2 minutes to read — Tags: vendors

If you follow the strategy of avoiding undifferentiated heavy lifting, you’ll inevitably integrate with many software vendors. For example, my e-commerce software team at Glossier has vendors for our cloud hosting, CDN, CMS, payment processing, CI/CD pipelines, code hosting, internal docs, observability, and alerting to name a few.

Here are a few criteria I’ve found particularly valuable when choosing vendors.

1. Emphasize rate of improvement over current feature set

Prefer a vendor that’s sufficient today and improving quickly over a dominant-yet-stagnant vendor. I judge vendors’ rate of improvement by their recent feature announcements and possibly a roadmap presented by an account rep. In other words, consider which vendor will likely have the best product ~2 years from now; not just the product as it exists today. Skate to where the puck is going.

This criterion is particularly helpful when comparing small, disruptive innovators with current market leaders.

Of course, if the current market leader is also innovating quickly, you’re lucky to have an easy decision.

2. Backchannel referrals / Ask an expert

In my StaffEng interview, I shared this anecdote:

Our team was recently choosing a new vendor and the team was split between two mediocre choices. I asked an acquaintance with expertise about the vendors how he would choose; and he recommended a lesser-known new vendor that quickly became a universal team favorite.

To expand on this example, it was an area where our team had little expertise. It was difficult for us to determine which features really mattered, or to set realistic expectations. Asking experts in your professional network can bring clarity and confidence.

3. Emphasize net value over cost

From the Vendor relationship HOWTO:

The goal is to maximize our organization’s long-term value from the vendor’s service.

In contrast, I’ve sometimes seen teams try to minimize cost, ignoring gross value. This is short-sighted.

Suppose Vendor A costs $25k/yr and adds $200k of gross value to the org ($175k net value); while Vendor B costs $100k and adds $500k of gross value ($400k net value).

Choose Vendor B because of the higher net value, even though it’s more expensive than Vendor A.
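As a toy sketch of that comparison (numbers from the example above):

    # Compare vendors by net value (gross value minus cost), not cost alone.
    vendors = {
        "A": {"cost": 25_000, "gross_value": 200_000},
        "B": {"cost": 100_000, "gross_value": 500_000},
    }
    net = {name: v["gross_value"] - v["cost"] for name, v in vendors.items()}
    print(max(net, key=net.get))  # -> B, despite the higher price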

To be sure, I don’t know how to assess the gross value derived from any vendor beyond a hand-wavy estimate. Here are some techniques I use; though I’d certainly like to learn more.

One technique is to look at productivity improvements. If a tool saves each engineer 1 hour per week, that’s a 2.5% productivity improvement; so its gross value is roughly 2.5% of your total Engineering payroll.
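In code form (assuming a 40-hour work week; the payroll figure is made up):

    hours_saved_per_week = 1
    work_week_hours = 40
    annual_eng_payroll = 5_000_000  # illustrative

    productivity_gain = hours_saved_per_week / work_week_hours  # 0.025, i.e. 2.5%
    gross_value = productivity_gain * annual_eng_payroll  # $125,000 per year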

Other times, vendors add capabilities or controls that change how the team works, so you can’t easily assess productivity. In this case, speculate about how much value your org gets from that capability or control. E.g., an A/B testing tool adds the capability to rigorously measure the impact of product changes. The gross value is the difference between the product features you ship using A/B test feedback and the features you would have shipped without it. Security tools add controls that constrain types of risk. The gross value is the difference between the liability of the unknown/unconstrained risk and the better-known/constrained risk.

January 25, 2021 — 3 minutes to read — Tags: vendors

Creating and sustaining vendor relationships can be a highly leveraged skill for software engineering teams. But there’s little guidance or structure at small companies for folks learning to build vendor relationships.

So here’s a template of the bare essentials and some nice-to-have responsibilities to steer emerging engineering leaders in vendor management. It’s born from experience at companies ranging from tiny startups to growth-stage businesses with hundreds of employees. Larger companies have more formal processes for choosing and managing vendors.

This post won’t cover how to choose a vendor or the famous “build versus buy” calculus; instead, it focuses on what to do after you’ve chosen a vendor.

Without further ado, the template:


The goal is to maximize our organization’s long-term value from the vendor’s service. That means we use their service appropriately, and spend money efficiently.

Minimum essentials

  1. Each vendor should have a Directly Responsible Individual within the org. The DRI is responsible for the items below.

  2. Follow our org’s legal review process. Before you accept terms of service or sign anything, familiarize yourself with your company’s signing authority and approval process. In short, give our legal team a heads up; and they can help navigate contract discussions, particularly around liability and data privacy issues.

  3. Follow our org’s billing process. Give our accounting team a heads up to coordinate who keeps track of invoicing and receipts. Very small companies tend to use corporate charge cards. As they grow, it tends towards invoices and purchase orders with formal approval processes.

  4. Know how to contact the account rep, escalate tech support tickets, or otherwise get high-quality, timely technical assistance. Preferably, this contact info is stored in a well-known, discoverable place for all vendors. (We use Blissfully.) This is particularly important for business-critical vendors like payment providers and CDNs.

  5. Keep payment information up-to-date to avoid service disruptions; and make sure invoices are approved/paid on time. Check your emails!

  6. Use a vendor-specific email list like [email protected] for all communication with the vendor. As our team grows and we onboard new members, they can easily review and join discussions. As the DRI, you’re responsible for staying on top of this email list.

  7. Ensure money is spent effectively. Should we change our terms to reduce our bill (like commit to a larger quota to reduce overage charges)? For large contracts (>$15k/yr), negotiate with the vendor (the finance team can help with this).

  8. When contracts are expected to change or expire without renewal, inform stakeholders with ample time to implement alternatives.

  9. Ensure the process for onboarding and offboarding employees with the vendor is documented clearly.

  10. Maintain a list of the PII and sensitive information that’s shared with the vendor. Your legal team can help ask the right questions here.

Nice-to-have strategic considerations

Here are some next-level ways to derive significantly more value from your vendor relationship:

  • Maintain a clear sense of the value this vendor provides the organization. Tech vendors typically use value-based pricing (as opposed to cost-based pricing), so being able to describe the value of various features ensures you and the account rep speak the same language.
  • Track how closely our usage aligns with the vendor’s typical customer usage. Do we use their service in a common, expected way, or in a custom, unusual way that could be a strategic risk as the vendor evolves? Are we one of their biggest/smallest customers (another strategic risk), or middle-of-the-pack?
  • Maintain a general sense of the competitive landscape and alternatives for the vendor. What’s our next best alternative if we had to move off this vendor? Are there competitors who have a superior service or are gaining quickly? When would it be worth the opportunity cost to build it ourselves?
  • Track and contribute to the vendor’s private roadmap (beta features). Usually the account rep will offer to discuss this once or twice per year.

Congrats, you’re well on your way to a productive, valuable vendor relationship!

January 21, 2021 — 8 minutes to read — Tags: leadership, management

This interview originally appeared on StaffEng. I wanted to share it here as well.

Tell us a little about your current role: where do you work, your title and generally the sort of work that you and your team do.

I work at Glossier, a direct-to-consumer growth-stage skincare and beauty company with incredibly passionate customers. Our engineering team is ~35 people. I’m a Principal Engineer, mostly focusing on our Site Reliability and Tools team. My recent focus has been leading Glossier’s Operational Excellence initiative (nicknamed ✨GLOE✨) and ensuring we’re building scalable services and team practices. I define operational excellence as our ability to deliver low defect rates, high availability, and low latency for product features. In practice for the SRE/Tools team, that means improving observability, increasing our infra-as-code adoption, and shepherding our migration from a monolith to microservices.

In the Staff Eng Archetypes, I gravitate most towards being a right-hand, and secondly a solver.

Prior to Glossier, I was a Director of Engineering at Kickstarter. In 2018, I joined Glossier as a Senior Staff Engineer (an IC role), and as the first engineer to focus primarily on internal tools and engineering practices. My first projects were building a feature flag system so we could safely and easily test features with real data; then implementing continuous deployments to accelerate delivery.

After a few months, I switched back to management to lead a new Platform team and prepare for Black Friday. Glossier has an annual Black Friday sale that generates a huge spike in traffic and revenue, and our ambitious growth targets showed we needed to rigorously prepare with capacity testing, system hardening, and cross-functional collaboration (see Surviving Black Friday: Tales from an e-commerce engineer for details on Glossier’s Black Friday prep). After some re-orgs, the Platform team wound down, but the current SRE/Tools team does similar work. A year ago I gave up my management responsibilities to more deeply focus on operational excellence.

Did you ever consider engineering management, and if so how did you decide to pursue the staff engineer path?

Absolutely! I’ve switched from manager to IC twice in my career; and I’ll likely do so again.

When I first became a manager in 2015, it was the only career path for a senior engineer at my company. Fortunately, ever-smaller engineering teams soon created and shared career ladders with parallel IC and management tracks. When I helped create Kickstarter’s engineering ladder, I emphasized IC growth paths that didn’t require people management.

I was deeply influenced by a section of Camille Fournier’s Manager’s Path that called out “empire building” as a toxic management practice. It reminded me of the argument in Plato’s Republic that the political leaders shouldn’t be those that selfishly seek power, rather those whose wisdom makes them duty-bound to lead.

So I don’t orient my career around ever-greater management responsibilities: it’s one tool in the toolbox. I appreciate management as a rich discipline that I’ll spend my career honing; alongside programming and systems engineering.

Here are some important factors for me when switching between manager and IC roles:

  • What skills does the team need most acutely: management to coordinate the actions of a group; or an IC to accelerate the execution?
  • Will I have sufficient support and feedback to learn and succeed?
  • Am I the only one on the team who could do this; or could others do it well?

Can you remember any piece of advice on reaching Staff that was particularly helpful for you?

“Replace indignation with curiosity.”

Several years ago, I told my manager about another team behaving in a way that caused problems for my team. When I finished, he gave me that advice. I hadn’t been curious about why the other team was acting that way. It turned out they had constraints that made their behavior quite reasonable. By approaching them with curiosity and a helpful mindset (instead of frustration), we quickly found a process that improved both our workflows.

More recently, while struggling with burnout, a career coach asked me, “What would let you approach each day with energy and optimism?”

It’s become my morning mantra, ensuring that I make time for operational excellence and mentorship and bring genuine enthusiasm to my work.

How do you spend your time day-to-day?

My days are roughly 50% scheduled meetings, 35% deep-focus blocks, and 15% unplanned work.

I work hard to make sure the meetings are effective. That usually means at least having an agenda. The meeting should have a clear purpose known to attendees beforehand, such as making a decision, generating ideas, or reviewing information. Meetings often have a negative connotation because they’re facilitated poorly, but they can be incredibly productive. I try to get better at facilitating productive meetings and using synchronous attention well. High Output Management by Andrew Grove is a great resource to learn about effective meetings.

A technique I recently learned from my CTO is to schedule reading time at the start of a group meeting. Say you’re in a hiring debrief: everyone spends the first 5 minutes reading each other’s feedback about the candidate. It’s a great way to ensure attendees truly read the document and have it top-of-mind. It ultimately saves time and elevates the subsequent discussion.

I also interview quite a bit. In 2020, I did (checks calendar) 126 interviews. Improving the long-term health of the team is a key Staff+ responsibility; and helping us hire great people is part of that.

The deep-focus blocks are marked off on my calendar. My company observes “No Meeting Thursday” which helps a lot. I use these blocks for work that’s ‘important but not urgent’ from Eisenhower’s productivity matrix. That’s usually writing specs and documentation, or researching and prototyping new tools and patterns.

My schedule is unusual in that I stop work around 4pm most days, then work later in the evenings, ~8-10pm. This gives me several high-quality hours with my family each day. I have difficulty concentrating in the afternoon, and can more easily concentrate at night. And I enjoy getting something done right before bedtime. So this schedule has improved both my work/life balance and productivity. I changed my schedule because of childcare needs during the coronavirus pandemic; but I think I’ll keep it long-term. I encourage everyone to reflect on what habits and schedules are helpful for their work. An open discussion with your manager and some flexibility can go a long way.

The unplanned work is mostly answering Slack messages, advising on urgent issues, or sometimes responding to a production incident. I try to approach this work with a helpful attitude, and also with an eye towards cross-training and writing discoverable documentation to minimize future unplanned work.

Where do you feel most impactful as a Staff-plus Engineer? A specific story would be grand.

I think of my impact in two ways:

  1. Working the plan
  2. Serendipity

‘Working the plan’ is about making daily, incremental progress on a big project with a team. One example was improving our site availability from under 99% to over 99.95%. It took a lot of Learning Reviews (blameless postmortems), training, testing, and refactoring. Another was a 9-month migration from dynamically-generated Rails-based HTML pages to statically-generated React-based ones to improve time-to-first-byte and availability. It took a lot of coaching, buy-in, and coordination. To successfully work the plan, you need clear goals and incremental milestones to keep the team motivated, and continuous alignment with leadership on the desired outcomes and timeline.

‘Serendipity’ in my work is about sharing an insight with the right people at the right time to make a positive impact. For example, our team was recently choosing a new vendor and the team was split between two mediocre choices. I asked an acquaintance with expertise about the vendors how he would choose; and he recommended a lesser-known new vendor that quickly became a universal team favorite.

Another serendipitous example was an engineer mentioning during standup that a caching optimization wasn’t having the impact they expected. I happened to be familiar with the config options of the particular Ruby web server, and was able to interpret some complicated metrics on a dashboard they showed to determine that we had misconfigured a memory threshold. Later that day, we made a one-line config change to optimize our memory usage that reduced latency by 30%.

Serendipitous impact isn’t planned; and isn’t necessarily hard work. It’s about paying attention (being present), keeping a curious mindset, and sharing the insight in a way that colleagues are open to receiving.

How have you sponsored other engineers? Is sponsoring other engineers an important aspect of your role?

Certainly! As a Principal Engineer, I try to be an enthusiastic and conspicuous first follower when other engineers are introducing important new practices. Some examples are when colleagues demoed React snapshot testing and local development with Docker. After each demo, I’d ask how I could try it out and see the benefits for myself. Then I’d look for other teams and in-flight projects where we could apply these practices to get wider adoption.

I also ‘cheerlead’: recognizing a colleague’s valuable effort in public or in a small group, even if the outcomes aren’t tangible yet. It could be complimenting a team that was thorough and reflective during a difficult Learning Review; praising an engineer who reproduced a tricky race condition; or thanking someone who documented a poorly understood process.

I aim to serve two purposes with cheerleading: recognize those doing the valuable behavior, and give positive reinforcement in the hopes that the team does more of that behavior. It’s really operant conditioning, but cheerleading sounds much nicer.

What about a piece of advice for someone who has just started as a Staff Engineer?

Other engineers look up to you as a role model, some in ways you may not expect. They’ll emulate your coding style, your tone in code reviews, your behavior in meetings, your rationale for making decisions, and the way you treat colleagues.

It can feel like a lot of responsibility to be perfect all the time. But it can also bring clarity to your work: do your best, acknowledge shortcomings, be generous and curious.

January 18, 2021 — 2 minutes to read — Tags: programming, documentation

A well-crafted GitHub pull request can be a powerful way to show others how to extend and maintain a component. These ‘Exemplary’ PRs highlight the code and practices you want others to emulate.

A few years ago, my Platform team was implementing a new GraphQL API. We found engineers needed a lot of support and code reviews to add new mutations in our app. One of our lead engineers used a new mutation as an opportunity to create an exemplary PR.

The exemplary PR for a GraphQL mutation showed:

  1. The new class to create and interface to implement
  2. How to register the new mutation with the server
  3. How to handle authentication/authorization
  4. How to validate the object and handle validation errors
  5. Instructions for how to test the mutation locally, what automated tests to create, and how to manage test state

It turned out to be a highly leveraged effort! As we pointed engineers to the exemplary PR, they were able to easily create high-quality mutations while needing less support from the Platform team.

Recently, I had the opportunity to help create another exemplary PR. Our SRE team wanted to make an easy process for Eng Managers to maintain their team’s PagerDuty on-call schedules using Terraform. We created a simple pagerduty_team module that only required a few parameters, like the name of the team and a list of emails of the on-call members. That way managers didn’t need to learn a bunch of Terraform provider details just to maintain their on-call rotations.

I worked with an EM to craft an exemplary PR, adding her team’s rotation, and being sure to add explanatory comments about how our CI/CD pipeline applies the changes. As other EMs asked how to set up their on-call schedule, we’d just send a link to that PR. It was obvious what values to substitute.

To be sure, we had more documentation about our Terraform setup; but making the PR the one-stop-shop ensured EMs could get their rotations set up in minutes without much reading or back-and-forth.

Engineers naturally look for similar code in a repository they can use as a starting point for new features. Creating and labeling exemplary PRs is a helpful way to highlight the code you want them to emulate.

December 31, 2020 — 1 minute to read — Tags: management, career, personal

In late 2019, I was burnt out in my Director of Engineering role. I spent several sessions with a career coach outlining my professional challenges. Teams lurched from crisis to crisis. Various teams either lacked a coherent strategy, or lacked the alignment or resources to execute it effectively. Frequent confusion about roles and responsibilities caused tension. I didn’t have the resources to fix it all.

My coach finally asked:

“What would let you approach each day with energy and optimism?”

The question felt like reaching a vista after a long hike. My mood lifted as answers leapt to mind. I love being a small part of a big success. I love coaching and cheerleading colleagues working on something difficult and important. I love pairing—learning and teaching simultaneously—and fist pumping when we track down a bug. I’d be interested and excited to tackle each of my company’s particular socio-technical challenges in a focused, disciplined way. But to make time for that, I needed to significantly change my role.

I shared the revelation with my manager; and a few short weeks later, I handed off management responsibilities to a colleague. I became a Principal Engineer rather than Director. I’ve spent the past year mostly as an individual contributor, and mostly loving my work.

My coach’s question has become my mantra as I set my daily intentions. It’s honed my ability to focus on where I can make meaningful progress, and let go of the rest. It helps me orient my schedule around what’s important rather than what’s urgent.

In 2020, COVID and an immunocompromised family member upended my daily routines. My household navigated remote schooling and daycare with two working-from-home parents. Throughout these changes, I’m thankful for many blessings. In particular, I’m thankful for this mantra, which helped me adapt to new roles at work and at home. It’s improved my satisfaction both at work and with my family.

As I think of goals and intentions for the new year, I’m asking myself, “what could I work on with genuine energy and optimism”?


Aaron Suggs

ktheory is the personal blog of Aaron Suggs, a software engineering leader in North Carolina.

Copyright Aaron Suggs 2022 and licensed under Creative Commons Attribution 4.0.