ktheory: Aaron Suggs’s blog

November 24, 2021 — Tags: personal

This is the eve of Thanksgiving in the US, an occasion to reflect on what we’re grateful for. This year, I’d like to share my gratitude publicly.

First and foremost, I’m thankful that my close family and loved ones are healthy, and we generally feel safe and secure. This has been the case virtually all my life. But it wasn’t the case last year, so I appreciate it all the more this year.


I’m thankful for good advice that helped me view life’s challenges through a useful lens. Early in the year, I shared with an older friend that I felt disoriented about where my career and personal relationships were going. “Of course, you’re at the age to feel that way,” said my friend.

They went on to explain that I’d established a career, and I’d gotten through the early years of parenting. Now my spouse and I were finding ourselves with more time and energy than we knew what to do with. They said, “Now is the time to rewire your relationships for the next few decades.”

I love that metaphor: rewire relationships for the next decades. It unlocked fantastic questions. How do I want to treat my spouse, and how do I want to be treated, from now into retirement? What’s the relationship I want with our kids? How do I want to interact with my community? What’s my professional identity and relationship to work?

This year, I’m grateful for those focusing questions, and the effort I’ve put towards answering them.


I’m thankful for learning, both about myself and others. My personal therapy has revealed that I’m often unable to acknowledge my own negative emotions. Beginning to do so feels revelatory, like uncovering a straight path through what seemed like a labyrinth.

I began reading a book each week (and updating my StoryGraph profile). It feels like compound interest for developing interesting ideas.

I’m thankful to my colleagues at Glossier, from whom I’ve learned so much: hiring and interviewing, planning and coordinating large initiatives, building trust and alignment, and knowing when to be scrappy and pragmatic. Today is the first Black Friday weekend in four years that I’m not primary on-call. It’s a great team, and I’m humbled to be part of it.

I’m grateful for new opportunities. In December, I’m starting a new job that feels like a profound blessing. I think it’s the flow state I’ve been craving, challenging me to the brink of my abilities. But that’s a story for another day.

Happy Thanksgiving.


November 7, 2021 — Tags: epistemology, mental models, programming

One of the more powerful concepts I’ve found is considering which type of analysis to apply to the challenge at hand. It’s been especially useful when coaching a software team to put in more and different effort to find the root cause of a failure, or occasionally to save effort when over-analyzing a situation.

So here’s my survey of ‘types of analysis’, inspired by (and expanding on) a section of Accelerate, chapter 12.

  • Forensic analysis. Confidence of claims: low; new evidence could drastically change conclusions. General uses: paleontology, criminal investigations. Software development uses: security incident investigations.
  • Predictive analysis. Confidence of claims: modest; based on suppositions about the past and present contexts. General uses: meteorology, government policies. Software development uses: roadmap planning, project selection, scoping, and sequencing.
  • Causal analysis. Confidence of claims: higher; this was definitely true in the past. General uses: clinical trials. Software development uses: A/B tests of user behavior.
  • Mechanistic analysis. Confidence of claims: very high; claims will be true forever. General uses: rocket science, material science. Software development uses: unit testing, debugging.

Forensic analysis

Forensic analysis seeks to understand an incident when it’s impossible to fully recreate the incident.

Forensic analysis gathers evidence, then uses abductive reasoning to find the most reasonable explanation that fits the evidence. When criminal detectives and attorneys argue a theory of a crime, or paleontologists describe how dinosaurs behaved based on fossil records, they’re using abductive reasoning. Any claims based on forensic analysis and abductive reasoning are contingent on the available evidence. It’s always possible that new evidence comes to light and dramatically changes our understanding of an incident.

In software development, we use forensic analysis to investigate security incidents, or when we search through logs to figure out when or how something occurred.

Forensic analysis often requires the least effort to perform, and produces the least certain or generalizable conclusions.

Predictive analysis

Predictive analysis uses historical data and inductive reasoning to make claims about the future in an uncontrolled environment.

When a meteorologist makes a weather forecast, or when the CBO forecasts what a new tax policy will cost over the next decade, they’re using predictive analysis. When lawmakers consider the pros and cons of passing a law, they (ideally) use predictive analysis.

Because these analyses are uncontrolled, it’s impossible to repeat an experiment. That is, if you make a decision based on forecasts, you can’t know precisely what would have happened if you’d made a different decision. Like, what would society be like if the government had passed a different law years ago? It’s a counterfactual. The best we can do is speculate.

In software development, we use predictive analysis when creating roadmaps, scoping, and sequencing our projects. We can only speculate about how things would turn out if we chose different projects.

Causal analysis

Causal analysis uses controlled experiments to see both sides of a decision (i.e., a treatment and control group). This lets us make much stronger claims about the efficacy of a treatment than predictive analysis alone.

Causal analysis is best known in medicine through clinical drug trials. By comparing the outcomes of subjects in the control and treatment groups, we can be fairly certain that differences in outcomes were caused by the treatment.

In software development, we use causal analysis when we do A/B tests. Is the user more likely to click the button if we present information this way, or that way?

Causal analysis is not automatically generalizable. Are the subjects in an experiment representative of the broader population? You need to re-do the experiment to find out. Say an e-commerce company finds that adding a big flashing “NEW” icon next to products increases sales. How long does that effect last? You’d have to do another A/B test to find out. Would it work as well for a different brand? Gotta do another A/B test.

Causal analysis is clever in that the mechanism by which the treatment works is irrelevant to the experiment. It doesn’t let us say why a new drug works, or why users click the button more; just that we have some degree of confidence that the treatment causes some desirable outcome.
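
To make the arithmetic concrete, here’s a minimal sketch of a two-proportion z-test for an A/B test in Ruby (the visitor and conversion counts are made up):

    # Did variant B's conversion rate beat variant A's?
    a_conv, a_n = 480, 10_000   # control: conversions, visitors
    b_conv, b_n = 560, 10_000   # treatment: conversions, visitors

    p_a = a_conv.fdiv(a_n)                     # 0.048
    p_b = b_conv.fdiv(b_n)                     # 0.056
    p_pool = (a_conv + b_conv).fdiv(a_n + b_n) # pooled rate under the null
    se = Math.sqrt(p_pool * (1 - p_pool) * (1.0 / a_n + 1.0 / b_n))
    z = (p_b - p_a) / se
    puts z.round(2) # => 2.55

Since 2.55 exceeds the conventional 1.96 cutoff, we’d be fairly confident the treatment caused the lift; and as noted above, the test says nothing about why.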

Mechanistic analysis

Mechanistic analysis supports the strongest claims about why and how a system works.

It requires a thorough understanding of the mechanisms that make the system work. You can repeatedly apply the well-modeled system to new situations. It relies on well-known scientific theories that are taken as axioms, and uses deductive reasoning to derive useful applications.

Some examples: When 19th-century inventors sent electrical current through a carbon filament and discovered that the filament glowed, they discovered a physical system for producing artificial light. You can re-use that system: every carbon filament glows in predictable ways when you run electricity through it.

When rocket scientists calculate how much fuel they’ll need to launch a payload into orbit, or when civil engineers analyze whether a bridge design will support a given load, they use mechanistic analysis. The answer is the solution to a mathematical equation.

They don’t build many bridges, then drive cars over them to see if they stay up. That kind of predictive analysis is insufficient for the challenge.
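
To see what “the answer is the solution to a mathematical equation” looks like, here’s the Tsiolkovsky rocket equation worked in Ruby (the delta-v and Isp figures are round, illustrative numbers):

    # Tsiolkovsky rocket equation: delta_v = isp * g0 * ln(m0 / mf).
    # Given a required delta-v and an engine's specific impulse (Isp),
    # deduce the required ratio of fueled mass to dry mass.
    G0 = 9.81 # standard gravity, m/s^2

    def mass_ratio(delta_v_m_s, isp_seconds)
      Math.exp(delta_v_m_s / (isp_seconds * G0))
    end

    # Roughly 9,400 m/s of delta-v to reach low Earth orbit, with an
    # Isp of 350 seconds:
    puts mass_ratio(9_400, 350).round(1) # => 15.5

The conclusion (over 90% of the launch mass must be propellant) follows deductively from the model; no experiment is needed to trust the arithmetic.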

In software engineering, mechanistic analysis is best used when testing and debugging. If we can reliably reproduce a bug, or reliably verify the correctness of a system, it’s strong evidence that we’ve isolated a property of our system, and we’re well on our way to knowing how to modify the system to fix the bug.

Mechanistic analysis requires the most effort, but yields the highest confidence in our understanding of a system.

Conclusion

Enumerating these 4 types of analysis from low effort / weak claims to high effort / strong claims has helped me coach teams toward a pragmatic depth of understanding. To decide if we should invest effort in a product change, we’ll use predictive analysis. To learn exactly how much the product change improved the user experience, we’ll use causal analysis. And to ensure we thoroughly fix any bugs along the way, we’ll use mechanistic analysis.


October 31, 2021 — Tags: mental models, management, leadership

Recently, I’ve found 2 leadership techniques particularly helpful. They’re both highly leveraged: they empower those around me to work effectively with less direct input and coordination from me.

Completed Staff Work

Completed staff work is a rigorous “definition of done” when one needs to clarify and recommend a decision to someone else.

It’s useful when the responsible person is different from the accountable person in a RACI matrix.

How I’ve applied it

When my colleagues present a decision that’s incomplete, my instinct is to do the additional work myself and model what I think is missing. Completed Staff Work helped me recognize that I overplay that Do-It-Myself technique (often at the expense of other priorities), and that I can instead give clear feedback about what’s missing.

Commander’s Intent

Commander’s intent is about explaining “why” a particular task or instruction is important. Then, if the particular task becomes unrealistic, knowing the intent allows others to creatively and independently solve the problem with less input from the “commander.” I think of it as the “spirit of the law” rather than the “letter of the law.”

Knowing the intent empowers operators to better improvise and improve on rote instructions.

How I’ve applied it

I didn’t realize it at the time, but I used commander’s intent to evolve Glossier’s incident response process. We wanted to better detect and mitigate incidents. And we wanted to prioritize remediation work that would eliminate several types of failures. When this work inevitably ran up against product feature tradeoffs, the team was able to navigate those tradeoffs well by referring back to the explicit intent of our incident response process, namely to continuously improve product quality and team productivity.

I encourage other leaders to use ‘completed staff work’ to teach their team to make clear decisions, and use commander’s intent to allow individual autonomy while staying aligned with the group.


October 20, 2021 — Tags: incidents, site reliability

I appreciate reading stories of how complex software systems fail and the hard-earned lessons to make them more resilient.

Here is another of my personal favorite incident anecdotes. It sticks out because it helped broaden my thinking about appropriate uses for A/B tests.

Key lesson

A/B tests can be useful to verify that a software system works correctly—they’re not just for testing the user experience.

Setting the scene

My team had spent about 8 months rewriting our e-comm frontend from dynamic, backend-rendered HTML to a statically rendered frontend (SSG, for static site generation). The main goals of the project were to make our site more scalable (by reducing the need for backend rendering and DB queries) and to reduce latency.

We began QA’ing the new SSG version of Glossier behind a feature flag with a fancy Cloudflare routing config.

In order to quantify the revenue impact of the project, leadership requested we do an A/B test on the conversion rate.

The team and I were initially reluctant, since an A/B test for this particular infra migration required one-off support in our Cloudflare Workers. We hadn’t planned to A/B test SSG because it wasn’t an optional feature — we needed SSG for our Black Friday traffic.

But it’s fair to ask us to back up our aspirational claims with data. And boy were we surprised when the early A/B results showed SSG had a worse conversion rate than our slow, dynamically-generated control.

We dug in, and realized that almost no customers from the UK converted in our SSG treatment. That helped us pinpoint a typo in our localization code (en-UK instead of en-GB). This caused customers with a UK IP address to default to the US store. Confused, they’d bounce rather than change their locale in a footer widget.
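
The bug had roughly this shape (a hypothetical Ruby reconstruction, not our actual code):

    # The UK's locale tag is 'en-GB' (GB is the ISO country code);
    # 'en-UK' doesn't exist.
    SUPPORTED_LOCALES = ['en-US', 'en-UK'].freeze # bug: should be 'en-GB'

    def storefront_locale(geoip_locale)
      SUPPORTED_LOCALES.include?(geoip_locale) ? geoip_locale : 'en-US'
    end

    storefront_locale('en-GB') # => 'en-US': UK visitors get the US store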

Note that we’d certainly tested many international locales; but we’d tested them by manually changing our locale (which worked) rather than through the geo-IP lookup that’s the default for most users.

We fixed the typo, re-ran the A/B test, and sighed with relief at a modest lift in the conversion rate.

The A/B test was useful for QA! It would have been more difficult and costly to find that typo had we launched without an A/B test.


September 26, 2021 — Tags: productivity, management

Two colleagues recently asked me for personal productivity advice. I suspect one of them even gave me the Disco compliment:

Aaron Suggs always gets projects over the finish line.

Unfortunately, there must be some misunderstanding because I don’t feel like I get an unusual amount done, nor am I particularly strategic about it. My motivating principle is to avoid constant anxiety.

So here’s some free advice from an unqualified amateur.

First, I’ll point to better resources:

  • Getting Things Done (GTD) by David Allen. I first read it in 2005. It has a lot of durable, influential ideas, though its notepad-based implementation can be updated for the smartphone era.
  • Atomic Habits by James Clear. I really liked the emphasis on mindset and environment in changing habits.
  • The Rise and Fall of Getting Things Done by Cal Newport. This is a modern, wide-lens perspective on GTD and the personal productivity domain.

Here are some productivity techniques that I’ve found useful:

  1. Touch-it-once: Once a task has your attention, try to see it through to completion so you don’t need to ‘touch’ it or think about it again. For example, when I check my mail, if there’s a bill, I open and pay it right away (or better yet set up auto-pay). Then I can recycle the bill. I never set it down nor remember to pay it later. It means checking the mail sometimes takes a few minutes, but it doesn’t generate future work or accumulate in piles.

  2. Ubiquitous capture: Make it easy to leave notes to your future self, whether by your bedside table late at night or first thing in the morning, at your computer, in the car, or anywhere. I use the Reminders app on Apple Watch (usually via Siri), iOS, and macOS. And I use Things app to organize complicated projects. I organize my reminders to notify me when and where I can act on them. E.g. say in an 11am meeting we make a decision I need to communicate to my team. And say I’m busy until 3pm. I’ll make a reminder to share the decision with the team at 3pm. I can relax knowing that my system will notify me when I’m able to act on it.

  3. Write down the next action. If you need to interrupt a task (see #1 for why this should be rare), leave notes to your future self to make it easy to pick up where you left off. What were you about to do? On a project, you’re doing one of 4 things:

    1. Researching - understanding the problem
    2. Brainstorming - generating ways to solve the problem
    3. Communicating - getting approval/alignment, informing or training stakeholders about how it affects them
    4. Implementing - executing the work you brainstormed earlier.

    If you’re not doing one of those 4 things, stop, and ask yourself which one of those 4 things you should do to make progress.

  4. Be easy to follow: Write down your work process so others can imitate it. Put it in the first place you’d look for it in the future (code review comments, Jira ticket, wiki, etc). Share the checklist, notes, thought process that you went through. This feels like extra work in the moment, but pays off in the long-run.

  5. Know yourself. When do you focus best? What type of work is a chore that saps energy? Get the chores out of the way, and then treat yourself to the more enjoyable tasks. And don’t force yourself to be productive if you’re really not in the headspace for it. Focus on the work you’re able to do.

  6. Consider Satisficing vs maximizing: Ask yourself if this project benefits from a quick, low-effort satisfying and sufficient solution (i.e. satisficing), or a high-effort maximizing solution. Most of the time, the answer is satisficing.

Those are six strategies that help me remember details and stay focused. Please let me know on Twitter if you have any to share.


September 7, 2021 — Tags: site reliability, programming

Below is the mental model I use when designing or reviewing web services to fail gracefully. It’s useful to enumerate common failure types, and ensure engineers intentionally plan for them.

For each network call the service makes (to a data store or API), consider how it would behave if:

  • the call immediately returns an error. Do you have appropriate error handling, instrumentation, and retry logic?
  • the call never returns. Do you have an appropriate timeout? Can you determine if the network call should be retried? Ruby’s Unicorn web server has a concise introduction to application timeout considerations.
  • the responses are temporarily incorrect. Do you have logging and instrumentation to figure out which data are affected?

By addressing these 3 questions, you’ve built a solid foundation for a reliable, maintainable web service.
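
Here’s a minimal Ruby sketch of those three safeguards (the endpoint, timeouts, and retry budget are illustrative assumptions):

    require 'net/http'
    require 'uri'

    # Wrap a network call with explicit timeouts, limited retries,
    # and logging for failures.
    def fetch_with_safeguards(uri, retries: 2)
      attempts = 0
      begin
        attempts += 1
        Net::HTTP.start(uri.host, uri.port,
                        use_ssl: uri.scheme == 'https',
                        open_timeout: 2, read_timeout: 5) do |http|
          http.get(uri.path) # GET is idempotent, so retrying is safe
        end
      rescue Net::OpenTimeout, Net::ReadTimeout, SystemCallError => e
        # Instrumentation: record which call failed and how, so affected
        # data can be traced later.
        warn "request to #{uri} failed (attempt #{attempts}): #{e.class}"
        retry if attempts <= retries
        raise
      end
    end

    fetch_with_safeguards(URI('https://example.com/api/status'))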


August 5, 2021 — Tags: values, management

Every team has values that guide their work. Of course, I like to write down some aspirational values, preferably in a charter.

Here are some of my favorites over the years:

Be exemplary: Our work, process, and demeanor should serve as an example to our coworkers, and to the industry in general. We aim to move the organization forward.

Blameless self-reflection: No one naturally likes talking about their failures, delays, or lack of understanding. Yet doing so is safe, healthy, and necessary to do better tomorrow. Strive to only make new mistakes.

Be approachable: Work collaboratively with other teams. Share responsibility and model a generative culture. Embrace and guide others scratching their own itch to do so in a way that benefits the whole team.

Explain why, and ensure it’s safe to ask why: Be transparent and explicit about how we prioritize, scope, and implement projects. Documenting the reasons and context for these decisions lets us easily adapt when we encounter new circumstances, and quickly onboard new team members.

Embrace experimentation with limits to avoid tech debt: Balance the virtues of consistent, familiar tools with innovative experiments. Experiments are the engine of innovation and continuous improvement.

And especially for platform/SRE teams serving internal customers:

Pave the common path: Common tasks should be easy (or automated!) and well-supported.

Empower our colleagues: Build simple, reliable processes and tools. Avoid being a bottleneck with self-service tools so others may help themselves.


July 28, 2021 — Tags: incidents, site reliability

I appreciate reading stories of how complex software systems fail and the hard-earned lessons to make them more resilient. In fact, one of my favorite software interview questions is “tell me about a time you were involved in a production incident.”

Here is one of my personal favorite incident anecdotes. It sticks out because of the cognitive bias that slowed our diagnosis, and how thoroughly we were able to prevent similar incidents in the future.

Key lessons

  1. It’s useful to reconfirm basic assumptions if Subject Matter Experts are stumped.
  2. Listen to all the voices in the room.
  3. Thorough remediation means mitigating future failures in multiple independent ways.

Setting the scene

It was early 2015 at Kickstarter. Our Rails app used 3 memcached servers running on EC2 as a read-through cache. We were expecting a high-visibility project to launch in the coming days, so per our standard practice, we scaled up our unicorn app processes by 50%. In this case, that meant going from 800 to 1200 unicorn workers.

In prior months, we’d been battling DDoS attacks, so I was primed to expect unusual app behavior to be a new type of abusive traffic.

The incident

Out of the blue, our team was paged that the site was mostly unresponsive. A few clients could get a page to load within our ~60 second timeout, but more often clients got a 504 gateway timeout error. Several engineers, including myself, joined our incident slack channel to triage.

Digging into our APM dashboards, we saw that the public stats page was saturating our database CPU with slow queries, which meant our unicorn web workers hung while waiting on DB queries to render pages.

That was strange: while the stats queries are slow, we kept the cache warm with a read-through and periodic write-through strategy. If the results fell out of the cache, the page should hang for just a few seconds, not cause site-wide impact for several minutes.

“It’s as if memcached isn’t running,” said one prescient engineer. I ignored the comment, too deep in my own investigation. Memcached doesn’t crash, I thought. It must be our app bug, or some clever new denial-of-service vector to generate DB load.

After roughly 40 minutes of fruitless head scratching, the prescient engineer piped in, “I ssh’ed into one of the cache servers, and memcached isn’t running.”

If we’d had an Incident Manager role, we’d likely have checked memcached sooner.

Biggest. Facepalm. Ever.

The fix

Moments after we confirmed memcached wasn’t running, we restarted it with /etc/init.d/memcached restart, and the site recovered within a few seconds.

With the incident mitigated, our investigation continued. Why wasn’t memcached running? Our cache cluster had been healthy for years. The EC2 hosts were healthy. Yet each memcached process had crashed in the past few hours. Only in retrospect did we observe that the site was slightly slower as the first 2 crashed. We certainly noticed the near-complete outage when the final process crashed.

Digging through our app logs, I noticed sporadic connection errors to memcached. Apparently, we still had the default ulimit of 1024. So when we scaled to 1200 app workers, only 1024 could connect, and the remaining 176 would get errors. The Ruby memcached client would automatically attempt to reconnect every few seconds.

I was still puzzled why memcached had crashed, so I searched through the code commits for anything mentioning “crash.” And eureka! This commit mentions exactly our issue: as clients connect and disconnect when memcached is at the ulimit’s max connections, a race condition can crash the server. The default version of memcached that came with our Ubuntu version happened to predate the fix. I was able to reliably recreate the crash in a test env.

With all this in hand, the team implemented several fixes:

  1. I ported the default init.d script to runit, our preferred tool at the time, to automatically start processes if they crash. This would make the impact of the crash negligible.
  2. We increased the ulimit to accommodate more workers. This improved latency because ~15% of our workers were effectively without cache.
  3. We upgraded memcached to patch the ulimit issue.
  4. We added an alert for when memcached isn’t running on a cache server, to reduce our time-to-detect.

Items 1-3 are each sufficient to prevent this particular memcached crash from having a significant impact on our customers.

This was the first and only incident with memcached crashing in my 7 years at Kickstarter.

Wrapping up

This incident taught me to be a better listener to all the voices in the room, even if it means questioning assumptions that have served me well before.

And it taught me to be tenacious in tracking down causes for failures, rather than stopping at the first sufficient mitigation. Reading the source code can be fun and rewarding!


June 18, 2021 — Tags: vendors, management

I published Guiding principles for build vs. buy decisions on LeadDev as part of their Technical Decision Making series.

Here’s the conclusion:

I encourage you to consider ‘build vs. buy’ primarily from the lens of whether the opportunity merits a long-term strategic investment of your team’s attention, and less from the lens of short-term financial cost. Build if there’s an opportunity to make a significant improvement on the state of the art and create a competitive advantage for your organization. Buy it otherwise. And be ready to discard your competitive advantages of yesteryear as better alternatives emerge.

Together with Choosing software vendors well, it feels like a coherent strategy.


June 2, 2021 — Tags: books

Here’s a log of audiobooks I’ve listened to recently, with some notes.

Nonfiction

Sapiens: A Brief History of Humankind by Yuval Noah Harari

This was a wonderful book. Some of my favorite points:

  • Corporations and nations are a collective fiction like religions.
  • Money is a uniquely valuable technology because it transcends culture.
  • Agriculture and the Neolithic revolution changed humanity in harmful ways, increasing the likelihood of violent conflict and poverty (reminiscent of Ishmael by Daniel Quinn).
  • The chapter on happiness and Zen Buddhism gave me galaxy brain.

The Essential Drucker by Peter F Drucker

A wide-ranging collection of insights on business management, many from the early-mid 1900s.

Good to Great by Jim Collins

I read this because it’s popular among Glossier leaders, with frequent references to a ‘flywheel’ and ‘getting the right people on the bus’. I found it quite valuable.

Some notes:

  • Humility and egolessness are critical leadership skills.
  • You are more likely to get revolutionary results from an evolutionary process than a revolutionary process. I.e. evolving a process is like compound interest.
  • Opportunity selection is more critical than opportunity creation.

Inspired by Marty Cagan

I read this because an Eng/PM friend recommended it when I confessed a lot of role confusion amongst PMs, Eng Managers, and tech leads. It’s a good primer on what Product Management should be. I particularly appreciated the emphasis on finding reference customers as a symbiotic partnership.

Doughnut Economics by Kate Raworth

A thought-provoking exploration of an economics that doesn’t assume indefinite growth. She argues that systems thinking (stocks and flows) is much more helpful to economics than trying to discover physics-like natural laws and constants.

Algorithms to Live By by Brian Christian and Tom Griffiths

I especially liked applying the multi-armed bandit approach to explore/exploit trade offs in everyday life (like whether to try a new restaurant).

Structure of Scientific Revolutions by Thomas S Kuhn

I re-read this for the first time since college. One point that really stuck out was that work on novel paradigms is often accessible to a non-academic audience. Examples were Newton’s Principia and Darwin’s Origin of Species. In contrast, once a paradigm is well-established, academic work becomes deeply niche and inscrutable without decades of training.

Turns out, the hard sciences are more subjective than we realize.

Fiction

  • His Dark Materials trilogy by Philip Pullman
  • The Broken Earth trilogy by N. K. Jemisin
  • The Yiddish Policemen’s Union by Michael Chabon
  • Project Hail Mary by Andy Weir. Rocky! I liked The Martian and Artemis. This is my favorite of the three. Weir has really found his groove.
  • Death’s End (Remembrance of Earth’s Past trilogy) by Cixin Liu

May 23, 2021 — Tags: management, release engineering, productivity

There’s a crucial moment in platform engineering projects when you decide it’s ready to ship. For large projects (say, more than 1 year of engineering effort), it can be a difficult decision for the team. More cautious contributors want to delay for further testing and polishing. Other teammates inevitably begin to shift their attention to their next project, and are eager to move on.

I’ve found a simple criterion for navigating the risk/reward trade-off for launching a complex project:

Ship when the project is an improvement on the status quo.

If current engineering risks make it uncertain that it’s an improvement, continue testing and fixing defects until it’s clearly an improvement.

And all those extra features on the backlog: you can still build them, but it’s not worth withholding the value you’ve already created while you do so.

I’ve found this to be particularly useful for architecture migrations or component rewrites where achieving ‘feature parity’ with a deprecated implementation or ‘feature completeness’ of our ideal product is difficult or not worth the opportunity cost. Agreeing to ship a feature when it’s a net improvement over the existing solution ensures our team delivers value as quickly as possible, and helps us focus our effort on the most impactful work.


May 16, 2021 — Tags: programming, release engineering

My team has been discussing the role of various test and development environments. We’d like to provide guidance for what developers should test locally on their laptop, on an ad hoc deployed environment, on a pre-prod environment, and on production.

I’d like to share some criteria that help me organize the value and purpose of various environments.

Let’s start with 3 key features of an ideal dev env:

  • Fast feedback. It should be as quick as possible to change some code in your editor and see the effect in your dev environment.
  • Similar to production. The environment should be maximally similar to the full production system. There are several aspects in which it may be similar or different, such as the infrastructure it runs on, configuration, data, and traffic load.
  • Isolated. Any side effects and failures should be isolated from impacting actual users, or even your colleagues (e.g. minimize the time that one person breaking the build blocks teammates).

In practice, fast, similar, and isolated aren’t so much features, but continuous dimensions that we try to maximize. We can carve out roles for various dev envs by considering the relative importance of each dimension.

Local development

For local development environments (i.e. code running on your laptop), I’d rank the importance as follows:

  1. Isolated
  2. Fast
  3. Similar

In other words, it’s most important that local envs are isolated from breaking anything on production or anyone else’s environments. The 2nd priority is fast developer feedback as long as it doesn’t compromise isolation. And the 3rd priority is being production-like, as long as it doesn’t compromise isolation or fast feedback.

Features like Webpack’s Hot Module Replacement and React Hot Reloading improve feedback time, but detract from being production-like. That’s a win for local development, since ‘Fast’ is more important than ‘Similar’.

By similar reasoning, local development is a good place to run uncommitted code, or to dynamically generate assets that would be immutable deploy artifacts on production.

Testing on production

What about practices that let you more safely test on production, like feature flags and blue-green deployments? I see the ranking as:

  1. Similar
  2. Isolated
  3. Fast

‘Similar’ is de facto top priority since it is production. Next up, our goal is to isolate failures and unintended side effects as much as possible. And finally, we want fast feedback as long as it doesn’t compromise isolation.
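
To illustrate, here’s a minimal percentage-based feature flag in Ruby; a sketch of the core idea, not any particular library:

    require 'zlib'

    # Give each user a sticky bucket per flag, then enable the flag
    # for the first N% of buckets.
    def flag_enabled?(flag_name, user_id, rollout_percent)
      Zlib.crc32("#{flag_name}:#{user_id}") % 100 < rollout_percent
    end

    # Serve the new code path to 5% of production users; everyone else
    # stays isolated on the known-good path.
    if flag_enabled?(:new_checkout, 1234, 5)
      # new implementation under test
    else
      # existing implementation
    end

Because the hash is stable, each user gets a consistent experience, and rolling back the new code path is a config change rather than a deploy.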

Other deployed environments

Where does that leave environments like staging, QA, or other quasi-production environments? For decades, they’ve been a middle ground between local development and production.

As release engineering and local development tooling improves, I’m finding fewer reasons to maintain them. More likely, I’m going to invest in ways to gain confidence in my code locally, or build ways to safely test it on production.

Let’s recall the aspects in which an environment can be production-like: infrastructure (as in the CPU and memory resources, operating system, and system libraries), configuration, data, and traffic.

Years ago, infrastructure and configuration were frequent sources of bugs. Code might work on a developer’s macOS laptop, but not on a Linux server. Or we forgot to set all the environment variables we expected. Staging environments were a critical place to suss out those bugs. Lately, infra-as-code tooling and better configuration patterns like Terraform, CloudFormation, and Docker have made these issues rare.

Most bugs I see on production today are related to data (i.e. unexpected states) or traffic (unexpected resource contention or race conditions). Those are particularly difficult to suss out in non-production environments.

Sometimes creating these non-production integration environments means adding and maintaining new code paths. For example, Stripe maintains special payment cards in its sandbox environment that always succeed or return predictable errors. That’s behavior unique to the sandbox, not found on production. In order to be useful for isolated testing, they had to compromise on being similar to production. As I think about a constellation of microservices that could make up a complete test environment, the support cost of these alternate code paths can add up quickly.

For SRE / Platform / Release Engineering teams tasked with supporting developers across the entire delivery lifecycle, we must choose where our attention can have the most impact for the organization. I’m finding that ever more often the focus is on fast local development and safe production releases, and there are fewer reasons to maintain non-production deployed environments.

Check out “The value of reliable developer tooling” for some of my prior work on dev envs.


March 27, 2021 — Tags: capacity testing

A friend recently asked how to set better capacity testing goals for their tech team. We agreed we needed to get more specific about what “capacity testing” means.

Below are the key terms I’ve found helpful. I go into more detail about these in my talk Surviving Black Friday (particularly at the 8:00 mark).

Expected traffic

This is a plausible traffic volume based on prior data and forecasts. I encourage this number to come from cross-functional consensus from Eng, Data, Marketing, Product, Sales, etc. Express it as a rate like reqs/sec or orders/min.

If the expected traffic is too low, the service could crash from insufficient capacity. But if it’s too high, you’ve wasted scarce engineering effort building capacity you don’t need (analogous to excess inventory in a supply chain). Site downtime is usually more costly than wasted engineering effort. The virtue of highlighting wasted engineering effort is to recognize the marginal effort and opportunity cost for engineers to support greater capacity.

Safety factor

This is the multiplier or fudge factor to give yourself breathing room. I’d suggest 20-50x for early-stage startups that don’t know when they’ll go viral, 5-20x for growth-stage businesses, and <5x for mature businesses with robust prior data and tightly-managed sales/marketing plans. At Glossier, we currently use a 10x safety factor. We were bitten in 2018 by a 5x safety factor and an insufficiently detailed traffic forecast.

Capacity target

This is what you’re aiming for. capacity target = expected traffic * safety factor

So assuming expected traffic of 500 req/sec, and a safety factor of 10x, your capacity target is 5,000 req/sec.

Demonstrated capacity

This is the load that your engineers have proven your system can handle during their load tests. Keep scaling and removing bottlenecks until your demonstrated capacity exceeds the capacity target.

Pro tip: run your load tests for several minutes (we do an hour) to “soak” your infrastructure (kudos to Rajas Patil for introducing this idea). This can reveal bottlenecks that don’t show up in quick tests. For example, at Glossier, the data replication from our customer-facing DB to our BI data warehouse was significantly delayed during a 60-minute soak test, but we wouldn’t have noticed during a quick 5-minute test. By detecting the replication delay early, we had time to mitigate it.


March 14, 2021 — Tags: management

The Glossier Tech team wrapped up our annual roadmap exercise earlier this year. It takes a lot of time and attention from the team, especially managers.

I wanted to share some tips I’ve gleaned to make reviews easy and productive. They’re organized into ‘filters’, or questions to ask about each project in a roadmap.

If product and engineering managers can speak to each of these filters, they’ll likely have a smooth review with no surprises.

Virtually all the questions and feedback that came up in our roadmap reviews fall into one of the filters below; and each one is a hard-learned lesson from watching my projects or teams stumble.

1. Sufficiently detailed

The appropriate level of detail increases during the roadmap process. In general, sufficient detail means that project outcomes and requirements are defined, and that key decisions and risks are highlighted and investigated. A 6-month project may not have a clear approach at the beginning of the roadmapping process, but by the end it would likely have specific, realistic outcomes for each 2-week sprint.

Having documented examples of project plans with the appropriate detail is helpful here (see Will Larson’s Discouraging Perfection). Some people take roadmapping too seriously, going into so much detail that the precision of their plan exceeds its accuracy. They get frustrated when they need to adapt to the unexpected. Others can be too casual, or hedge so much that it’s difficult for others to depend on them. The key is the psychological safety to acknowledge that plans are imperfect and will inevitably change. The point is sufficient detail to reduce risks, not complete and rigid precision.

2. Aligned with business goals

Are these projects sufficient to meet the team’s mission and biz goals? If not, change up the projects, or set more realistic goals. For example, if a goal is to increase a conversion rate by X% this year, but the projects to improve conversion ship at the end of the year, they likely won’t have a significant impact on the conversion rate and there’s little time to respond.

3. Comprehensive of all work

Does this roadmap account for all the work the team will have to do? If a team spends 20% of their time responding to bugs filed by the customer support team, that should be accounted for in the resource planning. We call this Keep The Lights On (KTLO) work.

4. Sequenced effectively

Which projects have strict deadlines? Do Team A’s projects depend on one of Team B’s projects? Does Project X become easier or more valuable if we do Project Y first? A group roadmap review is one of the more obvious places to suss this out.

5. Resourced for success

Does the team have appropriate people and skills to deliver each project? What skill gaps or “bus factors” are there? What’s the plan to get those skills (hire, train, or borrow)?

6. Iterative milestones

Can you frontload more business value? I.e. be more agile and less waterfall. Are there narrow customer segments or journeys that you could support early on while you develop the rest of the project? Are there milestones that de-risk the project and enable real feedback as soon as possible?


Having presented and reviewed several roadmaps, I’ve found these filters to be helpful linting tools for making more useful roadmaps. Or at least they let me learn new ways to fail rather than repeat my previous mistakes.


March 10, 2021 — Tags: management, productivity

While reviewing how I’ve spent my time recently, I stumbled into a practice to better ensure I have sufficient flexible time for serendipitous projects. I’ll aim to schedule a max of ~80% of my time for inflexible work like group meetings.

The practice was inspired by “hara hachi bun me”, the Confucian practice of eating until you’re 80% full.

Dysfunction of the over-booked calendar

What’s the harm in scheduling every minute of your day? If appointments are hard to move, it adds friction to saying ‘yes’ to unexpected opportunities. I found myself disinclined to make time if it had administrative overhead like rescheduling meetings and delaying project timelines.

Applying some lean production theory, as your schedule becomes 100% utilized, the wait-time for any new task approaches infinity.
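
A back-of-envelope way to see this is the M/M/1 queueing rule of thumb, where wait time scales with u / (1 - u) at utilization u:

    # Relative wait time vs. schedule utilization (M/M/1 intuition).
    [0.5, 0.8, 0.95, 0.99].each do |u|
      printf("utilization %.2f => relative wait %.1f\n", u, u / (1 - u))
    end
    # utilization 0.50 => relative wait 1.0
    # utilization 0.80 => relative wait 4.0
    # utilization 0.95 => relative wait 19.0
    # utilization 0.99 => relative wait 99.0

Going from 80% booked to 99% booked makes the queue for anything new roughly 25x longer.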

Remove friction to optimize your schedule

The solution: as I notice my schedule filling up, I more aggressively block off flex time on my calendar. To be sure, I find it very useful to be intentional with every minute of my schedule. Adrian Cruz describes this well in The Power of Quiet Time. So while I may have an hour or two blocked off for writing docs or making a prototype, I consider that flex time because there’s low friction to reschedule it to help debug a complex issue or have impromptu discussions.

Some flex time in your calendar makes it easy to say ‘yes’ when opportunity knocks.


February 18, 2021 — Tags: books

In mid-2020, I got an Audible subscription as a substitute for doomscrolling through social media.

It turns out I enjoy listening to books far more than reading them. Here are some books I enjoyed in the last 6 months (follow my Goodreads profile for more):

Nonfiction

  • The Machine That Changed the World by James P Womack. One-sentence book review: rigorous insights into Japanese lean manufacturing and keiretsu, emphasizing that the success is due not to a national or cultural identity, but to a set of practices thoughtfully applied.
  • Accelerate: The Science of DevOps and Lean Software by Nicole Forsgren: software teams should focus on deploy frequency, lead time, TTR, and change fail rate.
  • An Elegant Puzzle by Will Larson: bring expansive and systematizing mindset to every technical and management challenge; then work the process (not the exceptions).
  • The Goal: A Process of Ongoing Improvement by Eliyahu M. Goldratt: identify and eliminate bottlenecks to improve throughput. Bottlenecks can be subtle or unintuitive.
  • Don’t Think of an Elephant by George Lakoff: Controlling subtle and implicit metaphors has huge leverage to frame political debates. Personal values can often be grouped into a “dominant father” or “nurturant mother” mindset. (With Sapiens, I’m realizing this parallels chimpanzee and bonobo social hierarchies as well.)
  • Reimagining Capitalism in a World on Fire by Rebecca Henderson: the current corporate rules and norms undermined the long-term health of society, so leaders should advocate to change the rules for healthier incentives.
  • This Could Be our Future by Yancey Strickler: having an expansive and long-term notion of value (beyond, say, money) clarifies purpose.
  • Lives of the Stoics: The Art of Living from Zeno to Marcus Aurelius by Ryan Holiday. Stoics are surprisingly varied and relatable. I particularly appreciated the portrayal of Seneca as a flawed moderating influence on a corrupt leader.

Fiction


February 14, 2021 — Tags: bento, values

In January, I joined the Bento Society as a weekly practice in long-term thinking.

The society is born of Yancey Strickler’s book This Could Be Our Future. Bento stands for Beyond Near Term Orientation, and is a play on the neatly compartmentalized Japanese lunch tray.

In its simplest form, the Bento is a square divided into quadrants, with the x-axis being time (now and the future) and the y-axis our self-interest (me and us).

[image: blank bento]

One powerful application is to use the quadrants to tap into important parts of your identity. Write a question like “what should I do today?”, and envision how each quadrant would answer.

[image: ‘what should I do today?’ Bento]

“Now me” (your short-term self-interest) might want to binge watch Netflix, or knock out a work project that’s been on your mind.

“Now us” (your short-term group-minded self) might want to talk with a family member going through a hard time, or reconnect with an old friend.

“Future me” (your long-term self-interest) might want to work on a passion project, or practice a new skill.

“Future us” (your long-term group-minded self) might want to apply a new skill in a way that benefits your community.

All too often, I find that the “now me” gets to drive my life. Thinking through the Bento quadrants helps me balance near- and long-term interests; and balance self-care and service to others. It’s not about judging certain quadrants as good/bad or right/wrong; simply that no one quadrant is the complete picture of what matters.

After doing several Bentos, I find the exercise highlights the values and guiding principles I want to practice more thoroughly, like curiosity and compassion.

Participating in the Bento Society has been a helpful way to ground and orient my values and daily habits.


February 4, 2021 — Tags: vendors

If you follow the strategy of avoiding undifferentiated heavy lifting, you’ll inevitably integrate with many software vendors. For example, my e-commerce software team at Glossier has vendors for our cloud hosting, CDN, CMS, payment processing, CI/CD pipelines, code hosting, internal docs, observability, and alerting to name a few.

Here are a few criteria I’ve found particularly valuable when choosing vendors.

1. Emphasize rate of improvement over current feature set

Prefer a vendor that’s sufficient today and improving quickly over a dominant-yet-stagnant vendor. I judge vendors’ rate of improvement by their recent feature announcements and possibly a roadmap presented by an account rep. In other words, consider which vendor will likely have the best product ~2 years from now; not just the product as it exists today. Skate to where the puck is going.

This criterion is particularly helpful when comparing small, disruptive innovators with current market leaders.

Of course, if the current market leader is also innovating quickly, you’re lucky to have an easy decision.

2. Backchannel referrals / Ask an expert

In my StaffEng interview, I shared this anecdote:

Our team was recently choosing a new vendor and the team was split between two mediocre choices. I asked an acquaintance with expertise about the vendors how he would choose; and he recommended a lesser-known new vendor that quickly became a universal team favorite.

To expand on this example, it was an area where our team had little expertise. It was difficult for us to determine which features really mattered and to set realistic expectations. Asking experts in your professional network can bring clarity and confidence.

3. Emphasize net value over cost

From the Vendor relationship HOWTO:

The goal is to maximize our organization’s long-term value from the vendor’s service.

In contrast, I’ve sometimes seen teams try to minimize cost, ignoring gross value. This is short-sighted.

Suppose Vendor A costs $25k/yr and adds $200k of gross value to the org ($175k net value); while Vendor B costs $100k and adds $500k of gross value ($400k net value).

Choose Vendor B because of the higher net value, even though it’s more expensive than Vendor A.

To be sure, I don’t know how to assess the gross value derived from any vendor beyond a hand-wavy estimate. Here are some techniques I use; though I’d certainly like to learn more.

One technique is to look at productivity improvements. If a tool saves each engineer 1 hour per week, that’s a 2.5% productivity improvement (1 of 40 weekly hours); so its gross value is roughly 2.5% of your total Engineering payroll.
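
A back-of-envelope version in Ruby (the payroll figure is hypothetical):

    hours_saved_per_week = 1.0
    work_week_hours      = 40.0
    productivity_gain    = hours_saved_per_week / work_week_hours # 0.025
    eng_payroll_per_year = 5_000_000 # hypothetical
    puts (eng_payroll_per_year * productivity_gain).round # => 125000
    # Compare that gross value to the tool's annual cost.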

Other times vendors add capabilities or controls that change how the team works, so you can’t easily assess productivity. In this case, speculate about how much value your org gets from that capability or control. E.g. an A/B testing tool adds the capability to rigorously measure the impact of product changes. The gross value is the difference between the product features you ship using A/B test feedback and the product features you would have shipped without it. Security tools add controls that constrain types of risk. The gross value is the difference between the liability of the unknown, unconstrained risk and the better-known, constrained risk.


January 25, 2021 — Tags: vendors

Creating and sustaining vendor relationships can be a highly leveraged skill for software engineering teams. But there’s little guidance or structure at small companies for folks learning to build vendor relationships.

So here’s a template of bare essentials and some nice-to-have responsibilities to steer emerging engineering leaders in vendor management. This is born from my experience at companies ranging from tiny startups to growth-stage businesses with hundreds of employees. Larger companies have more formal processes for choosing and managing vendors.

This post won’t cover how to choose a vendor or the famous “build versus buy” calculus; instead, it focuses on what to do after you’ve chosen a vendor.

Without further ado, the template:


The goal is to maximize our organization’s long-term value from the vendor’s service. That means we use their service appropriately, and spend money efficiently.

Minimum essentials

  1. Each vendor should have a Directly Responsible Individual within the org. The DRI is responsible for the items below.

  2. Follow our org’s legal review process. Before you accept terms of service or sign anything, familiarize yourself with your company’s signing authority and approval process. In short, give our legal team a heads up; and they can help navigate contract discussions, particularly around liability and data privacy issues.

  3. Follow our org’s billing process. Give our accounting team a heads up to coordinate who keeps track of invoicing and receipts. Very small companies tend to use corporate charge cards. As they grow, it tends towards invoices and purchase orders with formal approval processes.

  4. Know how to contact the account rep, escalate tech support tickets, or otherwise get high-quality, timely technical assistance. Preferably, this contact info is stored in a well-known, discoverable place for all vendors. (We use Blissfully.) This is particularly important for business-critical vendors like payment providers and CDNs.

  5. Keep payment information up-to-date to avoid service disruptions; and make sure invoices are approved/paid on time. Check your emails!

  6. Use a vendor-specific email list like vendor-name@mycompany.com for all communication with the vendor. As our team grows and we onboard new members, they can easily review and join discussions. As the DRI, you’re responsible for staying on top of this email list.

  7. Ensure money is spent effectively. Should we change our terms to reduce our bill (like commit to a larger quota to reduce overage charges)? For large contracts (>$15k/yr), negotiate with the vendor (the finance team can help with this).

  8. When contracts are expected to change or expire without renewal, inform stakeholders with ample time to implement alternatives.

  9. Ensure the process for onboarding and offboarding employees with the vendor is documented clearly.

  10. Maintain a list of the PII and sensitive information that’s shared with the vendor. Your legal team can help ask the right questions here.

Nice-to-have strategic considerations

Here are some next-level ways to derive significantly more value from your vendor relationship:

  • Maintain a clear sense of the value this vendor provides the organization. Tech vendors typically use value-based pricing (as opposed to cost-based pricing), so being able to describe the value of various features ensures you and the account rep speak the same language.
  • Track how closely our usage aligns with the vendor’s typical customer usage. Do we use their service in a common, expected way, or in a custom, unusual way that could be a strategic risk as the vendor evolves? Are we one of their biggest/smallest customers (another strategic risk), or middle-of-the-pack?
  • Maintain a general sense of the competitive landscape and alternatives for the vendor. What’s our next best alternative if we had to move off this vendor? Are there competitors who have a superior service or are gaining quickly? When would it be worth the opportunity cost to build it ourselves?
  • Track and contribute to the vendor’s private roadmap (beta features). Usually the account rep will offer to discuss this once or twice per year.

Congrats, you’re well on your way to a productive, valuable vendor relationship!


January 21, 2021 — Tags: leadership, management

This interview originally appeared on StaffEng. I wanted to share it here as well.

Tell us a little about your current role: where do you work, your title and generally the sort of work that you and your team do.

I work at Glossier, a direct-to-consumer growth-stage skincare and beauty company with incredibly passionate customers. Our engineering team is ~35 people. I’m a Principal Engineer, mostly focusing on our Site Reliability and Tools team. My recent focus has been leading Glossier’s Operational Excellence initiative (nicknamed ✨GLOE✨) and ensuring we’re building scalable services and team practices. I define operational excellence as our ability to deliver low defect rates, high availability, and low latency for product features. In practice for the SRE/Tools team, that means improving observability, increasing our infra-as-code adoption, and shepherding our migration from a monolith to microservices.

In the Staff Eng Archetypes, I gravitate most towards being a right-hand, and secondly a solver.

Prior to Glossier, I was a Director of Engineering at Kickstarter. In 2018, I joined Glossier as a Senior Staff Engineer (an IC role), and as the first engineer to focus primarily on internal tools and engineering practices. My first projects were building a feature flag system so we could safely and easily test features with real data; then implementing continuous deployments to accelerate delivery.

After a few months, I switched back to management to lead a new Platform team and prepare for Black Friday. Glossier has an annual Black Friday sale that generates a huge spike in traffic and revenue, and our ambitious growth targets showed we needed to rigorously prepare with capacity testing, system hardening, and cross-functional collaboration (see Surviving Black Friday: Tales from an e-commerce engineer for details on Glossier’s Black Friday prep). After some re-orgs, the Platform team wound down, but the current SRE/Tools team does similar work. A year ago, I gave up my management responsibilities to focus more deeply on operational excellence.

Did you ever consider engineering management, and if so how did you decide to pursue the staff engineer path?

Absolutely! I’ve switched from manager to IC twice in my career; and I’ll likely do so again.

When I first became a manager in 2015, it was the only career path for a senior engineer at my company. Fortunately, ever-smaller engineering teams soon created and shared career ladders with parallel IC and management tracks. When I helped create Kickstarter’s engineering ladder, I emphasized IC growth paths that didn’t require people management.

I was deeply influenced by a section of Camille Fournier’s Manager’s Path that called out “empire building” as a toxic management practice. It reminded me of the argument in Plato’s Republic that political leaders shouldn’t be those who selfishly seek power, but rather those whose wisdom makes them duty-bound to lead.

So I don’t orient my career around ever-greater management responsibilities: it’s one tool in the toolbox. I appreciate management as a rich discipline that I’ll spend my career honing; alongside programming and systems engineering.

Here are some important factors for me when switching between manager and IC roles:

  • What skills does the team need most acutely: management to coordinate the actions of a group; or an IC to accelerate the execution?
  • Will I have sufficient support and feedback to learn and succeed?
  • Am I the only one on the team who could do this; or could others do it well?

Can you remember any piece of advice on reaching Staff that was particularly helpful for you?

“Replace indignation with curiosity.”

Several years ago, I told my manager about another team behaving in a way that caused problems for my team. When I finished, he gave me that advice. I hadn’t been curious about why the other team was acting that way. It turned out they had constraints that made their behavior quite reasonable. By approaching them with curiosity and a helpful mindset (instead of frustration), we quickly found a process that improved both our workflows.

More recently, while I was struggling with burnout, a career coach asked me, “What would let you approach each day with energy and optimism?”

It’s become my morning mantra, ensuring that I make time for operational excellence and mentorship and bring genuine enthusiasm to my work.

How do you spend your time day-to-day?

My days are roughly 50% scheduled meetings, 35% deep-focus blocks, and 15% unplanned work.

I work hard to make sure the meetings are effective. That usually means at least having an agenda. The meeting should have a clear purpose known to attendees beforehand, such as making a decision, generating ideas, or reviewing information. Meetings often have a negative connotation because they’re facilitated poorly, but they can be incredibly productive. I keep working to get better at facilitating meetings and using synchronous attention well. High Output Management by Andrew Grove is a great resource for learning about effective meetings.

A technique I recently learned from my CTO is to schedule reading time at the start of a group meeting. Say you’re in a hiring debrief: everyone spends the first 5 minutes reading each other’s feedback about the candidate. It’s a great way to ensure attendees truly read the document and have it top-of-mind. It ultimately saves time and elevates the subsequent discussion.

I also interview quite a bit. In 2020, I did (checks calendar) 126 interviews. Improving the long-term health of the team is a key Staff+ responsibility, and helping us hire great people is part of that.

The deep-focus blocks are marked off on my calendar. My company observes “No Meeting Thursday” which helps a lot. I use these blocks for work that’s ‘important but not urgent’ from Eisenhower’s productivity matrix. That’s usually writing specs and documentation, or researching and prototyping new tools and patterns.

My schedule is unusual in that I stop work around 4pm most days, then work later in the evenings, ~8-10pm. This gives me several high-quality hours with my family each day. I have difficulty concentrating in the afternoon, and can more easily concentrate at night. And I enjoy getting something done right before bedtime. So this schedule has improved both my work/life balance and productivity. I changed my schedule because of childcare needs during the coronavirus pandemic, but I think I’ll keep it long-term. I encourage everyone to reflect on what habits and schedules are helpful for their work. An open discussion with your manager and some flexibility can go a long way.

The unplanned work is mostly answering Slack messages, advising on urgent issues, or sometimes responding to a production incident. I try to approach this work with a helpful attitude, and also with an eye towards cross-training and writing discoverable documentation to minimize future unplanned work.

Where do you feel most impactful as a Staff-plus Engineer? A specific story would be grand.

I think of my impact in two ways:

  1. Working the plan
  2. Serendipity

‘Working the plan’ is about making daily, incremental progress on a big project with a team. One example was improving our site availability from under 99% to over 99.95% (99% allows roughly 87 hours of downtime a year; 99.95% allows fewer than five). That took a lot of Learning Reviews (blameless postmortems), training, testing, and refactoring. Another was a 9-month migration from dynamically generated, Rails-based HTML pages to statically generated, React-based ones to improve time-to-first-byte and availability. That took a lot of coaching, buy-in, and coordination. To successfully work the plan, you need clear goals and incremental milestones to keep the team motivated, plus continuous alignment with leadership on the desired outcomes and timeline.

‘Serendipity’ in my work is about sharing an insight with the right people at the right time to make a positive impact. For example, our team was recently choosing a new vendor and was split between two mediocre options. I asked an acquaintance with expertise in the space how he would choose, and he recommended a lesser-known vendor that quickly became the universal team favorite.

Another serendipitous example was an engineer mentioning during standup that a caching optimization wasn’t having the impact they expected. I happened to be familiar with the config options of the particular Ruby web server, and by interpreting some complicated metrics on the dashboard they showed, I could tell we had misconfigured a memory threshold. Later that day, we made a one-line config change that optimized our memory usage and reduced latency by 30%.
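
The post doesn’t name the server or the exact setting, so purely as a hypothetical illustration: this is the kind of memory-threshold knob that can quietly dominate latency, sketched here assuming Puma with the puma_worker_killer gem (all values invented):

```ruby
# config/puma.rb -- hypothetical illustration, not the actual change
before_fork do
  require 'puma_worker_killer'

  PumaWorkerKiller.config do |config|
    config.ram           = 1024 # total MB budget across all workers
    config.frequency     = 20   # seconds between memory checks
    config.percent_usage = 0.98 # restart workers above this fraction of ram
  end
  PumaWorkerKiller.start
end
```

Set a budget like `ram` too low and workers restart constantly, evicting warm caches; the symptom shows up as latency on a dashboard rather than as an obvious error.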

Serendipitous impact isn’t planned; and isn’t necessarily hard work. It’s about paying attention (being present), keeping a curious mindset, and sharing the insight in a way that colleagues are open to receiving.

How have you sponsored other engineers? Is sponsoring other engineers an important aspect of your role?

Certainly! As a Principal Engineer, I try to be an enthusiastic and conspicuous first follower when other engineers are introducing important new practices. Some examples are when colleagues demoed React snapshot testing and local development with Docker. After each demo, I’d ask how I could try it out and see the benefits for myself. Then I’d look for other teams and in-flight projects where we could apply these practices to get wider adoption.

I also ‘cheerlead’: recognizing a colleague’s valuable effort in public or in a small group, even if the outcomes aren’t tangible yet. It could be complimenting a team that was thorough and reflective during a difficult Learning Review; praising an engineer who reproduced a tricky race condition; or thanking someone who documented a poorly understood process.

I aim to serve two purposes with cheerleading: recognizing those doing the valuable behavior, and giving positive reinforcement in the hopes that the team does more of it. It’s really operant conditioning, but cheerleading sounds much nicer.

What about a piece of advice for someone who has just started as a Staff Engineer?

Other engineers look up to you as a role model, some in ways you may not expect. They’ll emulate your coding style, your tone in code reviews, your behavior in meetings, your rationale for making decisions, and the way you treat colleagues.

It can feel like a lot of responsibility to be perfect all the time. But it can also bring clarity to your work: do your best, acknowledge shortcomings, be generous and curious.


January 18, 2021 — Tags: programming, documentation

A well-crafted GitHub pull request can be a powerful way to show others how to extend and maintain a component. These ‘Exemplary’ PRs highlight the code and practices you want others to emulate.

A few years ago, my Platform team was implementing a new GraphQL API. We found engineers needed a lot of support and code review to add new mutations to our app. One of our lead engineers used a new mutation as an opportunity to create an exemplary PR.

The exemplary PR for a GraphQL mutation showed (see the sketch after this list):

  1. The new class to create and interface to implement
  2. How to register the new mutation with the server
  3. How to handle authentication/authorization
  4. How to validate the object and handle validation errors
  5. Instructions for how to test the mutation locally, what automated tests to create, and how to manage test state
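
That PR isn’t public, but here’s a minimal sketch of the shape such a mutation can take, assuming the graphql-ruby gem and Rails conventions; every name here (CreateReview, Types::ReviewType, and so on) is hypothetical:

```ruby
# app/graphql/mutations/create_review.rb -- hypothetical example
module Mutations
  # 1. The new class; BaseMutation comes from graphql-ruby's Rails generator.
  class CreateReview < BaseMutation
    argument :product_id, ID, required: true
    argument :body, String, required: true

    field :review, Types::ReviewType, null: true
    field :errors, [String], null: false

    def resolve(product_id:, body:)
      # 3. Authorization: require a signed-in user.
      raise GraphQL::ExecutionError, 'Sign in required' unless context[:current_user]

      review = Review.new(product: Product.find(product_id),
                          author: context[:current_user],
                          body: body)

      # 4. Validate, and return structured errors instead of raising.
      if review.save
        { review: review, errors: [] }
      else
        { review: nil, errors: review.errors.full_messages }
      end
    end
  end
end

# 2. Register the mutation with the server, in Types::MutationType:
#   field :create_review, mutation: Mutations::CreateReview
```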

It turned out to be a highly leveraged effort! As we pointed engineers to the exemplary PR, they were able to easily create high-quality mutations while needing less support from the Platform team.

Recently, I had the opportunity to help create another exemplary PR. Our SRE team wanted an easy way for Eng Managers to maintain their teams’ PagerDuty on-call schedules using Terraform. We created a simple pagerduty_team module that required only a few parameters, like the team’s name and a list of on-call members’ emails. That way managers didn’t need to learn a pile of Terraform provider details just to maintain their on-call rotations.
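
The module’s actual interface isn’t shown here, but as a hypothetical sketch of what calling it could look like (the module path, parameter names, and emails are all invented):

```hcl
# terraform/oncall/checkout.tf -- hypothetical example
module "checkout_oncall" {
  source = "../modules/pagerduty_team"

  name = "Checkout"

  # Rotation order; each email must match a PagerDuty user account.
  oncall_emails = [
    "ada@example.com",
    "grace@example.com",
    "katherine@example.com",
  ]
}
```

Behind an interface like this, the module can own the underlying pagerduty_team, pagerduty_schedule, and pagerduty_escalation_policy resources, so managers never touch provider details directly.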

I worked with an EM to craft an exemplary PR, adding her team’s rotation, and being sure to add explanatory comments about how our CI/CD pipeline applies the changes. As other EMs asked how to set up their on-call schedule, we’d just send a link to that PR. It was obvious what values to substitute.

To be sure, we had more documentation about our Terraform setup, but making the PR the one-stop shop meant EMs could get their rotations set up in minutes without much reading or back-and-forth.

Engineers naturally look for similar code in a repository they can use as a starting point for new features. Creating and labeling exemplary PRs is a helpful way to highlight the code you want them to emulate.


December 31, 2020 — Tags: management, career, personal

In late 2019, I was burnt out in my Director of Engineering role. I spent several sessions with a career coach outlining my professional challenges. Teams lurched from crisis to crisis. Various teams either lacked a coherent strategy, or lacked the alignment or resources to execute it effectively. Frequent confusion about roles and responsibilities caused tension. I didn’t have the resources to fix it all.

My coach finally asked:

“What would let you approach each day with energy and optimism?”

The question felt like reaching a vista after a long hike. My mood lifted as answers leapt to mind. I love being a small part of a big success. I love coaching and cheerleading colleagues working on something difficult and important. I love pairing—learning and teaching simultaneously—and fist pumping when we track down a bug. I’d be interested and excited to tackle each of my company’s particular socio-technical challenges in a focused, disciplined way. But to make time for that, I needed to significantly change my role.

I shared the revelation with my manager, and a few short weeks later, I handed off management responsibilities to a colleague. I became a Principal Engineer rather than a Director. I’ve spent the past year mostly as an individual contributor, and mostly loving my work.

My coach’s question has become my mantra as I set my daily intentions. It’s honed my ability to focus on where I can make meaningful progress, and let go of the rest. It helps me orient my schedule around what’s important rather than what’s urgent.

In 2020, COVID and an immunocompromised family member upended my daily routines. My household navigated remote schooling and daycare with two working-from-home parents. Throughout these changes, I’m thankful for many blessings. In particular, I’m thankful for this mantra, which helped me adapt to new roles at work and at home. It’s improved my satisfaction both at work and with my family.

As I think of goals and intentions for the new year, I’m asking myself, “What could I work on with genuine energy and optimism?”