ktheory: Aaron Suggs’s blog

May 16, 2021 — Tags: programming, release engineering

My team has been discussing the role of various test and development environments. We’d like to provide guidance for what developers should test locally on their laptop, on an ad hoc deployed environment, on a pre-prod environment, and on production.

I’d like to share some criteria that help me organize the value and purpose of various environments.

Let’s start with 3 key features of an ideal dev env:

  • Fast feedback. It should be as quick as possible to change some code in your editor and see the effect in your dev environment.
  • Similar to production. The environment should be as similar as possible to the full production system. There are several aspects in which it may be similar or different, such as the infrastructure it runs on, configuration, data, and traffic load.
  • Isolated. Any side effects and failures should be kept from affecting actual users, or even your colleagues (e.g. minimize the time that one person breaking the build blocks teammates).

In practice, fast, similar, and isolated aren’t so much features as continuous dimensions that we try to maximize. We can carve out roles for various dev envs by considering the relative importance of each dimension.

Local development

For local development environments (i.e. code running on your laptop), I’d rank the importance as follows:

  1. Isolated
  2. Fast
  3. Similar

In other words, it’s most important that local envs are isolated from breaking anything on production or anyone else’s environments. The 2nd priority is fast developer feedback as long as it doesn’t compromise isolation. And the 3rd priority is being production-like, as long as it doesn’t compromise isolation or fast feedback.

Features like Webpack’s Hot Module Replacement and React Hot Reloading improve feedback time, but detract from being production-like. That’s a win for local development, since ‘Fast’ is more important than ‘Similar’.

By similar reasoning, local development is a good place to run uncommitted code, or to dynamically generate assets that would be immutable deploy artifacts on production.

Testing on production

What about practices that let you more safely test on production, like feature flags and blue-green deployments? I see the ranking as:

  1. Similar
  2. Isolated
  3. Fast

‘Similar’ is de facto top priority since it is production. Next up, our goal is to isolate failures and unintended side effects as much as possible. And finally, we want fast feedback as long as it doesn’t compromise isolation.
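
As a rough illustration of the feature-flag approach, here is a minimal Python sketch of a percentage rollout. The function names, the flag name `new_checkout`, and the hashing scheme are all assumptions made for this example, not any particular flag library’s API.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True when the user falls inside the flag's rollout percentage."""
    return rollout_bucket(flag_name, user_id) < rollout_percent


def checkout(user_id: str, rollout_percent: int) -> str:
    # Both code paths are deployed to production; the flag (hypothetical name)
    # limits which users can reach the new one, which isolates failures.
    if is_enabled("new_checkout", user_id, rollout_percent):
        return "new checkout flow"
    return "existing checkout flow"
```

With `rollout_percent` at 0, every user stays on the existing path. Raising it widens exposure gradually, and setting it back to 0 turns the new path off again without changing the code.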

Other deployed environments

Where does that leave environments like staging, QA, or other quasi-production-like environments? For decades, they’ve been a middle ground between local development and production.

As release engineering and local development tooling improve, I’m finding fewer reasons to maintain them. More likely, I’m going to invest in ways to gain confidence in my code locally, or build ways to safely test it on production.

Let’s recall the aspects in which an environment can be production-like: infrastructure (as in the CPU and memory resources, operating system, and system libraries), configuration, data, and traffic.

Years ago, infrastructure and configuration were frequent sources of bugs. Code might work on a developer’s macOS laptop, but not on a Linux server. Or we forgot to set all the environment variables we expected. Staging environments were a critical place to suss out those bugs. Lately, infrastructure-as-code tooling and better configuration patterns, with tools like Terraform, CloudFormation, and Docker, have made these issues rare.
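
One configuration pattern that catches the forgotten-environment-variable class of bug is to validate settings at startup, so a missing value fails immediately in any environment. The sketch below is only illustrative: the variable names `DATABASE_URL` and `SECRET_KEY` and both function names are hypothetical.

```python
import os
from typing import List, Mapping, Optional, Sequence

# Hypothetical settings this example app expects to find in its environment.
REQUIRED_VARS = ("DATABASE_URL", "SECRET_KEY")


def missing_settings(env: Mapping[str, str],
                     required: Sequence[str] = REQUIRED_VARS) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]


def check_settings(env: Optional[Mapping[str, str]] = None) -> None:
    """Raise at startup if any required variable is missing."""
    if env is None:
        env = os.environ
    missing = missing_settings(env)
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

Calling `check_settings()` as the first step of application startup surfaces the problem at boot on a laptop, in CI, or on production, rather than at the first request that needs the value.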

Most bugs I see on production today are related to data (i.e. unexpected states) or traffic (unexpected resource contention or race conditions). Those are particularly difficult to suss out in non-production environments.

Sometimes creating these non-production integration environments means adding and maintaining new code paths. For example, for its sandbox environment, Stripe maintains special payment cards that always succeed or return predictable errors. That behavior is unique to the sandbox environment and isn’t found on production. In order to be useful for isolated testing, they had to compromise on being similar to production. As I think about a constellation of microservices that could make up a complete test environment, the support cost of these alternate code paths can add up quickly.
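
To make the cost of an alternate code path concrete, here is a hypothetical sketch of a payment service with a sandbox-only branch. It is not Stripe’s implementation; the `charge` function, the `environment` parameter, the outcome strings, and the card numbers are placeholders invented for illustration.

```python
from typing import Callable, Optional

# Placeholder test cards mapped to canned outcomes (sandbox only).
SANDBOX_CARD_OUTCOMES = {
    "4242424242424242": "succeeded",
    "4000000000000002": "card_declined",
}


def charge(card_number: str,
           amount_cents: int,
           environment: str,
           live_processor: Optional[Callable[[str, int], str]] = None) -> str:
    """Charge a card, using canned outcomes when running in the sandbox."""
    if environment == "sandbox":
        # This branch never runs on production, so it has to be written,
        # tested, and maintained separately from the real payment path.
        return SANDBOX_CARD_OUTCOMES.get(card_number, "invalid_test_card")
    if live_processor is None:
        raise ValueError("a live processor is required outside the sandbox")
    return live_processor(card_number, amount_cents)
```

Every service in a full test environment that needs predictable behavior would carry a branch like this one, and that is where the support cost multiplies.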

For SRE / Platform / Release Engineering teams tasked with supporting developers across the entire delivery lifecycle, we must choose where our attention can have the most impact for the organization. I’m finding that, more and more often, the focus is on fast local development and safe production releases, and there are fewer reasons for maintaining non-production deployed environments.

Check out “The value of reliable developer tooling” for some of my prior work on dev envs.