September 7, 2021 · 1 minute to read · Tags: site reliability, programming, featured

Below is the mental model I use when designing or reviewing web services so that they fail gracefully. It’s useful to enumerate common failure types and ensure that engineers intentionally plan for each one.

For each network call the service makes (to a data store or API), consider how the service would behave if:

  • the call immediately returns an error. Do you have appropriate error handling, instrumentation, and retry logic?
  • the call never returns. Do you have an appropriate timeout? Can you determine whether the network call should be retried? Ruby’s Unicorn web server has a concise introduction to application timeout considerations.
  • the responses are temporarily incorrect. Do you have logging and instrumentation to figure out which data are affected?
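
As a rough illustration of the questions above, here is a minimal Python sketch of a wrapper around a single network call. Everything in it is hypothetical rather than taken from a particular library: the `call_with_retries` helper, the `TransientError` marker class, and the default timeout, attempt count, and backoff values are assumptions chosen for the example. It also assumes the wrapped operation enforces the timeout it is given and is safe to repeat (idempotent); if it is not, retrying after a timeout could apply the same change twice.

```python
import logging
import time
import uuid

logger = logging.getLogger("resilient_call")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retries(operation, *, timeout_s=2.0, max_attempts=3,
                      base_delay_s=0.1, sleep=time.sleep):
    """Run operation(timeout_s), retrying transient failures with backoff.

    `operation` is any callable that accepts a timeout in seconds and
    enforces it itself, for example by passing it to its HTTP or
    database client. Errors other than TransientError and TimeoutError
    are treated as permanent and propagate immediately.
    """
    # One id per logical call, so every attempt can be found in the logs
    # later when working out which requests or data were affected.
    call_id = uuid.uuid4().hex
    for attempt in range(1, max_attempts + 1):
        started = time.monotonic()
        try:
            result = operation(timeout_s)
        except (TransientError, TimeoutError) as exc:
            elapsed = time.monotonic() - started
            logger.warning("call_id=%s attempt=%d failed after %.3fs: %r",
                           call_id, attempt, elapsed, exc)
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
        else:
            elapsed = time.monotonic() - started
            logger.info("call_id=%s attempt=%d succeeded in %.3fs",
                        call_id, attempt, elapsed)
            return result
```

A caller would wrap each outbound request in a small function that takes the timeout and hands it to its client, then pass that function to `call_with_retries`. The log lines share a `call_id`, which is one simple way to trace which calls were affected if responses later turn out to have been wrong.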

By addressing these three questions, you build a solid foundation for a reliable, maintainable web service.


Hi, I'm Aaron Suggs. 😀👋

Welcome to my personal blog. I manage engineering teams at Instructure; previously Lattice, Glossier, and Kickstarter. I live in Chapel Hill, NC. Find me on LinkedIn and GitHub.