ktheory
Aaron Suggs’s blog

September 7, 2021 — Tags: site reliability, programming

Below is the mental model I use when designing or reviewing web services to fail gracefully. It’s useful to enumerate common failure types, and ensure engineers intentionally plan for them.

For each network call the service makes (to a data store or API), consider how it would behave if:

  • the call immediately returns an error. Do you have appropriate error handling, instrumentation, and retry logic?
  • the call never returns. Do you have an appropriate timeout? Can you determine if the network call should be retried? Ruby’s Unicorn web server has a concise introduction to application timeout considerations.
  • the responses are temporarily incorrect. Do you have logging and instrumentation to figure out which data are affected?
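To make the first two questions concrete, here is a minimal sketch of a wrapper that applies a timeout, retries only failures it knows are retryable, and logs every attempt. It is an illustration, not a recommendation of specific values: `call_with_retries`, `TransientError`, and all the parameter names and defaults are hypothetical, and it assumes the wrapped operation accepts a timeout in seconds and is safe to repeat (idempotent).

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("resilient_call")


class TransientError(Exception):
    """Hypothetical marker for failures the caller considers safe to retry."""


def call_with_retries(
    operation: Callable[[float], T],
    *,
    timeout_s: float = 2.0,       # assumed default, tune per dependency
    max_attempts: int = 3,        # assumed default
    base_delay_s: float = 0.1,    # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(timeout_s), retrying retryable failures with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            result = operation(timeout_s)
        except (TransientError, TimeoutError) as exc:
            # Instrumentation: record every failed attempt, not just the last.
            logger.warning("attempt %d/%d failed: %r", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
        else:
            logger.info("attempt %d/%d succeeded", attempt, max_attempts)
            return result

    raise AssertionError("unreachable")  # the loop always returns or raises
```

Any other exception type propagates immediately without a retry, which is one way to answer "should this call be retried?" explicitly rather than by accident. The `operation` argument would typically be a small function that wraps your client call, passes the timeout through, and raises `TransientError` for the failures you have decided are retryable.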

By addressing these three questions, you’ve built a solid foundation for a reliable, maintainable web service.