ktheory — Aaron Suggs’s blog

July 28, 2021 — Tags: incidents, site reliability

I appreciate reading stories of how complex software systems fail and the hard-earned lessons to make them more resilient. In fact, one of my favorite software interview questions is “tell me about a time you were involved in a production incident.”

Here is one of my personal favorite incident anecdotes. It sticks out because of the cognitive bias that slowed our diagnosis, and how thoroughly we were able to prevent similar incidents in the future.

Key lessons

  1. It’s useful to reconfirm basic assumptions if Subject Matter Experts are stumped.
  2. Listen to all the voices in the room.
  3. Thorough remediation means mitigating future failures in multiple independent ways.

Setting the scene

It was early 2015 at Kickstarter. Our Rails app used 3 memcached servers running on EC2 as a read-through cache. We were expecting a high-visibility project to launch in the coming days, so per our standard practice, we scaled up our unicorn app processes by 50%. In this case, that meant going from 800 to 1200 unicorn workers.

In prior months, we’d been battling DDoS attacks, so I was primed to expect unusual app behavior to be a new type of abusive traffic.

The incident

Out of the blue, our team was paged that the site was mostly unresponsive. A few clients could get a page to load within our ~60 second timeout, but more often clients got a 504 gateway timeout error. Several engineers, including myself, joined our incident Slack channel to triage.

Digging into our APM dashboards, we saw that the public stats page was saturating our database CPU with slow queries, which meant our unicorn web workers hung waiting on the DB to render pages.

That was strange: while the stats queries were slow, we kept the cache warm with a read-through and periodic write-through strategy. If the results fell out of cache, the page should have hung for just a few seconds, not caused site-wide impact for several minutes.
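The read-through pattern can be sketched in a few lines of plain Ruby (this is an illustrative sketch, not the actual Kickstarter code; in Rails it would typically be `Rails.cache.fetch` backed by a memcached client):

```ruby
# Minimal read-through cache sketch. On a miss, run the expensive query
# and store the result; subsequent reads are served from the cache.
# A periodic write-through job would refresh hot keys before they expire.
class ReadThroughCache
  def initialize
    @store = {} # stand-in for a memcached client
  end

  # Return the cached value, or compute and cache it on a miss.
  def fetch(key)
    hit = @store[key]
    return hit unless hit.nil?
    value = yield # the slow DB query runs only on a miss
    @store[key] = value
  end
end

cache = ReadThroughCache.new
queries = 0
slow_stats_query = -> { queries += 1; { backed_projects: 42 } }

cache.fetch("stats") { slow_stats_query.call } # miss: runs the query
cache.fetch("stats") { slow_stats_query.call } # hit: served from cache
queries # => 1
```

With memcached gone, every request became a miss, so every page load ran the slow query path and piled onto the database.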

“It’s as if memcached isn’t running,” said one prescient engineer. I ignored the comment, too deep in my own investigation. Memcached doesn’t crash, I thought. It must be a bug in our app, or some clever new denial-of-service vector to generate DB load.

After roughly 40 minutes of fruitless head scratching, the prescient engineer piped in, “I ssh’ed into one of the cache servers, and memcached isn’t running.”

If we’d had an Incident Manager role, we’d likely have checked memcached sooner.

Biggest. Facepalm. Ever.

The fix

Moments after we confirmed memcached wasn’t running, we restarted it with /etc/init.d/memcached restart, and the site recovered within a few seconds.

With the incident mitigated, our investigation continued. Why wasn’t memcached running? Our cache cluster had been healthy for years. The EC2 hosts were healthy. Yet each memcached process had crashed in the past few hours. Only in retrospect did we observe that the site was slightly slower as the first 2 crashed. We certainly noticed the near-complete outage when the final process crashed.

Digging through our app logs, I noticed sporadic connection errors to memcached. Apparently, the memcached servers still had the default ulimit of 1024 file descriptors. So when we scaled to 1200 app workers, only 1024 could connect, and the remaining 176 got connection errors. The Ruby memcached client would automatically attempt to reconnect every few seconds.
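Since each client connection consumes one file descriptor on the server, the arithmetic is simple, and a deploy-time sanity check could have caught it. A hedged sketch (the helper name and worker count are illustrative; `Process.getrlimit` is standard Ruby):

```ruby
# Each connected client holds one file descriptor on the memcached server,
# so the server's NOFILE ulimit caps the number of app workers it can serve.
def connection_overflow(workers, fd_limit)
  [workers - fd_limit, 0].max
end

connection_overflow(1200, 1024) # => 176 workers left without cache

# Hypothetical deploy-time check, run on the cache host:
workers = 1200
soft_limit, _hard_limit = Process.getrlimit(:NOFILE)
overflow = connection_overflow(workers, soft_limit)
warn "#{overflow} workers will fail to connect (ulimit -n is #{soft_limit})" if overflow > 0
```

In practice the overflow didn’t fail loudly; those 176 workers just churned through connect-and-retry errors, which is also what triggered the crash described next.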

I was still puzzled why memcached had crashed, so I searched through the code commits for anything mentioning “crash.” And eureka! This commit mentions exactly our issue: as clients connect and disconnect when memcached is at the ulimit’s max connections, a race condition can crash the server. The default version of memcached that came with our Ubuntu version happened to predate the fix. I was able to reliably recreate the crash in a test env.

With all this in hand, the team implemented several fixes:

  1. I ported the default init.d script to runit, our preferred tool at the time, to automatically start processes if they crash. This would make the impact of the crash negligible.
  2. We increased the ulimit to accommodate more workers. This improved latency because ~15% of our workers were effectively without cache.
  3. We upgraded memcached to patch the ulimit issue.
  4. We added an alert for when memcached isn’t running on a cache server, to reduce our time-to-detect.
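The liveness check behind that last fix can be sketched in a few lines (hypothetical code; the real alert lived in our monitoring system). memcached speaks a plain-text protocol, and a healthy server answers the `version` command with a line starting with `VERSION`:

```ruby
require "socket"
require "timeout"

# Return true if a memcached server answers the "version" command.
# A refused connection, timeout, or DNS failure all count as "down".
def memcached_alive?(host, port = 11211, timeout_s = 2)
  Timeout.timeout(timeout_s) do
    TCPSocket.open(host, port) do |sock|
      sock.write("version\r\n")
      sock.gets.to_s.start_with?("VERSION")
    end
  end
rescue StandardError
  false
end

# Illustrative usage; "cache1.internal" is a made-up hostname.
warn "memcached down on cache1" unless memcached_alive?("cache1.internal")
```

A check like this, run every minute per cache host, would have surfaced the first crash hours before the site-wide outage.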

Items 1-3 are each sufficient to prevent this particular memcached crash from having a significant impact on our customers.

This was the first and only incident with memcached crashing in my 7 years at Kickstarter.

Wrapping up

This incident taught me to be a better listener to all the voices in the room, even if it means questioning assumptions that have served me well before.

And it taught me to be tenacious in tracking down causes for failures, rather than stopping at the first sufficient mitigation. Reading the source code can be fun and rewarding!