When you’re on call to help run an online service and you get paged, the page might go a few different ways. Maybe the alert is something you’re familiar with: you take the standard steps, make a note to bring it up in the next retrospective in hopes that your team can decrease its frequency, and then go back to whatever you were doing. Sometimes the alert is something odder, or the standard steps don’t work; then you dig into the logs, figure out what’s going on and how bad it is, and, after a bit of looking, you get a handle on the issue and the system goes back to normal.

But sometimes that alert is a sign of something rather different. A serious problem is appearing, one that’s already starting to affect your users and whose effects could grow if you don’t get on top of it immediately. You don’t really understand what’s going on and you feel like you’re starting to lose control: an outage is brewing. I want to talk about that third situation here.

The short version of this post is: in these outage or potential outage situations, your psychology is probably working against you. The tradeoffs in an outage situation are different from those in a normal on-call situation and from normal non-on-call work; instead of having one or two people digging deeply into a problem, you want a team of people looking at the problem broadly. And that in turn creates a need for coordination; take that coordination role seriously as well.

 

Normally, when you’re working as an engineer, you’re trying to understand something and deal with it as efficiently as possible. You’ve got a guess as to what’s probably going on in a given situation and what to do. So you spend some time following where that guess leads, making sure that that guess is the right one. If it is, great; if not, you try out other possibilities. If you get really stuck, you’ll ask a team member for help, but you don’t want to interrupt them unless you really need to: they have work to do as well, after all, and if it’s outside of regular work hours, then they have lives to lead. So you dot your i’s and cross your t’s before asking for help.

In an outage or potential outage situation, however, almost all of that is wrong. There’s a core that remains relevant: you don’t want to make things worse, so you certainly need some level of care. But, given that constraint, your goal should be to get the situation under control as quickly as possible: fixing the problem is a higher priority than completely understanding the situation, and it’s a higher priority than not bothering other people. (Don’t get me wrong: understanding is good! But, as long as you can fix the situation safely, it’s okay if the understanding happens after the fact.)

Part of the implication here is that, in an outage situation, you should be willing to take remediation actions even when you don’t have a clear link between those actions and the cause of the outage. For example, if a relevant service was recently upgraded, try downgrading it; if some feature flags were recently flipped, try setting them back to their prior state. This isn’t always the right course of action: if the service is one where upgrades are problematic for some reason, then there might be a noticeable risk that downgrading will make things worse. But, in the normal happy case where downgrades are fast and safe, you might want to just go ahead and do that: it almost certainly won’t make the situation worse, and it might make the situation better.
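To make the “flip the flags back” idea a little more concrete, here’s a minimal sketch in Python. The flag client here (`flag_client`, with its `recent_changes` and `set` methods) is a hypothetical stand-in rather than any real library’s API; the point is only that this kind of speculative rollback can be mechanical, with a dry run first so you can see exactly what you’re about to change.

```python
# Minimal, hypothetical sketch: revert feature flags that changed recently.
# `flag_client` and its methods are illustrative assumptions, not a real API.
from datetime import datetime, timedelta, timezone

def revert_recent_flag_changes(flag_client, window_minutes=60, dry_run=True):
    """Flip back any flag that changed within the last `window_minutes`."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=window_minutes)
    for change in flag_client.recent_changes(since=cutoff):
        print(f"{change.flag}: {change.new_value!r} -> {change.old_value!r}")
        if not dry_run:
            flag_client.set(change.flag, change.old_value)

# Review the proposed reverts first (dry_run=True), then apply:
# revert_recent_flag_changes(my_flag_client, window_minutes=90, dry_run=False)
```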

So part of the answer is that, in these uncertain situations, you want to take safe but speculative actions more often than you would otherwise. But an even bigger part of the answer is that a straight-line approach of having one person dig into their best idea of what’s going on is not going to be the quickest way to reach a solution: instead, you want to take a set-based approach.

 

In our story above, the on-call is digging into their best guess as to what’s causing the problem. And that’s a good thing to do, no question! But they probably have a second and third guess as to what might be relevant; it would be great if somebody else could be looking at those possibilities as well. And, as I mentioned above, there might be some speculative remediation actions that somebody should try out. (“Have you tried turning it off and on again?”) But if the problem is mysterious enough, then none of that will work: so ideally you’d also be digging into the logs to see what unexpected bits of evidence are lurking in them, and you’d also want to look at recently deployed changes and recent feature flags to see if they provide any clues. And there’s probably somebody in your organization who’s been around a while and has seen all sorts of things, who is unusually good at figuring out what’s going on when the system is behaving strangely; it would be great if that person were here to help.

That’s a lot of different things to try! (Which is why we call this a set-based approach: you’ve got a set of actions to undertake, and you’d like to perform as many of them in parallel as possible.) So, in this outage / impending outage situation, you need to shift from a mindset of “the on-call will deal with this” to one of “we need a bunch of people looking at things and digging into different possibilities”; doing so will dramatically shrink the resolution time. And this, in turn, changes the problem into one of group problem solving and coordination.

 

Coordinating a group is hard work, and it requires specialized skills. You could ask the on-call to switch into that coordination mode, but that kind of mindset shift is hard. So, at my current job, we have a separate rotation for this (a rotation of senior engineers and managers pulled from different engineering teams), called the “Incident Response Coordinator” (IRC) rotation. The IRC’s job is to help coordinate the response to incidents that are already serious or are showing signs that they will get serious, with the goal of getting the incident resolved safely as quickly as possible.

A big part of their job is to implement the set-based approach. The first step is finding an appropriate group of people to work on the problem: at first, maybe it’s just the on-call and the secondary, but by the time the IRC is involved, it’s usually pretty clear that more people are needed. So one of the things the IRC does is notify other team members that they should join in if available, giving those people a summary of the problem and a pointer to the Zoom meeting where people are gathering. And once they’ve gathered, the IRC makes sure that they’re digging into different aspects of the problem.

Another part of the IRC’s job is to shield those people from distractions while making sure that necessary tasks get taken care of. The on-call and most of the other people working on the problem are trying to focus on understanding the problem and digging into one aspect or another of what’s going on; but if the incident is turning into an outage, then there are other things that somebody needs to do. (We should post on the status page to let users know of the issue and to keep them updated, we should check with support to get a sense of the impact and to help them answer questions coming in from users, etc.) And maybe there are other things that could be done to shield the people who are digging into the issue: for example, maybe the on-call needs to spend their time really digging into some aspect of the problem but is being interrupted by a flood of alerts; if that’s the case, can we find somebody else on the team to temporarily hold the pager?

A third part of their duties is to make sure that the overall flow of the response looks right. The IRC should never be working directly on solving the problem: in fact, if the IRC is somebody whose domain expertise would be particularly useful for the incident in question, then they should find somebody else to wear the IRC hat instead! But the IRC can make sure that the group of people working on the problem is expanding appropriately (from a few team members, to the whole team that owns the service in question, to adjacent teams, to random other engineers who are particularly good problem solvers); and the IRC can make sure that people aren’t getting so wrapped up in trying to understand the problem that they aren’t putting potential fixes in place. (Also, as a side note: once a potential fix is in place, people should switch to figuring out how to validate the fix as quickly as possible and, if the signals we’re looking for show positive signs, how to turn up the dials to have the fix take effect as quickly as possible. Don’t just let the autoscaler add a few nodes: double the cluster size, and if that helps, double it again!)

 

For really complex problems, I also like to have the IRC manage a list of actions and hypotheses that we’re tracking. Every time somebody comes up with an idea of some sort, stick it on the list; if somebody is at loose ends, have them grab an item off the list and put their initials next to it; once they’re done looking into it, cross it off (and add any new items that their investigation turns up).

Also, make sure that there’s some level of written notes about what we’ve come up with. During the outage, what’s important is to solve the problem, not to understand every aspect of it; but the next day the team is going to want to start digging in deeper to figure out what was really going on, so they can put further safeguards in place for the future. So the IRC should make sure that key observations are written down somewhere, to help with that further investigation.

 

Done well, this can make a real difference: it can turn a multi-hour outage into one that lasts less than an hour or even into one that barely qualifies as an outage at all. But doing this right takes effort and skill. You need to recognize that individual engineers’ instincts will lead them to behave in ways that aren’t optimized for quickly bringing the problem to resolution; you need to instead bring in a group of people who are all working on different aspects of the problem in parallel; and you need to focus on the metric of safely resolving the situation as quickly as possible.
