Incident Response OODA

Digital Operations Incident Response

It’s 2am. You’re sound asleep dreaming of marshmallow clouds, taco cats, and rainbow unicorns. Suddenly, your pleasant dreams are shattered by the piercing notes of your phone’s alert tone. Your eyes snap open but your brain isn’t fully engaged yet. You look at the screen and the funny shapes slowly resolve into letters, then words, and finally thoughts. Your site is down! Paralyzed by the shock of the rude awakening, you’re not sure what to do. Is it a problem with the order manager again? Is the hosting provider down? Did someone deploy code that broke everything? Did the payment gateway break? What do you do? Where do you look?

These are the times when you’re grateful that you have a standard incident response plan in place that provides clarity and actionable tasks when an incident occurs.

A good incident response plan looks like the OODA loop. OODA stands for Observe, Orient, Decide, Act. The concept comes from US Air Force Colonel John Boyd, a master strategist and tactician who, after retiring from the USAF, distilled his experience into the OODA loop. His goal was to create a framework that helps individuals and organizations make decisions in an uncertain environment. Though born in military strategy, it provides a fine framework for Digital Operations Incident Response.

Let’s bring this high concept down to the reality of Digital Operations Incident Response. First, we observe. What error generated the alert? What does that error mean in context? What outside influences are affecting your operation right now? Does this alert have a business impact? Are all users of the platform impacted, or just a specific subset? These observations are important first steps toward resolving the issue.

Next, we orient to the situation. How does the current situation compare to the expected state of the platform? Have we dealt with a similar situation in the past? Have we made any recent changes to the platform? Did something change “upstream”?

Based on the feedback from the previous two steps, we can decide the action to take. We record the fact that we made the decision and the factors that went into that decision. We can also decide who will act and when that action will take place.

Finally, we act based on the decision from the previous step.

If we’ve done things right, that action will have changed the situation, so we loop back to the top of the OODA loop to observe the new situation. If we’ve mitigated or resolved the issue, then we’re done. If the issue persists, we step back through the process. We continue this until the incident is resolved.
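The loop described above can be sketched in code. This is a minimal illustration, not a real tool: every function name here (`observe`, `orient`, `decide`, `act`) is a hypothetical placeholder for your own monitoring and runbook tooling.

```python
# A sketch of the OODA loop applied to incident response.
# All function and field names are illustrative placeholders.

def observe(incident):
    """Gather facts: the alert, its context, and its business impact."""
    return {"alert": incident["alert"], "impact": incident.get("impact", "unknown")}

def orient(observations, history):
    """Compare the current situation to past incidents we've handled."""
    return {"seen_before": observations["alert"] in history, **observations}

def decide(assessment):
    """Choose an action, recording the decision and the factors behind it."""
    action = "apply_known_fix" if assessment["seen_before"] else "investigate"
    return {"action": action, "rationale": assessment}

def act(decision, incident):
    """Execute the chosen action; here we only simulate the outcome."""
    if decision["action"] == "apply_known_fix":
        incident["resolved"] = True
    else:
        # Investigating teaches us about this alert for the next pass.
        incident["history"].add(incident["alert"])
    return incident

def respond(incident, max_loops=10):
    """Loop back to the top until the incident is mitigated or resolved."""
    history = incident.setdefault("history", set())
    for _ in range(max_loops):
        if incident.get("resolved"):
            break
        observations = observe(incident)
        assessment = orient(observations, history)
        decision = decide(assessment)
        incident = act(decision, incident)
    return incident
```

The key design point is the loop itself: each action changes the situation, so we re-observe rather than assume our first decision was right.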

That’s the basics of the process. In practice, it’s a bit more complicated. Often, we have multiple outcomes that must be met before we can call an incident fully resolved. A few of the major concerns: we need to be sure we’ve documented everything; we need to understand whether we’ve simply mitigated the incident or whether more work is required to fully resolve it; and we must communicate details of the incident to stakeholders.

We also need to have a process in place to handle situations where the initial incident response is insufficient, and the response must be escalated to bring in more resources to help mitigate the situation.

To make sure we get all these points covered, we have to cross our OODA loop with a larger problem management process. The problem management process consists of three steps: Identify, Mitigate, and Resolve. Through all of this, we include communication and escalation steps.

Escalation is required at any point in the incident response process when the current responders are unable to effectively execute the OODA loop. When this happens, they need a clear path for asking for help. A plan that defines the escalation path is a key component of giving your incident responders confidence: they need to know that help is available when they get stuck.
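An escalation path can be as simple as a list of tiers with time windows. The sketch below is one hypothetical shape for such a policy; the tier names, contacts, and time limits are all invented for illustration.

```python
from datetime import timedelta

# A hypothetical escalation policy: who owns the incident after a
# responder has been stuck for a given amount of time.
ESCALATION_PATH = [
    {"tier": "on-call engineer", "contact": "oncall@example.com",
     "escalate_after": timedelta(minutes=15)},
    {"tier": "team lead", "contact": "lead@example.com",
     "escalate_after": timedelta(minutes=30)},
    {"tier": "engineering manager", "contact": "manager@example.com",
     "escalate_after": timedelta(hours=1)},
]

def next_responder(minutes_stuck):
    """Return the tier that should own the incident after being stuck
    this long; the last tier catches anything beyond the defined windows."""
    elapsed = timedelta(minutes=minutes_stuck)
    for level in ESCALATION_PATH:
        if elapsed < level["escalate_after"]:
            return level
    return ESCALATION_PATH[-1]
```

Writing the path down like this, before any incident happens, is what gives responders the confidence that help is a known step away rather than an improvised plea.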

Communication throughout the process of the incident response, mitigation, and resolution is key. Stakeholders and customers are keenly interested in knowing that the platform is being cared for. When things go wrong, they need to have confidence that the incident is being properly handled and that they will be back up and running as soon as possible. Regular communication helps keep stakeholders informed and confident in the process and the platform.

Incident response ends when an incident is mitigated, meaning the condition that created the incident is no longer present. Sometimes this means that service is restored, but the conditions that created the incident could return to create a new incident. This is when the incident is documented and re-packaged as a problem to be resolved by more changes to the platform. Once the platform is made resistant to the conditions that created the incident, the problem is considered resolved.

The Internet is a dynamic place. Having a solid incident response plan in place brings stability and confidence to your platform. Incident responders will know what to do and you will have a solid process in place that will improve your platform.

Ben Vaughan
Maker of Things That Make Things

My interests include DevSecOps, CI/CD, observability, and incident response.