This post is an update on a previous post about how Heroku handles incident response.
As a service provider, when things go wrong, you try to get them fixed as quickly as possible. In addition to technical troubleshooting, there’s a lot of coordination and communication that needs to happen in resolving issues with systems like Heroku’s.
At Heroku we’ve codified our practices around these aspects into an incident response framework. Whether you’re just interested in how incident response works at Heroku, or looking to adopt and apply some of these practices for yourself, we hope you find this inside look helpful.
We describe Heroku’s incident response framework below. It’s based on the Incident Command System used in natural disaster response and other emergency response fields. Our response framework and the Incident Commander role in particular help us to successfully respond to a variety of incidents.
When an incident occurs, we follow these steps:
The on-call engineer assesses the issue and decides whether it's worth investigating further.
The Incident Commander (IC) creates a new room in Slack to centralize all the information for this specific incident.
Our customers want information about incidents as quickly as possible, even if it is preliminary. As soon as possible, the IC designates someone to take on the communications role (“comms”) with a first responsibility of updating the status site with our current understanding of the incident and how it’s affecting customers. The admin section of Heroku’s status site helps the comms operator to get this update out quickly.
The status update then appears on status.heroku.com and is sent to customers and internal communication channels via SMS, email, and Slack bot. It is also posted to Twitter.
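As a rough illustration of this fan-out, here's a minimal sketch of publishing one status update and distributing it to each subscribed channel. The names and shape are hypothetical, not Heroku's actual implementation:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical sketch: one status update is written once, then fanned
# out to every subscribed channel (SMS, email, Slack bot, Twitter).

@dataclass
class StatusUpdate:
    title: str
    body: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def distribute(update, channels=("sms", "email", "slack", "twitter")):
    """Return one (channel, message) pair per subscribed channel."""
    message = f"[{update.title}] {update.body}"
    return [(channel, message) for channel in channels]
```

The key property is that comms writes the update exactly once and the tooling handles delivery everywhere, which is what lets the first public word go out quickly.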
Next the IC compiles and sends out the first situation report (“sitrep”) to the internal team describing the incident. It includes what we know about the problem, who is working on it and in what roles, and open issues. As the incident evolves, the sitrep acts as a concise description of the current state of the incident and our response to it. A good sitrep provides information to active incident responders, helps new responders get quickly up to date about the situation, and gives context to other observers like customer support staff.
The Heroku status site has a form for the sitrep, so that the IC can update it and the public-facing status details at the same time. When a sitrep is created or updated, it’s automatically distributed internally via email and Slack bot. A versioned log of sitreps is also maintained for later review.
The next step is to assess the problem in more detail. The goals here are to gain better information for the public status communication (e.g. what users are affected and how, what they can do to work around the problem) and more detail that will help engineers fix the problem (e.g. what internal components are affected, the underlying technical cause). The IC collects this information and reflects it in the sitrep so that everyone involved can see it. The sitrep also records the incident’s severity, ranging from SEV0 (critical disruption) to SEV4 (minor feature impacted).
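To make the sitrep mechanism concrete, here's a hedged sketch of how a versioned sitrep log might be modeled, with severity running from SEV0 to SEV4 as described above. The field and class names are illustrative, not Heroku's actual schema:

```python
from dataclasses import dataclass
from enum import IntEnum

# Severity scale as described in the post: SEV0 is a critical
# disruption, SEV4 a minor feature impact.
class Severity(IntEnum):
    SEV0 = 0  # critical disruption
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4  # minor feature impacted

@dataclass
class Sitrep:
    summary: str        # what we know about the problem
    severity: Severity
    responders: dict    # who is working on it, and in what role
    open_issues: list
    version: int = 1

class SitrepLog:
    """Versioned log of sitreps: updates append, never overwrite."""

    def __init__(self):
        self.revisions = []

    def publish(self, sitrep):
        sitrep.version = len(self.revisions) + 1
        self.revisions.append(sitrep)
        return sitrep

    @property
    def current(self):
        return self.revisions[-1]
```

Keeping every revision, rather than editing in place, is what makes the log useful both for responders joining mid-incident and for the post-incident review later.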
Once the response team has some sense of the problem, it will try to mitigate customer-facing effects if possible. For example, we may put the Platform API in maintenance mode to reduce load on infrastructure systems, or boot additional instances in our fleet to temporarily compensate for capacity issues. A successful mitigation will reduce the impact of the incident on customer apps and actions, or at least prevent the customer-facing issues from getting worse.
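The two example mitigations above can be sketched as follows, using hypothetical names rather than Heroku's real tooling:

```python
# Illustrative sketch: pick a mitigation based on the assessed problem
# class -- shed load by putting the Platform API into maintenance mode,
# or boot extra instances to compensate for a capacity shortfall.

class PlatformState:
    def __init__(self, instances=10):
        self.instances = instances
        self.api_maintenance = False

def mitigate(state, problem):
    """Apply a temporary, reversible mitigation for the given problem."""
    if problem == "infrastructure-overload":
        state.api_maintenance = True   # reduce load on backing systems
    elif problem == "capacity-shortfall":
        state.instances += 5           # temporary extra capacity
    return state
```

Note that both mitigations are deliberately temporary and reversible; they buy time for a real fix, and unwinding them is an explicit step later in the response.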
In coordinating the response, the IC focuses on bringing in the right people to solve the problem and making sure that they have the information they need. The IC can use a Slack bot to page in additional teams as needed (the page will route to the on-call person for that team), or page teams directly.
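For teams that use PagerDuty, a page like this can be raised through the PagerDuty Events API v2. The sketch below only builds the trigger payload (sending it is an HTTP POST to the events endpoint); the routing key identifies the team's integration, which is what makes the page route to that team's on-call. The bot source name is an assumption:

```python
# Hedged sketch of a PagerDuty Events API v2 "trigger" event. The
# routing_key is per-team, so the resulting page goes to that team's
# on-call rotation. "heroku-incident-bot" is a hypothetical name.

def build_page(routing_key, summary, source="heroku-incident-bot"):
    return {
        "routing_key": routing_key,  # identifies the team's integration
        "event_action": "trigger",   # open a new incident for the on-call
        "payload": {
            "summary": summary,
            "source": source,
            "severity": "critical",
        },
    }
```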
As the response evolves, the IC acts as an information radiator to keep the team informed about what’s going on. The IC keeps track of who’s active on the response, which problems have been solved and which are still open, the current resolution methods being attempted, and when we last communicated with customers, and reflects this back to the team regularly through the sitrep mechanism. Finally, the IC makes sure that nothing falls through the cracks: that no problems go unaddressed and that decisions are made in a timely manner.
Once the immediate incident has been resolved, the IC calls for the team to unwind any temporary changes made during the response. For example, alerts may have been silenced and need to be turned back on. The team double-checks that all monitors are green and that all incidents in PagerDuty have been resolved.
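A minimal sketch of that close-out check, again with illustrative names rather than Heroku's tooling:

```python
# Before declaring the incident over, confirm that temporary changes
# are unwound: silenced alerts re-enabled, monitors green, and all
# PagerDuty incidents resolved. Returns whatever is still blocking.

def unwind_checks(alerts, monitors, incidents):
    """Return a list of items still blocking incident close-out."""
    remaining = []
    remaining += [f"alert still silenced: {a}"
                  for a, silenced in alerts.items() if silenced]
    remaining += [f"monitor not green: {m}"
                  for m, status in monitors.items() if status != "green"]
    remaining += [f"incident unresolved: {i}"
                  for i, state in incidents.items() if state != "resolved"]
    return remaining
```

An empty result means the response can stand down; anything else is a concrete to-do before closing out.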
Finally, the Production Engineering Department will tee up a post-incident follow-up. Depending on the severity of the incident, this could be a quick discussion in the normal weekly operational review or a dedicated internal post-mortem with an associated public post-mortem post. The post-mortem process often informs changes that we should make to our infrastructure, testing, and processes; these are tracked over time within engineering as incident remediation items.
As Heroku is part of the Salesforce Platform, we leverage Salesforce Incident Response and its crisis communication center when things get really bad.
If the severity decided by the IC is SEV1 or worse, Salesforce’s Critical Incident Center (CIC) gets involved. Its role is to assist the Heroku Incident Commander with customer communication and to keep executives informed of the situation. The CIC can also engage the legal teams if needed, mostly for customer communication.
In cases where the incident is believed to be a SEV0 (a major disruption, for example), the Heroku Incident Commander can also request assistance from Universal Command (UC) leadership. They help assess the issue and determine whether the incident really rises to the level of SEV0.
Once it is determined to be the case, the UC will spin up a conference call (called a bridge) involving executives, so that they have a single source of truth for following the incident’s evolution. One of the goals is that executives don’t first learn of failures from outside sources. This may seem obvious, but amidst the stress of a significant incident, when we’re solely focused on fixing a problem impacting customers, it’s easy to overlook communicating status to those not directly involved in solving it. Executives are also much better suited to respond to customer requests and keep them informed of the incident response.
The incident response framework described above draws from decades of related work in emergency response: natural disaster response, firefighting, aviation, and other fields that need to manage response to critical incidents. We try to learn from this body of work where possible to avoid inventing our incident response policy from first principles.
Two areas of previous work particularly influenced how we approach incident response:
Our framework draws most directly from the Incident Command System used to manage natural disaster and other large-scale incident responses. This prior art informs our Incident Commander role and our explicit focus on facilitating incident response in addition to directly addressing the technical issues.
The ideas of Crew Resource Management (a different “CRM”) originated in aviation but have since been successfully applied to other fields such as medicine and firefighting. We draw lessons on communication, leadership, and decision-making from CRM into our incident response thinking.
We believe that learning from fields outside of software engineering is a valuable practice, both for operations and other aspects of our business.
Heroku’s incident response framework helps us quickly resolve issues while keeping customers informed about what’s happening. We hope you’ve found these details about our incident response framework interesting and that they may even inspire changes in how you think about incident response at your own company.
At Heroku we’re continuing to learn from our own experiences and the work of others in related fields. Over time this will mean even better incident response for our platform and better experiences for our customers.