diff --git a/docs/about.md b/docs/about.md index 5b3a31e..d942f64 100644 --- a/docs/about.md +++ b/docs/about.md @@ -1,6 +1,6 @@ --- cover: assets/img/covers/incident_response_docs.png -hero: assets/img/headers/pagerduty_logo.png +hero: assets/img/headers/iStock-1097331490-3992x2242-e4f3f2d.png hero_alt_text: PagerDuty --- This site documents parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. diff --git a/docs/after/post_mortem_process.md b/docs/after/post_mortem_process.md index 4751766..52a7713 100644 --- a/docs/after/post_mortem_process.md +++ b/docs/after/post_mortem_process.md @@ -1,8 +1,6 @@ --- cover: assets/img/covers/post-mortem_process.png description: For every major incident (SEV-2/1), we need to follow up with a postmortem. A blame-free, detailed description, of exactly what went wrong in order to cause the incident, along with a list of steps to take in order to prevent a similar incident from occurring again in the future. -hero: assets/img/headers/pagerduty_post_mortem.jpg -hero_alt_text: Postmortem --- For every major incident (SEV-2/1), we need to follow up with a postmortem. A blame-free, detailed description, of exactly what went wrong in order to cause the incident, along with a list of steps to take in order to prevent a similar incident from occurring again in the future. The incident response process itself should also be included. 
diff --git a/docs/assets/img/crisis/01_oncallrestrictions.png b/docs/assets/img/crisis/01_oncallrestrictions.png new file mode 100644 index 0000000..cb9e928 Binary files /dev/null and b/docs/assets/img/crisis/01_oncallrestrictions.png differ diff --git a/docs/assets/img/crisis/02_escalationtimeout.png b/docs/assets/img/crisis/02_escalationtimeout.png new file mode 100644 index 0000000..5d4ef54 Binary files /dev/null and b/docs/assets/img/crisis/02_escalationtimeout.png differ diff --git a/docs/assets/img/crisis/03_roundrobin.png b/docs/assets/img/crisis/03_roundrobin.png new file mode 100644 index 0000000..ca47bce Binary files /dev/null and b/docs/assets/img/crisis/03_roundrobin.png differ diff --git a/docs/assets/img/crisis/04_remediationdocs.png b/docs/assets/img/crisis/04_remediationdocs.png new file mode 100644 index 0000000..c982bdd Binary files /dev/null and b/docs/assets/img/crisis/04_remediationdocs.png differ diff --git a/docs/assets/img/crisis/05_priorities.png b/docs/assets/img/crisis/05_priorities.png new file mode 100644 index 0000000..0779a2a Binary files /dev/null and b/docs/assets/img/crisis/05_priorities.png differ diff --git a/docs/assets/img/crisis/06_incidentworkflows.png b/docs/assets/img/crisis/06_incidentworkflows.png new file mode 100644 index 0000000..024909e Binary files /dev/null and b/docs/assets/img/crisis/06_incidentworkflows.png differ diff --git a/docs/assets/img/crisis/07_operationscloud.png b/docs/assets/img/crisis/07_operationscloud.png new file mode 100644 index 0000000..b3fa0fb Binary files /dev/null and b/docs/assets/img/crisis/07_operationscloud.png differ diff --git a/docs/assets/img/crisis/08_postmortemdraft.png b/docs/assets/img/crisis/08_postmortemdraft.png new file mode 100644 index 0000000..f8c9b0e Binary files /dev/null and b/docs/assets/img/crisis/08_postmortemdraft.png differ diff --git a/docs/assets/img/crisis/09_usercontactinfo.png b/docs/assets/img/crisis/09_usercontactinfo.png new file mode 100644 index 
0000000..f46ec90 Binary files /dev/null and b/docs/assets/img/crisis/09_usercontactinfo.png differ diff --git a/docs/assets/img/crisis/10_highurgencynotifications.png b/docs/assets/img/crisis/10_highurgencynotifications.png new file mode 100644 index 0000000..ff4179d Binary files /dev/null and b/docs/assets/img/crisis/10_highurgencynotifications.png differ diff --git a/docs/assets/img/crisis/11_escalationpolicy.png b/docs/assets/img/crisis/11_escalationpolicy.png new file mode 100644 index 0000000..790d827 Binary files /dev/null and b/docs/assets/img/crisis/11_escalationpolicy.png differ diff --git a/docs/assets/img/crisis/12_schedulelayers.png b/docs/assets/img/crisis/12_schedulelayers.png new file mode 100644 index 0000000..63748f3 Binary files /dev/null and b/docs/assets/img/crisis/12_schedulelayers.png differ diff --git a/docs/assets/img/crisis/13_incidentworkflows.png b/docs/assets/img/crisis/13_incidentworkflows.png new file mode 100644 index 0000000..2fd5b5c Binary files /dev/null and b/docs/assets/img/crisis/13_incidentworkflows.png differ diff --git a/docs/assets/img/crisis/14_incidentstatusupdates.png b/docs/assets/img/crisis/14_incidentstatusupdates.png new file mode 100644 index 0000000..182575b Binary files /dev/null and b/docs/assets/img/crisis/14_incidentstatusupdates.png differ diff --git a/docs/assets/img/crisis/cover_crisisresponse.png b/docs/assets/img/crisis/cover_crisisresponse.png new file mode 100644 index 0000000..70bd59e Binary files /dev/null and b/docs/assets/img/crisis/cover_crisisresponse.png differ diff --git a/docs/assets/img/crisis/hero_EUEmergencyResponseCoordinationCentreBrussels.png b/docs/assets/img/crisis/hero_EUEmergencyResponseCoordinationCentreBrussels.png new file mode 100644 index 0000000..4ffbdc1 Binary files /dev/null and b/docs/assets/img/crisis/hero_EUEmergencyResponseCoordinationCentreBrussels.png differ diff --git a/docs/assets/img/crisis/hero_FEMAJocelynAugustino.png 
b/docs/assets/img/crisis/hero_FEMAJocelynAugustino.png new file mode 100644 index 0000000..ba01848 Binary files /dev/null and b/docs/assets/img/crisis/hero_FEMAJocelynAugustino.png differ diff --git a/docs/assets/img/headers/iStock-1097331490-3992x2242-e4f3f2d.png b/docs/assets/img/headers/iStock-1097331490-3992x2242-e4f3f2d.png new file mode 100644 index 0000000..58d2256 Binary files /dev/null and b/docs/assets/img/headers/iStock-1097331490-3992x2242-e4f3f2d.png differ diff --git a/docs/before/call_etiquette.md b/docs/before/call_etiquette.md index 4fa1673..c34a28a 100644 --- a/docs/before/call_etiquette.md +++ b/docs/before/call_etiquette.md @@ -1,11 +1,6 @@ --- cover: assets/img/covers/call_etiquette.png description: You've just joined an incident call, and you've never been on one before. You have no idea what's going on, or what you're supposed to be doing. This page will help you through your first time on an incident call, and will provide a reference for future calls you may be a part of. -hero: assets/img/headers/obama_phone.jpg -hero_alt_text: Obama Phone -hero_credit_url: https://commons.wikimedia.org/wiki/File:Barack_Obama_on_phone_with_Benjamin_Netanyahu_2009-06-08.jpg -hero_credit_url_text: Official White House Photo -hero_credit_text: by Pete Souza --- You've just joined an incident call and you've never been on one before. You have no idea what's going on or what you're supposed to be doing. This page will help you through your first time on an incident call, and will provide a reference for future calls you may be a part of. 
diff --git a/docs/before/what_is_an_incident.md b/docs/before/what_is_an_incident.md index ad9d650..773cf95 100644 --- a/docs/before/what_is_an_incident.md +++ b/docs/before/what_is_an_incident.md @@ -1,8 +1,6 @@ --- cover: assets/img/covers/incident.png description: Before defining an incident response process, we should first define what an incident (and a major incident) is, along with how we should trigger the response for such incidents. -hero: assets/img/headers/server_incident.png -hero_alt_text: Incident --- Before we can define our incident response process, we should first define what an incident (and a major incident) is. diff --git a/docs/crisis/crisis_intro.md b/docs/crisis/crisis_intro.md new file mode 100644 index 0000000..eed3157 --- /dev/null +++ b/docs/crisis/crisis_intro.md @@ -0,0 +1,24 @@ +--- +cover: assets/img/crisis/cover_crisisresponse.png +description: Your organization's crisis response plan requires strong leadership. The right kind of crisis leadership is values-driven and maintains the balancing act between carefully and thoughtfully responding to what went wrong and deliberately capturing mindshare or new business based on the effectiveness of your response. +hero: assets/img/headers/iStock-1097331490-3992x2242-e4f3f2d.png +--- + +## Introduction to Crisis Response Management Operations + +A critical partner in your supply chain just went down. An earthquake just hit your main operations hub. Breaking news about your organization just hit social media. A crisis can happen at any time. Are you ready for it? The way you handle your worst day will leave lasting impressions about your brand and its perceived value in the eyes of your current and potential customers. + +Bad news first. There's always another crisis or existential threat on the horizon. If you don’t have an established Crisis Response process and team in place, you’re running a high risk of failure. 
If you do have a process and team, you should be continuously iterating and improving your leadership, plans and practices to guard against mistakes that can cause irreparable damage to your brand. + +The good news is that this guide is built to bring your Crisis Response Management Operations up to speed using best practices, and leveraging PagerDuty’s Operations Cloud. + +## Audience +Business leaders responsible for crisis, risk and / or emergency management who want to enhance their crisis response processes and improve their mean time to respond (MTTR). + +## Content of this Guide +- [Terminology](terms.md) - a list of key terms and concepts used in this guide +- [Crisis Leadership](leadership.md) - incorporating basic principles and your values in your response +- [Crisis Response Operations](operations.md) - activating your crisis response plans +- [Pre-crisis Phase](prework.md) - capitalizing on preparedness activities to keep your teams ready and engaged +- [PagerDuty for CRMOps](pagerduty.md) - how PagerDuty leverages PagerDuty for crisis response management operations + diff --git a/docs/crisis/leadership.md b/docs/crisis/leadership.md new file mode 100644 index 0000000..1f6fa14 --- /dev/null +++ b/docs/crisis/leadership.md @@ -0,0 +1,98 @@ +--- +cover: assets/img/crisis/cover_crisisresponse.png +description: Your organization's crisis response plan requires strong leadership. The right kind of crisis leadership is values-driven and maintains the balancing act between carefully and thoughtfully responding to what went wrong and deliberately capturing mindshare or new business based on the effectiveness of your response. +--- + + +Effective crisis management requires leadership. Crisis Leadership underlines how your corporate leaders apply your organization’s values to all stages of a crisis. + +## Why is Crisis Leadership important for your organization +With every crisis, there is danger and opportunity. 
The right kind of leadership is vital in the critical moments of your company’s history. The right kind of crisis leadership is values-driven and maintains the balancing act between carefully and thoughtfully responding to what went wrong and deliberately capturing mindshare or new business based on the effectiveness of your response. + +When your company’s values are at the forefront, your stakeholder communications and public statements remain consistent. Your audience can always tell when you’re backpedaling from established viewpoints or bandwagoning. You avoid compounding the situation by being consistent. No two crises are alike just as no two organizations are alike. Crisis Leadership centers on you—not others—telling your constituents your organization’s story from one crisis to the next. + +## Considerations for Crisis Leaders + +The unpredictable and fluid nature of a crisis requires situational awareness. Being aware of what you know and don't know is crucial. Continually monitoring the situation, predicting statuses and being prepared to roll with the changing environment makes your company adept at crisis response and provides your team with purpose, i.e., everyone is in sync and working towards the same goal. + +An increasingly important aspect of Crisis Leadership is taking care of yourself and your team. Members of your crisis response team may have been impacted by the events but are still working to resolve it. Some of your team may have been awake for 24 hours needing someone to give them permission to step away. Fatigue may be setting in and so forth. Leveraging the functionality built into the PagerDuty platform to establish on-call rotations, hand-offs and integrate video conferencing technology like Zoom or Teams can help create a safe and healthy [on call culture](https://goingoncall.pagerduty.com/culture/) for your teams while responding to what could be a protracted situation. 
+ +![On-call Restrictions by day and hour](../assets/img/crisis/01_oncallrestrictions.png) + +## Do's and Don'ts when Leading A Crisis +Successful and unsuccessful corporate responses to crises are all around us. In fact, the chances are high that there’s one of each happening in the news at the time you're reading this guide. What’s important is that you learn from the very public mistakes of others and develop your core principles in the form of a do's and don'ts list. Some of them may be obvious but they’re still worth documenting. Here are a few common examples: + +- **DO** have a set of generic holding statements ready to go that can be easily customized for specific situations (e.g., vendor bankruptcy, cyber incident, product recall, high-profile departure, etc.) + +- **DO** be cautious about when and how you respond, as there is always a risk that the news could break before you’ve commented + +- **DO** be measured in your response and avoid playing whack-a-mole trying to respond to every negative post, inquiry or attack + +- **DON’T** assume multiple crises or incidents happening at the same time are related + +- **DON’T** copy and paste—i.e., take actions that are unique to your organization’s values, history, and risk profile, and within your capabilities or you’ll risk greater exposure + +- **DON’T** assume that making proactive non-obligatory public statements is without great risk—you need to carefully weigh your decision with your Legal team in this regard + +- **DON’T** assume that what you've said internally or to a subset of customers or investors won’t go public + +## Crisis Scenario Planning + +Crisis Leaders should always plan for the company’s worst day before it becomes a reality. If the 2020 pandemic was your first existential crisis, Murphy’s Law says it won’t be your last. It’s likely you’ll experience multiple crises during your tenure at an organization. 
Referencing your company’s historical crises while planning is one piece of the puzzle. However, scenario planning is forward-looking and homes in on the most likely and most damaging crisis scenarios for your organization so that you can proactively develop teams, plans and playbooks. There are a myriad of scenarios to choose from but here are a few examples: + +- Critical infrastructure attack (e.g., power, water, transportation) +- Cyber incident (e.g., ransomware, data breach) +- Pandemics (e.g., contagions) +- Environmental disaster (e.g., earthquake, hurricane, drought) +- Human resources crisis (e.g., union strikes, walk-outs, labor shortage) +- Geopolitical disaster (e.g., war, coup d’etat) +- Terrorism (e.g., political violence, sabotage) +- Economic disaster (e.g., stock market crash, currency crisis) +- Industrial accident (e.g., gas leak, building collapse) + +If time were infinite and the world were static, you could plan for all of the scenarios in the world. However, the goal is to select a handful of scenarios from your list and build transferable principles and skills that prepare you for a wider range of crises. Another way to do that is by focusing on the consequences across your scenarios and solving for those capability gaps by adding controls such as playbooks, runbooks or predefined tactical response teams. You may also find that the order of criticality changes as the operating environment changes, so periodic review of your top scenarios and the associated plans and teams is important. + +## Assembling An Executive Crisis Leadership Team + +Developing an Executive Crisis Leadership Team is a good starting point when considering the scope, scale and role of your Crisis Response team. This group will consist of functional business owners from all areas of your organization from Communications to Legal to Human Resources and so on. 
Consider starting with some or all of the following functional roles: + +- Chief Executive Officer +- Chief Legal Officer +- Chief Communications Officer +- Chief Financial Officer +- Chief Information Security Officer +- Chief Human Resource Officer +- Chief Operating Officer +- Chief Information Officer +- Chief Resilience Officer +- Chief Revenue Officer +- Chief Marketing Officer +- Chief Security Officer + +There’s no one size fits all and you may not need all of these roles in your Executive Crisis Leadership Team. It’s also important to consider your Board of Directors—if you have one—as an extension of your Executive Crisis Leadership Team. Similarly, external resources like Public Relations/Crisis Management firms, Disaster Recovery services, insurance providers, Digital Forensic Specialists or Local/Federal authorities should not be overlooked as essential contacts to document. + +## Crisis Team Leaders + +It’s important to put a face and single voice to a crisis. A Crisis Team Leader is the individual responsible for leading the organization through a crisis having overall responsibility based on their area(s) of expertise. They’re similar to an Incident Commander for a crisis situation. However, a Crisis Team Leader may function more as an Area Commander if there are multiple Incident Commanders to oversee in a complex situation. + +Once you’ve built your handful of scenarios, assigning members of your organization as the team leader along with their backup is the next step. 
See the table below for an example: + +| **Crisis Scenario** | **Scenario Examples** | **Crisis Team Leader** | **Potential Backup** | +| ------------------- | --------------------- | ---------------------- | -------------------- | +| Critical infrastructure attack | Energy grid, water supply, telecommunications | Chief Operating Officer | Logistics Chief | +| Environmental disaster | Earthquake, hurricane, volcano | Chief Resilience Officer | Safety Chief | +| Human resources crisis | Labor strike, protests, labor violation | Chief Human Resource Officer | Operations Chief | +| Marketing campaign failure | Typo, untrue product claim, wrong tone | Chief Digital Officer | Communications Chief | + + +Using PagerDuty, you can build your [on-call schedule](https://support.pagerduty.com/docs/schedule-basics) right inside the platform, providing visibility and accountability about who’s on call for what area of the business if a crisis situation takes place. You can also add backups using an escalation policy that alerts the next person up after a custom time delay. + +![Set escalation timeouts](../assets/img/crisis/02_escalationtimeout.png) + + +If you want to balance the load for your on-call team, [round robin scheduling](https://support.pagerduty.com/docs/round-robin-scheduling) can help by alternating which team member is notified first for each crisis notification. + +![Use round-robin scheduling](../assets/img/crisis/03_roundrobin.png) + +## Succession planning +As you examine the makeup of your Executive Crisis Leadership Team, Crisis Team Leaders and their backups, you should view it through the lens of succession planning or failover mapping. Depending on the makeup of your organization and geographical concentrations, you may want to further diversify your members to spread the risk. If everyone is positioned close together, an impact to that region will lead to failure and extended MTTRs. 
You will want your PagerDuty rotations and/or escalation policies to reflect this strategy. diff --git a/docs/crisis/operations.md b/docs/crisis/operations.md new file mode 100644 index 0000000..76fe37f --- /dev/null +++ b/docs/crisis/operations.md @@ -0,0 +1,43 @@ +--- +cover: assets/img/crisis/cover_crisisresponse.png +description: Operationalizing your crisis plan begins by making practical changes to ensure you have what you need, in the way you need it, and at the time you need it. +--- + +## Operationalizing Your Crisis Plan + +Operationalizing your crisis plan begins by making practical changes to ensure you have what you need, in the way you need it, and at the time you need it. For example, your broader crisis management plan will be too cumbersome for your team to scan through for answers during a crisis situation. On the other hand, playbooks are more focused versions of your larger plan which make them easier to action, test and maintain. They’re also scenario-driven and provide you with specific parameters, considerations and tasks. + +Once you have these critical resources created, it can be difficult to centralize them and keep track of the most current version. PagerDuty makes this easy with the ability to add your runbooks, playbooks, policies and any other crisis response [documentation links](https://support.pagerduty.com/docs/service-profile#remediate) into your PagerDuty defined service(s). + +![Ensure that your PagerDuty services have links to their runbooks and documentation](../assets/img/crisis/04_remediationdocs.png) + +## Crisis Classification Scheme + +Waking up your Executive Crisis Leadership Team in the middle of the night with a PagerDuty alert should be a very rare occurrence. Having a [classification scheme](https://support.pagerduty.com/docs/incident-priority#establish-an-incident-classification-scheme) in place to rank the actual or anticipated materiality of an event will help you avoid a cry wolf scenario. 
A simple scale such as Low, Medium, High or Level 1, 2, 3 can be effective. + +Within PagerDuty, you can add your crisis “material impact levels” using the [incident priority](https://support.pagerduty.com/docs/incident-priority) feature. Remember that not all crises begin as a crisis. A crisis may develop out of an ongoing incident, so determining your thresholds for escalation ahead of time (e.g., 90 minutes without HVAC, 24 hours without direct contact, greater than $100k revenue at risk, etc.) is as important as the rankings. + +![Set and define priorities that make sense for your organization](../assets/img/crisis/05_priorities.png) + +Once you’ve defined your priorities, you can begin to leverage PagerDuty to automate parts of your crisis response through integrations and [incident workflows](https://support.pagerduty.com/docs/incident-workflows). You can integrate with Slack, Teams or Zoom for creating communications channels. You can auto-publish from templates to post on internal status pages. You can auto-initiate stakeholder alerts or [subscriptions](https://support.pagerduty.com/docs/communicate-with-stakeholders#add-subscribers-at-incident-creation), etc. + +![Use incident workflows to streamline response.](../assets/img/crisis/06_incidentworkflows.png) + +In a crisis situation, time savings are everything. Decreasing the mean time to respond and getting in touch with the right people is the most critical action your team can take at the onset of a crisis. + +## Crisis Declaration + +Does your crisis response team operate the same in a crisis as they do in normal business situations? Your answer should be no. Operating in a “crisis mode” should be distinctive because all actions and decisions are amplified, the tempo is quicker, the need for timely decisions is critical, the complexity of the problems is greater, the risks are higher, etc. 
+ +The Crisis Team Leader needs to clearly and definitively signal that the modes of thinking and processing have shifted. What better way to signal that shift than through a PagerDuty alert? The incident priority feature is an easy way to make that declaration to the necessary stakeholders in a not so public way. Declaring the response as over is also important in transitioning to normal or new ways of doing things, which can be completed by [resolving the alert](https://support.pagerduty.com/docs/alerts#resolve-alerts) created on your crisis service(s) or posting to an internal status page. + +## Crisis Response Management Operations + +If you’ve followed along so far, you’ve essentially learned the ins and outs of a PagerDuty instance for crisis response. During your response, you don’t want to worry about how to contact the Crisis Team Leaders or which conference bridge you should be using or where your most up to date playbook is located. The operations side of things should just work. Aside from PagerDuty’s built-in alerting capabilities, the platform has 700+ [integrations](https://www.pagerduty.com/integrations/#Integrations-library) and more are possible through the API so you can bring your existing technology stack. + +[Adding integrations](https://support.pagerduty.com/docs/services-and-integrations#add-integrations-to-an-existing-service) to your service(s) for crisis response at the minimum should include an email integration, an instant messaging integration with Slack, Google Chat, etc. and a video conferencing tool such as Zoom, Microsoft Teams, etc. 
This standard grouping enables you to trigger alerts multiple ways (e.g., web, mobile, email, API and instant messaging) and alert or advise your Executive Crisis Leadership Team that something is up (e.g., PagerDuty alert via email, SMS, push or voice, automated group channel message and [subscribers](https://support.pagerduty.com/docs/communicate-with-stakeholders#subscribe-to-a-business-service) to a service). + +Given the scope of the [PagerDuty Operations Cloud](https://www.pagerduty.com/operations-cloud/), you’re likely not the only group within your organization running their operations through the platform. Your Customer Service organization may be using the platform alongside your Technical Operations organization. As a result, you’ll want to deploy some tradecraft as you trigger alerts, add notes and publish status pages to maintain the right level of privacy and compliance. + +![The PagerDuty Operations Cloud](../assets/img/crisis/07_operationscloud.png) + diff --git a/docs/crisis/pagerduty.md b/docs/crisis/pagerduty.md new file mode 100644 index 0000000..f71d8b2 --- /dev/null +++ b/docs/crisis/pagerduty.md @@ -0,0 +1,50 @@ +--- +cover: assets/img/crisis/cover_crisisresponse.png +description: PagerDuty's Operations Cloud provides various tools and features that will help your organization manage crises effectively. +--- + +## PagerDuty Configuration +How to set up your Crisis Response Management instance in PagerDuty: + +[PagerDuty Mobile app](https://support.pagerduty.com/docs/mobile-app) - Ask each member to install and configure the mobile app for maximum reachability. + +[User Management](https://support.pagerduty.com/docs/users#add-users) - Make sure you’ve added your Executive Crisis Leadership and Crisis Response Team members to the system. 
+ +[Contact information](https://support.pagerduty.com/docs/user-profile) - Ask each member to log into the web application and update their profile information including their phone, email and SMS contact information especially if they’ve changed devices. + +![PagerDuty user contact information settings](../assets/img/crisis/09_usercontactinfo.png) + +[Notification rules](https://support.pagerduty.com/docs/user-profile#notification-rules) - Ask each member to set their high urgency, low urgency, handoff and subscriber notification rules under their profile. + +![Use multiple contact methods for high urgency incidents](../assets/img/crisis/10_highurgencynotifications.png) + +[Teams](https://support.pagerduty.com/docs/teams) - Create teams for your Executive Crisis Leadership Team, each of your Crisis Team Leaders, and essential support functions like Crisis Communications, IT or Legal + +[Services](https://support.pagerduty.com/docs/services-and-integrations#create-a-service) - Create and configure a service for each of your crisis categories led by your Crisis Team Leaders, e.g., supply chain, human resources, critical infrastructure, geopolitics, physical security, etc. 
+ +[Urgency](https://support.pagerduty.com/docs/service-settings#notification-urgency) - Set your notification urgency for each service whether high, low, dynamic or based on operating hours + +[Escalation policies](https://support.pagerduty.com/docs/escalation-policies#create-an-escalation-policy) - Decide who gets notified first and how long before the notification escalates to the next team member and configure round robin scheduling if you wish to alternate per crisis + +![Escalation policies determine which responders are contacted](../assets/img/crisis/11_escalationpolicy.png) + +[Integrations](https://support.pagerduty.com/docs/services-and-integrations#add-integrations-to-an-existing-service) - Add your instant messaging, video conferencing tool or create a custom email integration or connections to other systems for triggering alerts + +[Schedules](https://support.pagerduty.com/docs/schedule-basics#create-a-schedule) - Create your on-call rotations for the teams associated with each crisis service + +![Using multiple layers in schedules helps teams create full coverage](../assets/img/crisis/12_schedulelayers.png) + +[Incident Priority](https://support.pagerduty.com/docs/incident-priority) - Add your custom classification scheme for your crisis response escalation levels + +[Incident workflows](https://support.pagerduty.com/docs/incident-workflows) - Create your workflows for each crisis based on conditions such as priority, status and urgency using system templates or from scratch + +![Incident workflows can help with communication and coordination](../assets/img/crisis/13_incidentworkflows.png) + +[On-call readiness report](https://support.pagerduty.com/docs/on-call-readiness-reports) - Confirm that your teams are on-call ready and properly configured + +[Postmortem template](https://support.pagerduty.com/docs/postmortems#customize-the-postmortem-template) - Configure your postmortem template to fit your needs post-crisis + +[Status 
pages](https://support.pagerduty.com/docs/status-pages) - Configure your status page templates for internal stakeholders + +![Use status updates to communicate with stakeholders](../assets/img/crisis/14_incidentstatusupdates.png) + diff --git a/docs/crisis/prework.md b/docs/crisis/prework.md new file mode 100644 index 0000000..e9fa8c7 --- /dev/null +++ b/docs/crisis/prework.md @@ -0,0 +1,14 @@ +--- +cover: assets/img/crisis/cover_crisisresponse.png +description: Take the time to stress test your work before the real world does it for you. Practice and simulations are crucial for flagging any gaps or blindspots you want to be aware of ahead of a crisis. This is also an opportunity to build strong leadership, reduce mean time to respond and develop good habits within your team. +--- + +## Pre-crisis +You now have your Executive Crisis Leadership team, your crisis response management configured in PagerDuty, your 100-page crisis management plan and a shorter scenario-driven playbook for crisis response. Now what? This is the time to stress test your work before the real world does it for you. This step is crucial for flagging any gaps or blindspots you want to be aware of ahead of a crisis. This is also an opportunity to build strong leadership, reduce mean time to respond and develop good habits within your team. + +## Crisis Simulations +Conducting discussion-based tabletop exercises with your team is an ideal starting point. However, leveraging functional exercises to simulate your level of maturity with crisis coordination, and command and control is also important. Running a crisis simulation using PagerDuty is as simple as triggering an alert on your crisis service—randomly if you really want to simulate real life. You would then follow your typical process of getting the right people on a conference call or instant messaging channel through an integration and running through a scenario with your corresponding playbook. 
+
+The PagerDuty platform will automatically track the length of the exercise and record any notes or status changes in the timeline, which you can then use in your [postmortem](https://postmortems.pagerduty.com/what_is/) (i.e., after-action report or hotwash) and in developing further tabletops or simulations.
+
+A biannual cadence for crisis simulations provides sufficient time for preparation and to review the findings in the postmortem.
diff --git a/docs/crisis/terms.md b/docs/crisis/terms.md
new file mode 100644
index 0000000..bcdcd06
--- /dev/null
+++ b/docs/crisis/terms.md
@@ -0,0 +1,22 @@
+---
+cover: assets/img/crisis/cover_crisisresponse.png
+description: If crisis response is new to you or your organization, these terms will help you establish a baseline for understanding.
+---
+
+The following list of terms will help you navigate the Crisis Response section of this document. You'll find some terms mirror those used in incident response, while some will have slightly different definitions.
+
+- **Crisis** - an extraordinary event or situation that serves as an existential threat to people, the operational environment, assets or reputation requiring a strategic, adaptive and timely organizational response
+- **Crisis Management** - coordinated activities to lead, direct and control an organization with regard to its response to a crisis
+- **Incident** - a potentially costly yet generally foreseeable event, e.g., availability, performance degradation, market volatility, increased competition, etc., that disrupts normal business processes
+- **Emergent Issue** - a situational problem that does not present a significant impact to strategic objectives, reputation or viability of the organization but may require comment or organized activity to avert an incident or crisis
+- **Emergency** - a sudden, short-lived threat generally involving people or physical assets that requires immediate action or assistance to mitigate
+- **Incident Command System (ICS)** - a standardized emergency management construct created by the U.S. Federal Emergency Management Agency (FEMA) to provide an integrated organizational structure that reflects the complexity and demands of single or multiple incidents
+- **Area Command** - an organizational structure used by the Federal Emergency Management Agency (FEMA) to oversee the management of multiple incidents or oversee the management of a very large or evolving situation with multiple ICS organizations
+- **Incident Commander (IC)** - the individual responsible for incident management having overall authority and responsibility for conducting and managing response operations
+- **Executive Crisis Leadership Team** - a group of Executives who make enterprise-wide strategic decisions for the business during a crisis situation and pre-crisis
+- **Crisis Management Plan** - a document defining the personnel, procedures and resources needed to manage a crisis from beginning to end
+- **Crisis Team Leader** - the equivalent of an Incident Commander for a crisis situation based on their functional area of expertise or charge
+- **Crisis Response** - the immediate actions taken in response to a crisis situation to contain it and the steps taken pre-crisis to develop the capabilities and readiness to do so
+- **Crisis Response Management** - the triage and initial treatment stages of a crisis situation and the effective activations to support the complete crisis management lifecycle
+- **Incident Response** - the process for addressing and solving an incident of varying severity in order to limit damage and reduce recovery time and costs
+- **Emergency Response** - the process of responding to a range of life-safety events and threats to the environment to mitigate damage and loss
diff --git a/docs/getting_started.md b/docs/getting_started.md
index 0b881b2..13d555b 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -1,11 +1,8 @@
 ---
 cover: assets/img/covers/getting_started.png
 description: This 'Getting Started' guide will help you to navigate the most important parts of our process, and provide some guidelines about which bits we think you should start with. If you're just starting out with your own incident response process, this is a great way to know what order we think you should do things in.
-hero: assets/img/headers/getting_started.jpg
+hero: assets/img/headers/iStock-1097331490-3992x2242-e4f3f2d.png
 hero_alt_text: Getting Started
-hero_credit_url: https://www.pexels.com/photo/young-game-match-kids-2923/
-hero_credit_url_text: Pexels
-hero_credit_text: Breakingpic
 ---
 If you don't yet have a process in your own organization, or if you're just starting out, you may find the sheer quantity of information in this documentation overwhelming. It's important to remember that this **isn't something you'll be able to implement overnight**. This is a process that should be built up over time. While it took us years to get to this point, our hope is that you can make use of this documentation to skip some of the awkward growing pains we went through and reach a more mature incident response process in the most efficient way possible.
diff --git a/docs/index.md b/docs/index.md
index 3e1f4dd..0150c35 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,6 +1,6 @@
 ---
 cover: assets/img/covers/incident_response_docs.png
-hero: assets/img/headers/pagerduty_ir.jpg
+hero: assets/img/headers/iStock-1097331490-3992x2242-e4f3f2d.png
 hero_alt_text: Incident Response at PagerDuty
 ---
 This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation used at PagerDuty for any major incidents and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after the incident. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process). See the [about page](about.md) for more information on what this documentation is and why it exists.
@@ -41,6 +41,17 @@ Our followup-processes, how we make sure we don't repeat mistakes, and are alway
 * [Postmortem Template](after/post_mortem_template.md) - _The template we use for writing our postmortems for major incidents._
 * [Effective Postmortems](after/effective_post_mortems.md) - _A guide for writing effective postmortems._
+## Crisis Response
+
+Incident response is about more than dealing with technical incidents. A crisis can happen at any time. Are you ready for it? The way you handle your worst day will leave lasting impressions about your brand and its perceived value in the eyes of your current and potential customers.
+
+* [Introduction](crisis/crisis_intro.md) - _An introduction to crisis response and who this document is intended for._
+* [Terminology](crisis/terms.md) - _A list of key terms and concepts used in this guide._
+* [Crisis Leadership](crisis/leadership.md) - _Incorporating basic principles and your values in your response._
+* [Crisis Response Operations](crisis/operations.md) - _Activating your crisis response plans._
+* [Pre-crisis Phase](crisis/prework.md) - _Capitalizing on preparedness activities to keep your teams ready and engaged._
+* [PagerDuty for CRMOps](crisis/pagerduty.md) - _How PagerDuty leverages PagerDuty for crisis response management operations._
+
 ## Training
 So you want to learn about incident response? You've come to the right place.
diff --git a/docs/oncall/being_oncall.md b/docs/oncall/being_oncall.md
index 59d4348..3c0146c 100644
--- a/docs/oncall/being_oncall.md
+++ b/docs/oncall/being_oncall.md
@@ -1,8 +1,6 @@
 ---
 cover: assets/img/covers/being_on-call.png
 description: A summary of the expectations and responsibilities of being on-call at PagerDuty, along with some best practice and etiquette recommendations.
-hero: assets/img/headers/alert_fatigue.png
-hero_alt_text: Alert Fatigue
 ---
 A summary of expectations and helpful information for being on-call.
diff --git a/docs/oncall/whos_oncall.md b/docs/oncall/whos_oncall.md
index a667a14..51bbf10 100644
--- a/docs/oncall/whos_oncall.md
+++ b/docs/oncall/whos_oncall.md
@@ -1,8 +1,6 @@
 ---
 cover: assets/img/covers/whos_on-call.png
 description: Organizational structures vary, but these are general guidelines about the way different functions in a business relate to incident response.
-hero: assets/img/headers/who_oncall.png
-hero_alt_text: Who's On-Call?
 ---
 Organizational structures vary, but these are general guidelines about the way different functions in a business relate to incident response.
diff --git a/docs/resources/reading.md b/docs/resources/reading.md
index 045c456..0c94a2f 100644
--- a/docs/resources/reading.md
+++ b/docs/resources/reading.md
@@ -1,10 +1,8 @@
 ---
 cover: assets/img/covers/reading.png
 description: This is a collection of additional reading on the topic of incident response that we've found useful.
-hero: assets/img/headers/resources.jpg
-hero_alt_text: Looking up information
-hero_credit_url: https://www.publicdomainpictures.net/en/view-image.php?image=151506&picture=young-woman-my-computer
-hero_credit_url_text: Axelle B
+hero: assets/img/headers/iStock-1097331490-3992x2242-e4f3f2d.png
+hero_alt_text: Incident Response at PagerDuty
 ---
 This is a collection of additional reading on the topic of incident response that we've found useful.
diff --git a/docs/training/customer_liaison.md b/docs/training/customer_liaison.md
index 70f4c3c..e34874e 100644
--- a/docs/training/customer_liaison.md
+++ b/docs/training/customer_liaison.md
@@ -1,8 +1,6 @@
 ---
 cover: assets/img/covers/customer_liaison.png
 description: So you want to be a customer liaison? You've come to the right place!
-hero: assets/img/headers/status_page.jpg
-hero_alt_text: PagerDuty Status Page
 ---
 So you want to be a Customer Liaison? You've come to the right place!
diff --git a/docs/training/deputy.md b/docs/training/deputy.md
index c4d1b2f..5f67d59 100644
--- a/docs/training/deputy.md
+++ b/docs/training/deputy.md
@@ -1,10 +1,6 @@
 ---
 cover: assets/img/covers/deputy.png
 description: So you want to be a Deputy? You've come to the right place!
-hero: assets/img/headers/incident_command_support.jpg
-hero_alt_text: Deputy
-hero_credit_url: https://www.flickr.com/photos/oregondot/8743801731/in/album-72157633494644719/
-hero_credit_url_text: oregondot @ Flickr
 ---
 So you want to be a Deputy? You've come to the right place!
diff --git a/docs/training/incident_commander.md b/docs/training/incident_commander.md
index 6549dfd..9dfffa4 100644
--- a/docs/training/incident_commander.md
+++ b/docs/training/incident_commander.md
@@ -1,10 +1,7 @@
 ---
 cover: assets/img/covers/incident_commander.png
 description: So you want to be an incident commander? You've come to the right place! You don't need to be a senior team member to become an IC, anyone can do it providing you have the requisite knowledge (yes, even an intern!)
-hero: assets/img/headers/gene_kranz.jpg
-hero_alt_text: Gene Kranz
-hero_credit_url: https://en.wikipedia.org/wiki/File:Eugene_F._Kranz_at_his_console_at_the_NASA_Mission_Control_Center.jpg
-hero_credit_url_text: NASA
+hero: assets/img/headers/iStock-1097331490-3992x2242-e4f3f2d.png
 ---
 So you want to be an Incident Commander (IC)? You've come to the right place! You don't need to be a senior team member to become an IC, anyone can do it providing you have the requisite knowledge (yes, even an intern!)
diff --git a/docs/training/internal_liaison.md b/docs/training/internal_liaison.md
index 7f7512b..4553ba3 100644
--- a/docs/training/internal_liaison.md
+++ b/docs/training/internal_liaison.md
@@ -1,8 +1,6 @@
 ---
 cover: assets/img/covers/internal_liaison.png
 description: So you want to be an internal liaison? You've come to the right place!
-hero: assets/img/headers/internal_liaison.jpg
-hero_alt_text: Internal Liaison
 ---
 So you want to be an Internal Liaison? You've come to the right place!
diff --git a/docs/training/scribe.md b/docs/training/scribe.md
index 9264673..f652423 100644
--- a/docs/training/scribe.md
+++ b/docs/training/scribe.md
@@ -1,10 +1,6 @@
 ---
 cover: assets/img/covers/scribe.png
 description: So you want to be a scribe? You've come to the right place! You don't need to be a senior team member to become a Deputy or Scribe, anyone can do it providing you have the requisite knowledge!
-hero: assets/img/headers/fountain_pen.jpg
-hero_alt_text: Scribe
-hero_credit_url: https://www.pexels.com/photo/person-holding-fountain-pen-211291/
-hero_credit_url_text: John-Mark Smith
 ---
 So you want to be a Scribe? You've come to the right place! You don't need to be a senior team member to become a Deputy or Scribe, anyone can do it providing you have the requisite knowledge!
diff --git a/docs/training/subject_matter_expert.md b/docs/training/subject_matter_expert.md
index 0ece793..b2f3c03 100644
--- a/docs/training/subject_matter_expert.md
+++ b/docs/training/subject_matter_expert.md
@@ -1,10 +1,6 @@
 ---
 cover: assets/img/covers/sme.png
 description: If you are on-call for any team at PagerDuty, you may be paged for a major incident and will be expected to respond as a subject matter expert (SME) for your service. This page details everything you need to know in order to be prepared for that responsibility.
-hero: assets/img/headers/incident_response.jpg
-hero_alt_text: Incident Response
-hero_credit_url: https://www.flickr.com/photos/oregondot/8743809853/in/album-72157633494644719/
-hero_credit_url_text: oregondot @ Flickr
 ---
 If you are on-call for any team at PagerDuty, you may be paged for a major incident and will be expected to respond as a subject matter expert (SME) for your service. This page details everything you need to know in order to be prepared for that responsibility. If you are interested in becoming an Incident Commander, take a look at the [Incident Commander Training page](../training/incident_commander.md).
diff --git a/mkdocs.yml b/mkdocs.yml
index 237b30e..42b4100 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -45,8 +45,15 @@ nav:
     - Postmortem Process: 'after/post_mortem_process.md'
     - Postmortem Template: 'after/post_mortem_template.md'
     - Effective Postmortems: 'after/effective_post_mortems.md'
+  - Crisis Response:
+    - Crisis Response Introduction: 'crisis/crisis_intro.md'
+    - Terminology: 'crisis/terms.md'
+    - Crisis Leadership: 'crisis/leadership.md'
+    - Crisis Response Operations: 'crisis/operations.md'
+    - Pre-crisis Phase: 'crisis/prework.md'
+    - PagerDuty for CRMOps: 'crisis/pagerduty.md'
   - Training:
-    - Overview: 'training/overview.md'
+    - Training Overview: 'training/overview.md'
     - Incident Commander: 'training/incident_commander.md'
     - Deputy: 'training/deputy.md'
     - Scribe: 'training/scribe.md'