Initial import of PagerDuty IR docs for open-sourcing.

PagerDuty · Nov 28, 2016 · 5916e56
commit 5916e56
Showing 53 changed files with 2,256 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+site/
diff --git a/.travis.yml b/.travis.yml
@@ -0,0 +1,24 @@
+language: python
+python:
+  - 2.7
+cache: pip
+install:
+  - pip install awscli
+  - pip install mkdocs
+  - pip install mkdocs-material
+script:
+  - mkdocs build --clean
+deploy:
+  on:
+    branch: master
+    repo: 'PagerDuty/incident-response-docs'
+  provider: s3
+  access_key_id: $AWS_ACCESS_KEY_ID
+  secret_access_key: $AWS_SECRET_ACCESS_KEY
+  bucket: $AWS_S3_BUCKET
+  skip_cleanup: true
+  local_dir: site
+  acl: public_read
+after_deploy:
+  # Delete any old files from the S3 bucket.
+  - aws s3 sync site/ s3://$AWS_S3_BUCKET --acl public-read --exclude "*.py*" --delete
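Review note on the `after_deploy` step: the `--exclude "*.py*"` filter is easy to misread, so here is a rough illustration of which paths such a glob would catch. The sketch uses Python's `fnmatch` module as an approximation of the matching; that equivalence is an assumption, not the AWS CLI's own code, and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Pattern copied from the `aws s3 sync` command in .travis.yml.
EXCLUDE_PATTERN = "*.py*"


def is_excluded(relative_path: str, pattern: str = EXCLUDE_PATTERN) -> bool:
    """Approximate the exclude filter with a case-sensitive glob match.

    Assumption: fnmatch-style globbing is close enough to the CLI's
    behaviour for a simple pattern like this one.
    """
    return fnmatchcase(relative_path, pattern)


# Hypothetical paths relative to the site/ directory, for illustration only.
sample_paths = [
    "index.html",
    "search/main.py",
    "search/main.pyc",
    "after/post_mortem_process/index.html",
]

for path in sample_paths:
    action = "skip" if is_excluded(path) else "upload"
    print(f"{action:6} {path}")
```

Under these assumptions only the `.py` and `.pyc` paths are skipped. Note that the pattern would also catch any file whose name merely contains `.py`, which may or may not be intended.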
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,13 @@
+Copyright 2016 PagerDuty, Inc.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
diff --git a/README.md b/README.md
@@ -0,0 +1,36 @@
+# PagerDuty Incident Response Documentation [![Build Status](https://travis-ci.com/PagerDuty/incident-response-docs.svg?token=zdc1SxQUyY3TG9TLD3Xz&branch=master)](https://travis-ci.com/PagerDuty/incident-response-docs)
+This is a public version of the Incident Response process used at PagerDuty. It is also used to prepare new employees for on-call responsibilities, and provides information not only on preparing for an incident, but also what to do during and after. See the [about page](docs/about.md) for more information on what this documentation is and why it exists.
+
+You can view the documentation [directly](/docs/index.md) in this repository, or rendered as a website at https://response.pagerduty.com.
+
+[![PagerDuty Incident Response Documentation](screenshot.png)](https://response.pagerduty.com)
+
+## Development
+We use [MkDocs](http://www.mkdocs.org/) to create a static site from this repository. For local development,
+
+1. [Install MkDocs](http://www.mkdocs.org/#installation). `pip install mkdocs`
+1. Install the [MkDocs Material theme](https://github.com/squidfunk/mkdocs-material). `pip install mkdocs-material`
+1. To test locally, run `mkdocs serve` from the project directory.
+
+## Deploying
+1. Run `mkdocs build --clean` to produce the static site for upload.
+1. Upload the `site` directory to S3 (or wherever you would like it to be hosted).
+
+        aws s3 sync ./site/ s3://[BUCKET_NAME] \
+          --acl public-read \
+          --exclude "*.py*" \
+          --delete
+
+## License
+[Apache 2](http://www.apache.org/licenses/LICENSE-2.0) (See [LICENSE](LICENSE) file)
+
+## Contributing
+Thank you for considering contributing! If you have any questions, just ask - or submit your issue or pull request anyway. The worst that can happen is we'll politely ask you to change something. We appreciate all friendly contributions.
+
+Here is our preferred process for submitting a pull request,
+
+1. Fork it ( https://github.com/PagerDuty/incident-response-docs/fork )
+1. Create your feature branch (`git checkout -b my-new-feature`)
+1. Commit your changes (`git commit -am 'Add some feature'`)
+1. Push to the branch (`git push origin my-new-feature`)
+1. Create a new Pull Request.
diff --git a/docs/about.md b/docs/about.md
@@ -0,0 +1,29 @@
+This site documents parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after.
+
+Few companies seem to talk about their internal processes for dealing with major incidents. We would like to change that by opening up our documentation to the community, in the hopes that it proves useful to others who may want to formalize their own processes. Additionally, it provides an opportunity for others to suggest improvements, which ends up helping everyone.
+
+## What is this?
+
+A collection of pages detailing how to efficiently deal with any major incidents that might arise, along with information on how to go on-call effectively. It provides lessons learned the hard way, along with training material for getting you up to speed quickly.
+
+## Who is this for?
+
+It is intended for on-call practitioners and those involved in an operational incident response process, or those wishing to enact a formal incident response process.
+
+## Why do I need it?
+
+Incident response is something you hope to never need, but when you do, you want it to go smoothly and seamlessly. Normally the knowledge of how to handle incidents within your company will be built up over time, getting better with each incident. While tools such as [PagerDuty's Major Incidents Application](https://www.pagerduty.com/applications/#major-incidents-application) can help you recover quickly, the process you follow is just as important. This documentation lets you learn from the start what has taken us years to build up, giving you a head start on dealing with major incidents in a way that leads to the fastest possible recovery time.
+
+## What is covered?
+
+It covers everything from preparing to [go on-call](/oncall/being_oncall.md), definitions of [severities](/before/severity_levels.md), and incident [call etiquette](/before/call_etiquette.md), all the way to how to run a [post-mortem](/after/post_mortem_process.md), including our [post-mortem template](/after/post_mortem_template.md). We even include our [security incident response process](/during/security_incident_response.md).
+
+## What is missing?
+
+This isn't an exact clone of our internal documentation; some information has been removed, such as our phone bridge numbers, the names of internal tools and systems which are not (yet) open sourced, and images of our dashboards. Basically, anything that is specific to PagerDuty or is too sensitive to share has been left out.
+
+## License
+
+This documentation is provided under the Apache License 2.0. In plain English, that means you can use and modify this documentation, both commercially and privately. However, you must include any original copyright notices and the original LICENSE file.
+
+Whether you are a PagerDuty customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account. Feel free to fork the repository and use it as a base for your own internal documentation.
diff --git a/docs/after/post_mortem_process.md b/docs/after/post_mortem_process.md
@@ -0,0 +1,91 @@
+For every major incident (SEV-2/1), we need to follow up with a post-mortem: a blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring again. The post-mortem should also cover the incident response process itself.
+
+![Post-Mortem](../assets/img/headers/pagerduty_post_mortem.jpg)
+
+## Owner Designation
+The first step is to designate a post-mortem owner. This is done by the IC either at the end of a major incident call, or very shortly after. You will be notified directly by the IC if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use Slack for coordinating followup. A detailed list of the steps is available below.
+
+## Owner Responsibilities
+As owner of a post-mortem, you are responsible for the following,
+
+* Scheduling the post-mortem meeting (on the shared calendar) and inviting the relevant people (this should be scheduled within 5 business days of the incident).
+* Updating the page with all of the necessary content.
+* Investigating the incident, pulling in whomever you need from other teams to assist in the investigation.
+* Creating follow-up JIRA tickets (_You are only responsible for creating the tickets, not following them up to resolution_).
+* Running the post-mortem meeting (_these generally run themselves, but you should get people back on topic if the conversation starts to wander_).
+* In cases where we need a public blog post, creating & reviewing it with appropriate parties.
+
+## Post-Mortem Wiki Page
+Once you've been designated as the owner of a post-mortem, you should start updating the page with all the relevant information.
+
+1. (If not already done by the IC) Create a new post-mortem page for the incident.
+
+1. Schedule a post-mortem meeting for within 5 business days of the incident. You should schedule this before filling in the page, just so it's on the calendar.
+    * Create the meeting on the "Incident Post-Mortem Meetings" shared calendar.
+
+1. Begin populating the page with all of the information you have.
+    * The timeline should be the main focus to begin with.
+        * The timeline should include important changes in status/impact, and also key actions taken by responders.
+        * You should mark the start of the incident in red, and the resolution in green (for when we went into and out of SEV).
+    * Go through the history in Slack to identify the responders, and add them to the page.
+        * Identify the Incident Commander and Scribe in this list.
+
+1. Populate the page with more detailed information.
+    * For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a Tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline.
+
+1. Perform an analysis of the incident.
+    * Capture all available data regarding the incident. What caused it, how many customers were affected, etc.
+    * Any commands or queries you use to look up data should be posted in the page so others can see how the data was gathered.
+    * Capture the impact to customers (generally in terms of event submission, delayed processing, and slow notification delivery).
+    * Identify the underlying cause of the incident (What happened, and why did it happen).
+
+1. Create any followup action JIRA tickets (or note down topics for discussion if we need to decide on a direction to go before creating tickets).
+    * Go through the history in Slack to identify any TODO items.
+    * Label all tickets with their severity level and date tags.
+    * Identify any actions which can reduce re-occurrence of the incident.
+        * (There may be some trade-off here, and that's fine. Sometimes the ROI isn't worth the effort that would go into it).
+    * Identify any actions which can make our incident response process better.
+    * Be careful with creating too many tickets. Generally we only want to create tickets that are P0/P1, that is, things that absolutely should be dealt with.
+
+1. Write the external message that will be sent to customers. This will be reviewed during the post-mortem meeting before it is sent out.
+    * Avoid using the word "outage" unless it really was a full outage; use the word "incident" instead. Customers generally see "outage" and assume everything was down, when in reality it was likely just some alerts delivered outside of SLA.
+    * Look at other examples of previous post-mortems to see the kind of thing you should send.
+
+## Post-Mortem Meeting
+These meetings should generally last 15-30 minutes, and are intended to be a wrap up of the post-mortem process. We should discuss what happened, what we could've done better, and any followup actions we need to take. The goal is to suss out any disagreement on the facts, analysis, or recommended actions, and to get some wider awareness of the problems that are causing reliability issues for us.
+
+You should invite the following people to the post-mortem meeting,
+
+* Always
+    * The incident commander.
+    * Service owners involved in the incident.
+    * Key engineer(s)/responders involved in the incident.
+* Optional
+    * Customer liaison. (Only SEV-1 incidents)
+
+A general agenda for the meeting would be something like,
+
+1. Recap the timeline, to make sure everyone agrees and is on the same page.
+1. Recap important points, and any unusual items.
+1. Discuss how the problem could've been caught.
+    * Did it show up in canary?
+    * Could it have been caught in tests, or in the load test environment?
+1. Discuss customer impact. Any comments from customers, etc.
+1. Review the action items that have been created, and discuss whether they are appropriate or whether more are needed.
+
+## Examples
+Here are some examples of post-mortems from other companies as a reference,
+
+* [Stripe](https://support.stripe.com/questions/outage-postmortem-2015-10-08-utc)
+* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice.html/comment-page-2/)
+* [AWS](https://aws.amazon.com/message/5467D2/)
+* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html)
+* [Heroku](https://status.heroku.com/incidents/151)
+* [Netflix](http://techblog.netflix.com/2012/10/post-mortem-of-october-222012-aws.html)
+* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016)
+* [A List of Post-mortems!](https://github.com/danluu/post-mortems)
+
+## Useful Resources
+
+* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](http://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011)
+* [Blame. Language. Sharing.](http://fractio.nl/2015/10/30/blame-language-sharing/)
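Review note on `docs/after/post_mortem_process.md`: the process asks the owner to schedule the meeting within 5 business days of the incident. A minimal sketch of that date arithmetic follows. The function name is hypothetical, and it assumes business days are Monday to Friday, that counting starts the day after the incident, and that public holidays are ignored; none of these details are stated in the docs.

```python
from datetime import date, timedelta


def post_mortem_deadline(incident_day: date, business_days: int = 5) -> date:
    """Return the last day that is within `business_days` of the incident.

    Assumptions: business days are Monday to Friday, the count starts
    on the day after the incident, and public holidays are ignored.
    """
    current = incident_day
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# An incident on Monday 2016-11-28 gives a deadline of Monday 2016-12-05.
print(post_mortem_deadline(date(2016, 11, 28)))
```

A team with regional holidays would likely want to pass in a holiday calendar as well; that is left out here to keep the sketch short.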
diff --git a/docs/after/post_mortem_template.md b/docs/after/post_mortem_template.md
@@ -0,0 +1,79 @@
+This is a standard template we use for post-mortems at PagerDuty. Each section describes the type of information you will want to put in that section.
+
+---
+
+!!! note "Guidelines"
+    This page is intended to be reviewed during a post-mortem meeting that should be scheduled within 5 business days of any event.
+    Your first step should be to schedule the post-mortem meeting in the shared calendar for within 5 business days after the incident.
+    Don't wait until you've filled in the info to schedule the meeting; however, make sure the page is completed by the meeting.
+
+**Post-Mortem Owner:** _Your name goes here._
+
+**Meeting Scheduled For:** _Schedule the meeting on the "Incident Post-Mortem Meetings" shared calendar, for within 5 business days after the incident. Put the date/time here._
+
+**Call Recording:** _Link to the incident call recording._
+
+## Overview
+_Include a **short** sentence or two summarizing the root cause, the timeline, and the impact. E.g. "On the morning of August 99th, we suffered a 1 minute SEV-1 due to a runaway process on our primary database machine. This slowness caused roughly 0.024% of alerts that had begun during this time to be delivered out of SLA."_
+
+## What Happened
+_Include a short description of what happened._
+
+## Root Cause
+_Include a description of the root cause. If there were any actions taken that exacerbated the issue, also include them here with the intention of learning from any mistakes made during the resolution process._
+
+## Resolution
+_Include a description of what solved the problem. If there was a temporary fix in place, describe that along with the long-term solution._
+
+## Impact
+_Be very specific here, include exact numbers._
+
+| Time in SEV-1 | ?mins |
+| Time in SEV-2 | ?mins |
+| Notifications Delivered out of SLA | ??% (?? of ??) |
+| Events Dropped / Not Accepted | ??% (?? of ??) _Should usually be 0, but always check_ |
+| Accounts Affected | ?? |
+| Users Affected | ?? |
+| Support Requests Raised | ?? _Include any relevant links to tickets_ |
+
+## Responders
+
+* _Who was the IC?_
+* _Who was the scribe?_
+* _Who else was involved?_
+* _Who else was involved?_
+
+## Timeline
+_Some important times to include: (1) time the root cause began, (2) time of the page, (3) time that the status page was updated (i.e. when the incident became public), (4) time of any significant actions, (5) time the SEV-2/1 ended, (6) links to tools/logs that show how the timestamp was arrived at._
+
+| Time (UTC) | Event | Data Link |
+| ---------- | ----- | --------- |
+
+## How'd We Do?
+
+### What Went Well?
+
+* _List anything you did well and want to call out. It's OK to not list anything._
+
+### What Didn't Go So Well?
+
+* _List anything you think we didn't do very well. The intent is that we should follow up on all points here to improve our processes._
+
+## Action Items
+_Each action item should be in the form of a JIRA ticket, and each ticket should have the same set of two tags:  “sev1_YYYYMMDD” (such as sev1_20150911) and simply “sev1”. Include action items such as: (1) any fixes required to prevent the root cause in the future, (2) any preparedness tasks that could help mitigate the problem if it came up again, (3) remaining post-mortem steps, such as the internal email, as well as the status-page public post, (4) any improvements to our incident response process._
+
+## Messaging
+
+### Internal Email
+_This is a follow-up for employees. It should be sent out right after the post-mortem meeting is over. It only needs a short paragraph summarizing the incident and a link to this wiki page._
+
+> Briefly summarize what happened and where the post-mortem page (this page) can be found.
+
+### External Message
+_This is what will be included on the status.pagerduty.com website regarding this incident. What are we telling customers, including an apology? (The apology should be genuine, not rote.)_
+
+> Summary
+
+> What Happened?
+
+> What Are We Doing About This?
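Review note on `docs/after/post_mortem_template.md`: two parts of the template are mechanical enough to script, namely the pair of action-item tags ("sev1_YYYYMMDD" plus "sev1") and the "??% (?? of ??)" cells in the Impact table. The sketch below shows one way to produce both. The function names are hypothetical, the assumption that SEV-2 tickets follow the same tag pattern is not stated in the template, and the choice of three decimal places is a guess based on the "0.024%" example in the Overview.

```python
from datetime import date
from typing import List


def action_item_tags(severity: int, incident_day: date) -> List[str]:
    """Return the two tags for an action-item ticket.

    Assumption: SEV-2 tickets follow the same pattern as the SEV-1
    example in the template ("sev1_YYYYMMDD" plus "sev1").
    """
    base = f"sev{severity}"
    return [f"{base}_{incident_day:%Y%m%d}", base]


def format_impact(affected: int, total: int) -> str:
    """Render an Impact cell in the '??% (?? of ??)' shape."""
    if total <= 0:
        raise ValueError("total must be positive")
    percent = 100 * affected / total
    return f"{percent:.3f}% ({affected} of {total})"


# Matches the example tag given in the template.
print(action_item_tags(1, date(2015, 9, 11)))
# Made-up counts, for illustration only.
print(format_impact(24, 100000))
```

Generating the tags from the incident date keeps every ticket for one incident searchable under the same label, which is the stated purpose of using the same two tags on each ticket.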