diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..45ddf0a
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1 @@
+site/
diff --git a/.travis.yml b/.travis.yml
new file mode 100644
index 0000000..070f4ef
--- /dev/null
+++ b/.travis.yml
@@ -0,0 +1,24 @@
+language: python
+python:
+ - 2.7
+cache: pip
+install:
+ - pip install awscli
+ - pip install mkdocs
+ - pip install mkdocs-material
+script:
+ - mkdocs build --clean
+deploy:
+  on:
+    branch: master
+    repo: 'PagerDuty/incident-response-docs'
+  provider: s3
+  access_key_id: $AWS_ACCESS_KEY_ID
+  secret_access_key: $AWS_SECRET_ACCESS_KEY
+  bucket: $AWS_S3_BUCKET
+  skip_cleanup: true
+  local_dir: site
+  acl: public_read
+after_deploy:
+ # Delete any old files from the S3 bucket.
+ - aws s3 sync site/ s3://$AWS_S3_BUCKET --acl public-read --exclude "*.py*" --delete
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..95efe6e
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,13 @@
+Copyright 2016 PagerDuty, Inc.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..a85eea0
--- /dev/null
+++ b/README.md
@@ -0,0 +1,36 @@
+# PagerDuty Incident Response Documentation [Build Status](https://travis-ci.com/PagerDuty/incident-response-docs)
+This is a public version of the Incident Response process used at PagerDuty. It is also used to prepare new employees for on-call responsibilities, and provides information not only on preparing for an incident, but also what to do during and after. See the [about page](docs/about.md) for more information on what this documentation is and why it exists.
+
+You can view the documentation [directly](/docs/index.md) in this repository, or rendered as a website at https://response.pagerduty.com.
+
+[PagerDuty Incident Response Documentation](https://response.pagerduty.com)
+
+## Development
+We use [MkDocs](http://www.mkdocs.org/) to create a static site from this repository. For local development,
+
+1. [Install MkDocs](http://www.mkdocs.org/#installation). `pip install mkdocs`
+1. Install the [MkDocs Material theme](https://github.com/squidfunk/mkdocs-material). `pip install mkdocs-material`
+1. To test locally, run `mkdocs serve` from the project directory.
+
+## Deploying
+1. Run `mkdocs build --clean` to produce the static site for upload.
+1. Upload the `site` directory to S3 (or wherever you would like it to be hosted).
+
+        aws s3 sync ./site/ s3://[BUCKET_NAME] \
+          --acl public-read \
+          --exclude "*.py*" \
+          --delete
+
+## License
+[Apache 2](http://www.apache.org/licenses/LICENSE-2.0) (See [LICENSE](LICENSE) file)
+
+## Contributing
+Thank you for considering contributing! If you have any questions, just ask - or submit your issue or pull request anyway. The worst that can happen is we'll politely ask you to change something. We appreciate all friendly contributions.
+
+Here is our preferred process for submitting a pull request,
+
+1. Fork it ( https://github.com/PagerDuty/incident-response-docs/fork )
+1. Create your feature branch (`git checkout -b my-new-feature`)
+1. Commit your changes (`git commit -am 'Add some feature'`)
+1. Push to the branch (`git push origin my-new-feature`)
+1. Create a new Pull Request.
diff --git a/docs/about.md b/docs/about.md
new file mode 100644
index 0000000..fa11240
--- /dev/null
+++ b/docs/about.md
@@ -0,0 +1,29 @@
+This site documents parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after.
+
+Few companies seem to talk about their internal processes for dealing with major incidents. We would like to change that by opening up our documentation to the community, in the hopes that it proves useful to others who may want to formalize their own processes. Additionally, it provides an opportunity for others to suggest improvements, which ends up helping everyone.
+
+## What is this?
+
+A collection of pages detailing how to efficiently deal with any major incidents that might arise, along with information on how to go on-call effectively. It provides lessons learned the hard way, along with training material for getting you up to speed quickly.
+
+## Who is this for?
+
+It is intended for on-call practitioners and those involved in an operational incident response process, or those wishing to enact a formal incident response process.
+
+## Why do I need it?
+
+Incident response is something you hope to never need, but when you do, you want it to go smoothly and seamlessly. Normally the knowledge of how to handle incidents within your company will be built up over time, getting better with each incident. While tools such as [PagerDuty's Major Incidents Application](https://www.pagerduty.com/applications/#major-incidents-application) can help you recover quickly, the process you follow is just as important. This documentation will allow you to learn from the start something which has taken us years to build up, giving you a head start on how to deal with major incidents in a way which leads to the fastest possible recovery time.
+
+## What is covered?
+
+Anything from preparing to [go on-call](/oncall/being_oncall.md), definitions of [severities](/before/severity_levels.md), incident [call etiquette](/before/call_etiquette.md), all the way to how to run a [post-mortem](/after/post_mortem_process.md), and providing our [post-mortem template](/after/post_mortem_template.md). We even include our [security incident response process](/during/security_incident_response.md).
+
+## What is missing?
+
+This isn't an exact clone of our internal documentation, but instead has some information removed. Things such as our phone bridge numbers, names of internal tools and systems which are not (yet) open sourced, images of our dashboards, etc. Basically anything that is specific to PagerDuty or is too sensitive to share.
+
+## License
+
+This documentation is provided under the Apache License 2.0. In plain English that means you can use and modify this documentation and use it both commercially and for private use. However, you must include any original copyright notices, and the original LICENSE file.
+
+Whether you are a PagerDuty customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account. Feel free to fork the repository and use it as a base for your own internal documentation.
diff --git a/docs/after/post_mortem_process.md b/docs/after/post_mortem_process.md
new file mode 100644
index 0000000..bf763cd
--- /dev/null
+++ b/docs/after/post_mortem_process.md
@@ -0,0 +1,91 @@
+For every major incident (SEV-2/1), we need to follow up with a post-mortem: a blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring again in the future. The post-mortem should also cover the incident response process itself.
+
+data:image/s3,"s3://crabby-images/e09bf/e09bf59c110d11b88f290565aab7d7481eca7ea0" alt="Post-Mortem"
+
+## Owner Designation
+The first step is to designate a post-mortem owner. This is done by the IC either at the end of a major incident call, or very shortly after. You will be notified directly by the IC if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use Slack for coordinating followup. A detailed list of the steps is available below.
+
+## Owner Responsibilities
+As owner of a post-mortem, you are responsible for the following,
+
+* Scheduling the post-mortem meeting (on the shared calendar) and inviting the relevant people (this should be scheduled within 5 business days of the incident).
+* Updating the page with all of the necessary content.
+* Investigating the incident, pulling in whomever you need from other teams to assist in the investigation.
+* Creating follow-up JIRA tickets (_You are only responsible for creating the tickets, not following them up to resolution_).
+* Running the post-mortem meeting (_these generally run themselves, but you should get people back on topic if the conversation starts to wander_).
+* In cases where we need a public blog post, creating & reviewing it with appropriate parties.
+
+## Post-Mortem Wiki Page
+Once you've been designated as the owner of a post-mortem, you should start updating the page with all the relevant information.
+
+1. (If not already done by the IC) Create a new post-mortem page for the incident.
+
+1. Schedule a post-mortem meeting for within 5 business days of the incident. You should schedule this before filling in the page, just so it's on the calendar.
+ * Create the meeting on the "Incident Post-Mortem Meetings" shared calendar.
+
+1. Begin populating the page with all of the information you have.
+ * The timeline should be the main focus to begin with.
+ * The timeline should include important changes in status/impact, and also key actions taken by responders.
+ * You should mark the start of the incident in red, and the resolution in green (for when we went into/out of SEV).
+ * Go through the history in Slack to identify the responders, and add them to the page.
+ * Identify the Incident Commander and Scribe in this list.
+
+1. Populate the page with more detailed information.
+ * For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a Tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline.
+
+1. Perform an analysis of the incident.
+ * Capture all available data regarding the incident. What caused it, how many customers were affected, etc.
+ * Any commands or queries you use to look up data should be posted in the page so others can see how the data was gathered.
+ * Capture the impact to customers (generally in terms of event submission, delayed processing, and slow notification delivery)
+ * Identify the underlying cause of the incident (What happened, and why did it happen).
+
+1. Create any followup action JIRA tickets (or note down topics for discussion if we need to decide on a direction to go before creating tickets),
+ * Go through the history in Slack to identify any TODO items.
+ * Label all tickets with their severity level and date tags.
+ * Any actions which can reduce re-occurrence of the incident.
+ * (There may be some trade-off here, and that's fine. Sometimes the ROI isn't worth the effort that would go into it).
+ * Identify any actions which can make our incident response process better.
+ * Be careful with creating too many tickets. Generally we only want to create things that are P0/P1's. Things that absolutely should be dealt with.
+
+1. Write the external message that will be sent to customers. This will be reviewed during the post-mortem meeting before it is sent out.
+ * Avoid using the word "outage" unless it really was a full outage, use the word "incident" instead. Customers generally see "outage" and assume everything was down, when in reality it was likely just some alerts delivered outside of SLA.
+ * Look at other examples of previous post-mortems to see the kind of thing you should send.
+
+## Post-Mortem Meeting
+These meetings should generally last 15-30 minutes, and are intended to be a wrap up of the post-mortem process. We should discuss what happened, what we could've done better, and any followup actions we need to take. The goal is to suss out any disagreement on the facts, analysis, or recommended actions, and to get some wider awareness of the problems that are causing reliability issues for us.
+
+You should invite the following people to the post-mortem meeting,
+
+* Always
+ * The incident commander.
+ * Service owners involved in the incident.
+ * Key engineer(s)/responders involved in the incident.
+* Optional
+ * Customer liaison. (Only SEV-1 incidents)
+
+A general agenda for the meeting would be something like,
+
+1. Recap the timeline, to make sure everyone agrees and is on the same page.
+1. Recap important points, and any unusual items.
+1. Discuss how the problem could've been caught.
+ * Did it show up in canary?
+ * Could it have been caught in tests, or loadtest environment?
+1. Discuss customer impact. Any comments from customers, etc.
+1. Review action items that have been created, discuss if appropriate, or if more are needed, etc.
+
+## Examples
+Here are some examples of post-mortems from other companies as a reference,
+
+* [Stripe](https://support.stripe.com/questions/outage-postmortem-2015-10-08-utc)
+* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice.html/comment-page-2/)
+* [AWS](https://aws.amazon.com/message/5467D2/)
+* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html)
+* [Heroku](https://status.heroku.com/incidents/151)
+* [Netflix](http://techblog.netflix.com/2012/10/post-mortem-of-october-222012-aws.html)
+* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016)
+* [A List of Post-mortems!](https://github.com/danluu/post-mortems)
+
+## Useful Resources
+
+* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](http://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011)
+* [Blame. Language. Sharing.](http://fractio.nl/2015/10/30/blame-language-sharing/)
diff --git a/docs/after/post_mortem_template.md b/docs/after/post_mortem_template.md
new file mode 100644
index 0000000..781e410
--- /dev/null
+++ b/docs/after/post_mortem_template.md
@@ -0,0 +1,79 @@
+This is a standard template we use for post-mortems at PagerDuty. Each section describes the type of information you will want to put in that section.
+
+---
+
+!!! note "Guidelines"
+    This page is intended to be reviewed during a post-mortem meeting that should be scheduled within 5 business days of any event.
+    Your first step should be to schedule the post-mortem meeting in the shared calendar for within 5 business days after the incident.
+    Don't wait until you've filled in the info to schedule the meeting. However, make sure the page is completed by the meeting.
+
+** Post-Mortem Owner:** _Your name goes here._
+
+** Meeting Scheduled For:** _Schedule the meeting on the "Incident Post-Mortem Meetings" shared calendar, for within 5 business days after the incident. Put the date/time here._
+
+** Call Recording:** _Link to the incident call recording._
+
+## Overview
+_Include a **short** sentence or two summarizing the root cause, timeline summary, and the impact. E.g. "On the morning of August 99th, we suffered a 1 minute SEV-1 due to a runaway process on our primary database machine. This slowness caused roughly 0.024% of alerts that had begun during this time to be delivered out of SLA."_
+
+## What Happened
+_Include a short description of what happened._
+
+## Root Cause
+_Include a description of the root cause. If there were any actions taken that exacerbated the issue, also include them here with the intention of learning from any mistakes made during the resolution process._
+
+## Resolution
+_Include a description of what solved the problem. If there was a temporary fix in place, describe that along with the long-term solution._
+
+## Impact
+_Be very specific here, include exact numbers._
+
+| Time in SEV-1 | ?mins |
+| Time in SEV-2 | ?mins |
+| Notifications Delivered out of SLA | ??% (?? of ??) |
+| Events Dropped / Not Accepted | ??% (?? of ??) _Should usually be 0, but always check_ |
+| Accounts Affected | ?? |
+| Users Affected | ?? |
+| Support Requests Raised | ?? _Include any relevant links to tickets_ |
+
+## Responders
+
+* _Who was the IC?_
+* _Who was the scribe?_
+* _Who else was involved?_
+* _Who else was involved?_
+
+## Timeline
+_Some important times to include: (1) time the root cause began, (2) time of the page, (3) time that the status page was updated (i.e. when the incident became public), (4) time of any significant actions, (5) time the SEV-2/1 ended, (6) links to tools/logs that show how the timestamp was arrived at._
+
+| Time (UTC) | Event | Data Link |
+| ---------- | ----- | --------- |
+
+## How'd We Do?
+
+### What Went Well?
+
+* _List anything you did well and want to call out. It's OK to not list anything._
+
+### What Didn't Go So Well?
+
+* _List anything you think we didn't do very well. The intent is that we should follow up on all points here to improve our processes._
+
+## Action Items
+_Each action item should be in the form of a JIRA ticket, and each ticket should have the same set of two tags: “sev1_YYYYMMDD” (such as sev1_20150911) and simply “sev1”. Include action items such as: (1) any fixes required to prevent the root cause in the future, (2) any preparedness tasks that could help mitigate the problem if it came up again, (3) remaining post-mortem steps, such as the internal email, as well as the status-page public post, (4) any improvements to our incident response process._
+
+## Messaging
+
+### Internal Email
+_This is a follow-up for employees. It should be sent out right after the post-mortem meeting is over. It only needs a short paragraph summarizing the incident and a link to this wiki page._
+
+> Briefly summarize what happened and where the post-mortem page (this page) can be found.
+
+### External Message
+_This is what will be included on the status.pagerduty.com website regarding this incident. What are we telling customers, including an apology? (The apology should be genuine, not rote.)_
+
+> Summary
+
+> What Happened?
+
+> What Are We Doing About This?
diff --git a/docs/assets/css/extra.css b/docs/assets/css/extra.css
new file mode 100644
index 0000000..01e30dc
--- /dev/null
+++ b/docs/assets/css/extra.css
@@ -0,0 +1,399 @@
+/* Colfax Font */
+@font-face {
+ font-family: 'Colfax Regular';
+ font-style: normal;
+ font-weight: 400;
+ src: local('ColfaxRegular'), url(https://www.pagerduty.com/wp-content/themes/startit-child/fonts/ColfaxWebRegular.woff) format('woff');
+}
+
+@font-face {
+ font-family: 'Colfax Light';
+ font-style: normal;
+ font-weight: 100;
+ src: local('ColfaxRegular'), url(https://www.pagerduty.com/wp-content/themes/startit-child/fonts/ColfaxWebLight.woff) format('woff');
+}
+
+/* Defaults */
+body {
+ font-weight: 500;
+ -webkit-font-smoothing: antialiased;
+}
+
+/* Change the colour theme to better match PagerDuty */
+
+/* background: pd-green */
+.repo a {
+ background: #25c151;
+}
+
+@media only screen and (max-width: 959px) {
+ .palette-primary-green .project {
+ background: #25c151;
+ }
+}
+
+/* background: pd-navy */
+.palette-primary-green,
+.palette-primary-green .footer,
+.palette-primary-green .header,
+.palette-primary-green .results .meta,
+.palette-primary-green .article table th {
+ background: #1f293a;
+}
+
+.palette-primary-green .article table th {
+ background: #555;
+}
+
+/* font: pd-green */
+.palette-primary-green .article h1,
+.palette-primary-green .article h2,
+.palette-primary-green .drawer .toc a.current,
+.palette-primary-green .drawer .toc a:focus,
+.palette-primary-green .drawer .toc a:hover,
+.palette-primary-green .article a:hover {
+ color: #25c151;
+}
+
+/* font: pd-navy */
+.palette-primary-green .article a,
+.palette-primary-green .article code,
+.palette-primary-green .article h1,
+.palette-primary-green .article h2 {
+ color: #1f293a;
+}
+
+/* Selected nav section */
+.palette-primary-green .drawer .anchor a {
+ border-left: 3px solid #25c151;
+}
+
+/* Hide the page title, already in the navbar */
+.article h1 {
+ display: none;
+}
+
+/* But show it when printing */
+@media print {
+ .article h1 {
+ display: block;
+ padding-top: 0em;
+ padding-bottom: 0.1em;
+ margin-top: 0em;
+ margin-bottom: 0em;
+ border-bottom: none;
+ }
+
+ /* Also add a heading when printing */
+ .article h1:before {
+ background: url(/assets/img/logo.png) 0em -0.07em no-repeat;
+ background-size: 7em;
+ display: block;
+ height: 2em;
+ width: 100%;
+ padding-left: 7.2em;
+ content: 'Incident Response';
+ border-bottom: 1px solid #ddd;
+ margin-bottom: 0.6em;
+ }
+}
+
+
+/* Want the font to be bigger for articles, easier reading. */
+.article {
+ font-size: 1.45em;
+}
+
+/* Too much whitespace at the top, not enough at bottom */
+.article .wrapper {
+ padding: 56px 16px 132px !important;
+}
+
+@media only screen and (min-width: 720px) {
+ .article .wrapper {
+ padding: 70px 24px 126px !important;
+ }
+}
+
+/* Get rid of the whitespace when printing, let people set own margins. */
+@media print {
+ .article .wrapper {
+ padding: 0em !important;
+ }
+}
+
+ul, ol {
+ padding-left: 1em;
+}
+
+/* Expanding border menu */
+.drawer .toc li a {
+ overflow: hidden;
+ position: relative;
+}
+
+.drawer .toc li a:before {
+ display: block;
+ content: '';
+ position: absolute;
+ height: 2em;
+ left: 0px;
+ top: 0.5em;
+ border-left: 5px solid #25c151;
+ transform: scaleY(0);
+ transition: transform 250ms ease-in-out;
+}
+
+.drawer .toc li a:hover:before {
+ transform: scaleY(1);
+}
+
+/* Don't do it on active menu items */
+.drawer .toc a.current:hover:before,
+.drawer .toc li.anchor a:hover:before {
+ transform: scaleY(0);
+ display: none;
+}
+
+/* Don't overflow horizontally on nav */
+.drawer .toc ul li a {
+ white-space: nowrap;
+ text-overflow: ellipsis;
+}
+
+/* Change the title bar to include the PD logo */
+nav div.mainlogo {
+ width: 15em;
+ display: table-cell;
+}
+
+nav div.mainlogo a {
+ min-height: 3.5em;
+ margin-bottom: -1.25em;
+ width: 14.5em;
+
+ background: url(/assets/img/logo.png) 0em 0.1em no-repeat;
+ background-size: contain;
+}
+
+nav div.mainlogo img {
+ display: none;
+}
+
+/* Admonition */
+.admonition {
+ background: #25c151;
+}
+.admonition.info {
+ background: #f5a623;
+}
+
+@media print {
+ .admonition {
+ padding: 1em 2em !important;
+ }
+}
+
+/* Typography */
+h4 {
+ font-weight: bold;
+ text-decoration: underline;
+}
+
+.project .logo+.name {
+ font-size: 13px;
+}
+
+span.bad {
+ color: #f00;
+}
+
+span.good {
+ color: #008800;
+}
+
+span.code,
+code {
+ font-family: monospace;
+ color: #00f !important;
+ border-radius: 2px;
+ padding: 0.1em;
+ border: 1px solid #eee;
+ background: #f4f4f4;
+}
+
+/* Icons */
+.button .icon:hover {
+ transition: color 250ms ease-in-out;
+ color: #25c151;
+}
+
+/* Images */
+.article .wrapper {
+ overflow: hidden;
+}
+
+/* Center all images */
+.article img {
+ display: block;
+ margin: 0 auto;
+}
+
+/* Header images */
+.article h1 + p + p img {
+ max-width: 110%;
+ margin-left: -2em;
+}
+
+/* Image Captions */
+img + em {
+ position: relative;
+ font-size: 0.8em;
+ margin-right: -2.3em;
+ padding: 0em 1em;
+ float: right;
+ margin-top: -2.1em;
+ color: #000;
+ border-top-left-radius: 3px;
+ background: rgba(255, 255, 255, 0.7);
+}
+
+/* Fixes for smaller screen sizes */
+@media only screen and (max-width: 720px) {
+ .article h1 + p + p img {
+ max-width: 120%;
+ }
+
+ .article h1 + p + p img + em {
+ margin-right: -1.4em;
+ margin-top: -2em;
+ }
+}
+
+/* Hack to hide the header images when printing. */
+@media print {
+ .article h1 + p + p img {
+ display: none;
+ }
+
+ .article h1 + p + p img + em {
+ display: none;
+ }
+}
+
+/* Quotes */
+.article blockquote {
+ border-left: 3px solid #555;
+ background: #f9f9f9;
+ padding: 1em;
+ padding-left: 16px;
+ margin-top: 1em;
+ color: #333;
+ font-style: italic;
+}
+
+.article blockquote p {
+ margin: 0em;
+ padding: 0.5em 0em;
+}
+
+/* Horizontal Rules */
+.article hr {
+ margin-top: 2em;
+ border-top: 2px solid #f4f4f4;
+}
+
+/* Don't care about copyright notice for this project, Apache License. */
+aside.copyright {
+ display: none;
+}
+
+/* Custom tables */
+table.custom-table td ul {
+ margin-top: -0.8em;
+ padding-top: 0px;
+ padding-left: 0px;
+}
+
+table.custom-table td.warning {
+ font-weight: bold;
+ text-align: center;
+ color: #f00;
+ background: #f4f4f4;
+}
+
+table.custom-table td.sev-1 {
+ background: #ffe7e7;
+ color: #f00;
+ font-weight: bold;
+}
+
+table.custom-table td.sev-2 {
+ background: #ffd;
+ color: rgb(255,153,0);
+ font-weight: bold;
+}
+
+table.custom-table td.sev-3 {
+ background: #e0f0ff;
+ color: rgb(51,102,255);
+ font-weight: bold;
+}
+
+table.custom-table td.sev-4 {
+ background: #f0f0f0;
+ color: rgb(128,128,128);
+ font-weight: bold;
+}
+
+table.custom-table td.sev-5 {
+ background: #ddfade;
+ color: rgb(0,128,0);
+ font-weight: bold;
+}
+
+table.custom-table td.centered {
+ text-align: center;
+}
+
+/* Embeds */
+iframe {
+ display: block;
+ margin: 0 auto;
+ margin-top: 1em;
+}
+
+/* Contact summary table */
+#contact-summary {
+ margin-bottom: -2em;
+ background: #fff;
+ color: #000;
+}
+
+/* Super horrible hack to get the training PDF images correct */
+#national-incident-management-system-nims ~ p img {
+ display: inline;
+}
+#national-incident-management-system-nims ~ p:nth-of-type(6) {
+ text-align: center;
+}
+
+/* 404 Page */
+#error {
+ text-align: center;
+ padding: 0em 5em;
+}
+
+#error h1 {
+ display: block;
+ font-size: 2.5em;
+ padding-bottom: 1em;
+ margin-bottom: 1em;
+ margin-top: 1em;
+ border-bottom: 1px solid #eee;
+}
+
+#error p {
+ font-style: italic;
+ color: #555;
+}
diff --git a/docs/assets/img/cover.png b/docs/assets/img/cover.png
new file mode 100644
index 0000000..6325795
Binary files /dev/null and b/docs/assets/img/cover.png differ
diff --git a/docs/assets/img/headers/gene_kranz.jpg b/docs/assets/img/headers/gene_kranz.jpg
new file mode 100644
index 0000000..773fb29
Binary files /dev/null and b/docs/assets/img/headers/gene_kranz.jpg differ
diff --git a/docs/assets/img/headers/incident_command_support.jpg b/docs/assets/img/headers/incident_command_support.jpg
new file mode 100644
index 0000000..eed6180
Binary files /dev/null and b/docs/assets/img/headers/incident_command_support.jpg differ
diff --git a/docs/assets/img/headers/incident_response.jpg b/docs/assets/img/headers/incident_response.jpg
new file mode 100644
index 0000000..e45fc8a
Binary files /dev/null and b/docs/assets/img/headers/incident_response.jpg differ
diff --git a/docs/assets/img/headers/obama_phone.jpg b/docs/assets/img/headers/obama_phone.jpg
new file mode 100644
index 0000000..79b4a77
Binary files /dev/null and b/docs/assets/img/headers/obama_phone.jpg differ
diff --git a/docs/assets/img/headers/pagerduty_ir.jpg b/docs/assets/img/headers/pagerduty_ir.jpg
new file mode 100644
index 0000000..00b6114
Binary files /dev/null and b/docs/assets/img/headers/pagerduty_ir.jpg differ
diff --git a/docs/assets/img/headers/pagerduty_post_mortem.jpg b/docs/assets/img/headers/pagerduty_post_mortem.jpg
new file mode 100644
index 0000000..7561025
Binary files /dev/null and b/docs/assets/img/headers/pagerduty_post_mortem.jpg differ
diff --git a/docs/assets/img/headers/typewriter.jpg b/docs/assets/img/headers/typewriter.jpg
new file mode 100644
index 0000000..37dfbbc
Binary files /dev/null and b/docs/assets/img/headers/typewriter.jpg differ
diff --git a/docs/assets/img/icon.png b/docs/assets/img/icon.png
new file mode 100644
index 0000000..13c222a
Binary files /dev/null and b/docs/assets/img/icon.png differ
diff --git a/docs/assets/img/logo.png b/docs/assets/img/logo.png
new file mode 100644
index 0000000..a679d2a
Binary files /dev/null and b/docs/assets/img/logo.png differ
diff --git a/docs/assets/img/misc/ack.png b/docs/assets/img/misc/ack.png
new file mode 100644
index 0000000..e83bd50
Binary files /dev/null and b/docs/assets/img/misc/ack.png differ
diff --git a/docs/assets/img/misc/alert_fatigue.png b/docs/assets/img/misc/alert_fatigue.png
new file mode 100644
index 0000000..6a3c823
Binary files /dev/null and b/docs/assets/img/misc/alert_fatigue.png differ
diff --git a/docs/assets/img/misc/communicate.png b/docs/assets/img/misc/communicate.png
new file mode 100644
index 0000000..c708c51
Binary files /dev/null and b/docs/assets/img/misc/communicate.png differ
diff --git a/docs/assets/img/misc/escalation.png b/docs/assets/img/misc/escalation.png
new file mode 100644
index 0000000..c8595be
Binary files /dev/null and b/docs/assets/img/misc/escalation.png differ
diff --git a/docs/assets/img/misc/incident_response_roles.png b/docs/assets/img/misc/incident_response_roles.png
new file mode 100644
index 0000000..a250b42
Binary files /dev/null and b/docs/assets/img/misc/incident_response_roles.png differ
diff --git a/docs/assets/img/misc/mobile_alerts.png b/docs/assets/img/misc/mobile_alerts.png
new file mode 100644
index 0000000..225b77c
Binary files /dev/null and b/docs/assets/img/misc/mobile_alerts.png differ
diff --git a/docs/assets/img/misc/oncall_burnout.png b/docs/assets/img/misc/oncall_burnout.png
new file mode 100644
index 0000000..fe39e67
Binary files /dev/null and b/docs/assets/img/misc/oncall_burnout.png differ
diff --git a/docs/assets/img/misc/schedule.png b/docs/assets/img/misc/schedule.png
new file mode 100644
index 0000000..b7d1f7a
Binary files /dev/null and b/docs/assets/img/misc/schedule.png differ
diff --git a/docs/assets/img/misc/triage.png b/docs/assets/img/misc/triage.png
new file mode 100644
index 0000000..223fe68
Binary files /dev/null and b/docs/assets/img/misc/triage.png differ
diff --git a/docs/assets/img/screenshots/high_business_hours.png b/docs/assets/img/screenshots/high_business_hours.png
new file mode 100644
index 0000000..fe06d21
Binary files /dev/null and b/docs/assets/img/screenshots/high_business_hours.png differ
diff --git a/docs/assets/img/screenshots/high_urgency.png b/docs/assets/img/screenshots/high_urgency.png
new file mode 100644
index 0000000..5efa9f9
Binary files /dev/null and b/docs/assets/img/screenshots/high_urgency.png differ
diff --git a/docs/assets/img/screenshots/low_urgency.png b/docs/assets/img/screenshots/low_urgency.png
new file mode 100644
index 0000000..15c54f3
Binary files /dev/null and b/docs/assets/img/screenshots/low_urgency.png differ
diff --git a/docs/assets/img/screenshots/suppressed.png b/docs/assets/img/screenshots/suppressed.png
new file mode 100644
index 0000000..dc9910b
Binary files /dev/null and b/docs/assets/img/screenshots/suppressed.png differ
diff --git a/docs/assets/img/thumbnails/nims_core.png b/docs/assets/img/thumbnails/nims_core.png
new file mode 100644
index 0000000..48bce4c
Binary files /dev/null and b/docs/assets/img/thumbnails/nims_core.png differ
diff --git a/docs/assets/img/thumbnails/nims_training.png b/docs/assets/img/thumbnails/nims_training.png
new file mode 100644
index 0000000..0025545
Binary files /dev/null and b/docs/assets/img/thumbnails/nims_training.png differ
diff --git a/docs/before/call_etiquette.md b/docs/before/call_etiquette.md
new file mode 100644
index 0000000..e376956
--- /dev/null
+++ b/docs/before/call_etiquette.md
@@ -0,0 +1,50 @@
+You've just joined an incident call, and you've never been on one before. You have no idea what's going on, or what you're supposed to be doing. This page will help you through your first time on an incident call, and will provide a reference for future calls you may be a part of.
+
+data:image/s3,"s3://crabby-images/7d8cd/7d8cdf552239b76ac23a2c2c59afdebcf19829cb" alt="Obama phone"
+*Credit: [Official White House Photo](https://commons.wikimedia.org/wiki/File:Barack_Obama_on_phone_with_Benjamin_Netanyahu_2009-06-08.jpg) by Pete Souza*
+
+## First Steps
+
+* If you intend to participate in the incident call, you should join both the call and Slack.
+* Make sure you are in a quiet environment in order to participate on the call. Background noise should be kept to a minimum.
+* Keep your microphone muted until you have something to say.
+* Identify yourself when you join the call: state your name and the system you are the expert for.
+* Speak up and speak clearly.
+* Be direct and factual.
+* Keep conversations/discussions short and to the point.
+* Bring any concerns to the Incident Commander (IC) on the call.
+* Respect time constraints given by the Incident Commander.
+
+## Lingo
+**Use clear terminology, and avoid using acronyms or abbreviations during a call. Clear and accurate communication is more important than quick communication.**
+
+data:image/s3,"s3://crabby-images/6c1ac/6c1ac532d32b890f967c13fda719afdde2338234" alt="Communication"
+
+Standard radio [voice procedure](https://en.wikipedia.org/wiki/Voice_procedure#Words_in_voice_procedure) does not need to be followed on calls. However, you should familiarize yourself with the terms, as you may hear them on a call (or need to use them yourself). The ones in more active use on major incident calls are:
+
+* **Ack/Rog** - "I have received and understood"
+* **Say Again** - "Repeat your last message"
+* **Standby** - "Please wait a moment for the next response"
+* **Wilco** - "Will comply"
+
+Do not invent new abbreviations, and always favor being explicit over implicit. It is better to make things clearer than to try to save time by abbreviating, only to have a misunderstanding because others didn't know the abbreviation.
+
+## The Commander
+The Incident Commander (IC) is the leader of the incident response process, and is responsible for bringing the incident to resolution. They will announce themselves at the start of the call, and will generally be doing most of the talking.
+
+* Follow all instructions from the incident commander, without exception.
+* Do not perform any actions unless the incident commander has told you to do so.
+* The commander will typically poll for any strong objections before performing a large action. This is your time to raise any objections if you have them.
+* Once the commander has made a decision, that decision is final and should be followed, even if you disagreed during the poll.
+* Answer any questions the commander asks you in a clear and concise way.
+ * Answering that you "don't know" something is perfectly acceptable. Do not try to guess.
+* The commander may ask you to investigate something and get back to them in X minutes. Make sure you are ready with an answer within that time.
+ * Answering that you need more time is perfectly acceptable, but you need to give the commander an estimate of how much time.
+
+## Problems?
+
+#### There's no incident commander on the call! I don't know what to do!
+Ask on the call if an IC is present. If you have no response, type `!ic page` in Slack. This will page the primary and backup IC to the call.
+
+#### I can join the call or Slack, but not both, what should I do?
+You're welcome to join only one of the channels; however, if you do, you should not actively participate in the incident response, as it causes disjointed communication. Liaise with someone who is both in Slack and on the call to provide any input you may have so that they can raise it.
diff --git a/docs/before/different_roles.md b/docs/before/different_roles.md
new file mode 100644
index 0000000..73be681
--- /dev/null
+++ b/docs/before/different_roles.md
@@ -0,0 +1,131 @@
+There are several main roles for our incident response teams at PagerDuty. Certain roles only have one person per incident (e.g. IC), whereas other roles can have multiple people (e.g. Subject Matter Expert, SME). It's all about coming together as a team, working the problem, and getting a solution quickly.
+
+Here is a rough outline of our role hierarchy, with each role discussed in more detail on the rest of this page.
+
+data:image/s3,"s3://crabby-images/21885/21885f64db2303e786dc93efaee1e527ec2d6051" alt="Incident Response Structure"
+
+---
+
+## Incident Commander (IC)
+
+### What is it?
+An Incident Commander acts as the single source of truth of what is currently happening and what is going to happen during a major incident. They come in all shapes, sizes, and colors.
+
+### Why have one?
+As any software system grows in size and complexity, things break and cause incidents. The Incident Commander is needed to help drive major incidents to resolution.
+
+### What are the responsibilities?
+1. Help prepare for major incidents:
+ * Set up communication channels for major incidents.
+ * Funnel people to these communication channels when there is a major incident.
+ * Train team members on how to communicate during major incidents and train other Incident Commanders.
+1. Drive major incidents to resolution:
+ * Get everyone on the same communication channel.
+ * Collect status information from team members for their services/areas of ownership.
+ * Collect proposed repair actions, then recommend repair actions to be taken.
+ * Delegate all repair actions; the Incident Commander is NOT a resolver.
+ * Be the single authority on system status.
+1. Post-mortem:
+ * Create the initial template right after the incident so people can put in their thoughts while fresh.
+ * Assign the post-mortem after the event is over; this can be done after the call.
+ * Work with Team Leads/Managers on scheduling preventive actions.
+
+### Who are they?
+Anyone on the Incident Commander on-call schedule. Trainees are typically on the Incident Commander Shadow schedule.
+
+### How can I become one?
+Take a look at our [Incident Commander training guide](/training/incident_commander.md).
+
+---
+
+## Deputy
+
+### What is it?
+A Deputy is a direct support role for the Incident Commander. This is not a shadow role where the person just observes; the Deputy is expected to perform important tasks during an incident.
+
+### Why have one?
+It's important for the IC to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. The Deputy helps to support the IC and keep them focused on the incident.
+
+### What are the responsibilities?
+The Deputy is expected to:
+
+1. Bring up issues to the Incident Commander that may otherwise not be addressed (keeping an eye on timers that have been started, circling back around to missed items from a roll call, etc).
+1. Be a "hot standby" Incident Commander, should the primary need to either transition to a SME, or otherwise have to step away from the IC role.
+1. Page SMEs or other on-call engineers as instructed by the Incident Commander.
+1. Manage the incident call, and be prepared to remove people from the call if instructed by the Incident Commander.
+1. Liaise with stakeholders and provide status updates on Slack as necessary.
+
+### Who are they?
+Any Incident Commander can act as a deputy. Deputies need to be trained as an Incident Commander as they may be required to take over command.
+
+### How can I become one?
+Take a look at our [Deputy training guide](/training/deputy.md). Deputies also need to be [trained as an Incident Commander](/training/incident_commander.md).
+
+---
+
+## Scribe
+
+### What is it?
+A Scribe documents the timeline of an incident as it progresses, and makes sure all important decisions and data are captured for later review.
+
+### Why have one?
+The incident commander will need to focus on the problem at hand, and the subject matter experts will need to focus on resolving the incident. It is important to capture a timeline of events as they happen so that they can be reviewed during the post-mortem to determine how well we performed, and so we can accurately determine any additional impact that we might not have noticed at the time.
+
+### What are the responsibilities?
+The Scribe is expected to:
+
+1. Ensure the incident call is being recorded.
+1. Note in Slack important data, events, and actions, as they happen. Specifically:
+ * Key actions as they are taken (Example: "prod-server-387723 is being restarted to attempt to remove the stuck lock")
+ * Status reports when one is provided by the IC (Example: "We are in SEV-1, service A is currently not processing events due to a stuck lock, X is restarting the app stack, next checkin in 3 minutes")
+ * Any key callouts either during the call or at the ending review (Example: "Note: (Bob B) We should have a better way to determine stuck locks.")
+
+### Who are they?
+Anyone can act as a scribe during an incident. The scribe is chosen by the Incident Commander at the start of the call. Typically the Deputy will act as the Scribe, but that doesn't necessarily need to happen, and for larger incidents may not be possible.
+
+### How can I become one?
+Follow our [Scribe training guide](/training/scribe.md), and then notify the Incident Commanders that you would like to be considered for scribing for the next incident.
+
+---
+
+## Subject Matter Expert
+
+### What is it?
+A Subject Matter Expert (SME), sometimes called a resolver, is a domain expert for a particular service who diagnoses and fixes problems with that service during an incident.
+
+### Why have one?
+The IC and deputy are not all-knowing super beings. When there is a problem with a service, an expert in that service is needed to be able to quickly help identify and fix issues.
+
+### What are the responsibilities?
+1. Being able to diagnose common problems with the service.
+1. Being able to rapidly fix issues found during an incident.
+1. Concise communication skills, specifically for CAN reports:
+ * Condition: What is the current state of the service? Is it healthy or not?
+ * Actions: What actions need to be taken if the service is not in a healthy state?
+ * Needs: What support does the resolver need to perform an action?
+
+### Who are they?
+Anyone who is considered a "domain expert" can act as a resolver for an incident. Typically the service's primary on-call will act as the SME for that service.
+
+### How can I become one?
+Take a look at our [Subject Matter Expert training guide](/training/subject_matter_expert.md). You should also discuss with your team and service owner to determine what the requirements are for your particular service.
+
+---
+
+## Customer Liaison
+
+### What is it?
+A person responsible for interacting with customers, either directly, or via our public communication channels. Typically a member of the Customer Support team.
+
+### Why have one?
+All of the other roles will be actively working on identifying the cause and resolving the issue. We need a role which is focused purely on the customer interaction side of things so that it can be done properly, with the due care and attention it needs.
+
+### What are the responsibilities?
+1. Post any publicly facing messages regarding the incident (Twitter, StatusPage, etc).
+1. Notify the IC of any customers reporting that they are affected by the incident.
+
+### Who are they?
+Any member of the Support Team can act as a customer liaison.
+
+### How can I become one?
+Talk to the Support Team about becoming our next customer liaison.
diff --git a/docs/before/severity_levels.md b/docs/before/severity_levels.md
new file mode 100644
index 0000000..b02770b
--- /dev/null
+++ b/docs/before/severity_levels.md
@@ -0,0 +1,89 @@
+The first step in any incident response process is to determine what actually constitutes an incident. Generally this is done by using "SEV" definitions, with lower numbered severities being more urgent. Operational issues can be classified at one of these severity levels, and in general you are able to take riskier actions to resolve a higher severity issue. Anything above a SEV-3 (that is, SEV-1 or SEV-2) is considered a "major incident" and gets a more intensive response than a normal incident.
+
+!!! note "Always Assume The Worst"
+ If you are unsure which level an incident is (e.g. not sure if SEV-2 or SEV-1), **treat it as the higher one**. During an incident is not the time to discuss or litigate severities; just assume the highest and review during a post-mortem.
+
+<table>
+  <thead>
+    <tr>
+      <th>Severity</th>
+      <th>Description</th>
+      <th>What To Do</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>SEV-1</td>
+      <td>
+        <ul>
+          <li>The system is in a critical state and is actively impacting a large number of customers.</li>
+          <li>Functionality has been severely impaired for a long time, breaking SLA.</li>
+          <li>Customer-data-exposing security vulnerability has come to our attention.</li>
+        </ul>
+      </td>
+      <td></td>
+    </tr>
+    <tr>
+      <td colspan="3">Anything above this line is considered a "Major Incident". A call is triggered for all major incidents.</td>
+    </tr>
+    <tr>
+      <td>SEV-3</td>
+      <td>
+        <ul>
+          <li>Partial loss of functionality, only affecting minority of customers.</li>
+          <li>Something that has the likelihood of becoming a SEV-2 if nothing is done.</li>
+          <li>No redundancy in a service (failure of 1 more node will cause outage).</li>
+        </ul>
+      </td>
+      <td>
+        <ul>
+          <li>Work on issue as your top priority.</li>
+          <li>Liaise with engineers of affected systems to identify cause.</li>
+          <li>If related to recent deployment, rollback.</li>
+          <li>Monitor status and notice if/when it escalates.</li>
+          <li>Mention on Slack if you think it has the potential to escalate.</li>
+        </ul>
+      </td>
+    </tr>
+    <tr>
+      <td>SEV-4</td>
+      <td>
+        <ul>
+          <li>Performance issues (delays, etc). Tasks that require non-immediate attention.</li>
+          <li>Job failure (not impacting alerting).</li>
+        </ul>
+      </td>
+      <td>
+        <ul>
+          <li>Work on the issue as your first priority (above "normal" tasks).</li>
+          <li>Monitor status and notice if/when it escalates.</li>
+        </ul>
+      </td>
+    </tr>
+    <tr>
+      <td>SEV-5</td>
+      <td>
+        <ul>
+          <li>Normal bugs which aren't impacting system use, cosmetic issues, etc.</li>
+        </ul>
+      </td>
+      <td>
+        <ul>
+          <li>Create a JIRA ticket and assign to owner of affected system.</li>
+        </ul>
+      </td>
+    </tr>
+  </tbody>
+</table>
+
+!!! note "Be Specific"
+ These severity descriptions have been changed from the PagerDuty internal definitions to be more generic. For your own documentation, you are encouraged to make your definitions very specific, usually referring to a % of users/accounts affected. You will usually want your severity definitions to be metric driven.
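
The classification rules on this page can be sketched in a few lines of Python. This is only an illustration, not part of the PagerDuty process: the names `Severity`, `is_major_incident`, and `assume_the_worst` are hypothetical, and the sketch assumes five levels (SEV-1 through SEV-5) with SEV-1 and SEV-2 counted as major incidents.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels; a lower number means a more urgent issue."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


def is_major_incident(sev: Severity) -> bool:
    """Anything above SEV-3 (SEV-1 or SEV-2) is a major incident."""
    return sev < Severity.SEV3


def assume_the_worst(*candidates: Severity) -> Severity:
    """When unsure between levels, treat the issue as the most severe one."""
    if not candidates:
        raise ValueError("at least one candidate severity is required")
    # The most severe level is the one with the lowest number.
    return min(candidates)
```

For example, an issue that might be either SEV-2 or SEV-3 would be handled as `assume_the_worst(Severity.SEV2, Severity.SEV3)`, which selects SEV-2 and therefore triggers the major incident process. In your own documentation, a metric-driven definition (such as a percentage of affected accounts) would decide the level in the first place.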
diff --git a/docs/during/during_an_incident.md b/docs/during/during_an_incident.md
new file mode 100644
index 0000000..49a711e
--- /dev/null
+++ b/docs/during/during_an_incident.md
@@ -0,0 +1,111 @@
+Information on what to do during a major incident. See our [severity level descriptions](/before/severity_levels.md) for what constitutes a major incident.
+
+!!! note "Documentation"
+ For your own internal documentation, you should make sure that this page has all of the necessary information prominently displayed, such as phone bridge numbers, Slack rooms, important chat commands, etc. Here is an example:
+
+