1 change: 1 addition & 0 deletions _articles/accounts.md
@@ -29,6 +29,7 @@ This page lists various services that Login.gov team uses to do work.
- NPM Package registry
- OpsGenie
- search.gov dashboard
- StatusPage

[onboarding]: {% link _articles/onboarding.md %}
[offboarding]: {% link _articles/offboarding.md %}
3 changes: 2 additions & 1 deletion _articles/incident-response-checklist.md
@@ -38,6 +38,7 @@ For detailed information see the [Security Incident Response Guide]({% link _art
* Issue created as official record for incident: [Incident Template](https://github.com/18F/identity-security-private/issues/new?template=incidents.md)
* Incident Review document created from [Incident Review Google Doc](https://docs.google.com/document/d/1Yaqnb9QsHRrlaBvlTeO_qHGmuP-0h4z-CCustU8gBdk/copy) and moved to the year's subfolder under the [Incident Reviews Folder](https://drive.google.com/drive/folders/1ZdroGfCbGmeUPuCqiR8BetUhEXRfk4ui?usp=sharing)
* Used [GSA IR Email Template](https://docs.google.com/document/d/16h4gDq9JeW8JBhBDswSvoGRWx6qQvX_4spyEZVbjlcA) to create and send notice to GSA Incident Response <gsa-ir@gsa.gov>, IT Service Desk <itservicedesk@gsa.gov> (or GSA IT Helpline called), and our [GSA ISSO and ISSM](https://github.com/18F/identity-devops/wiki/On-Call-Guide-Quick-Reference/#emergency-contacts) **within 1 hour** of start of incident
* Posts initial incident notice on StatusPage following [StatusPage Process - Managing an Outage]({% link _articles/statuspage-process.md %}#managing-an-outage)
* **Every 30 minutes** ensures StatusPage and external stakeholders are updated
* **Every 30 minutes** notifies Login.gov comms if the incident reaches 50% of the "Length of time" limit for the type of incident in the [Incident Response Thresholds for Communications](https://docs.google.com/document/d/19LfFyjlUeM2bbcztaMCswFm68FL5X51zzG1yNMQapz0/edit?skip_itp2_check=true&pli=1)

@@ -52,7 +53,7 @@ For detailed information see the [Security Incident Response Guide]({% link _art
- **High**: Confirmed PII breach, confirmed security penetration, complete outage
- **Medium**: Suspected PII breach, suspected security penetration, partial outage
- **Low**: Suspected attack, outage of non-prod persistent system (`int`)
* If user or partner impacting, [StatusPage updated](https://manage.statuspage.io/login) notice posted using one of the pre-made `Outage` templates if applicable
* If user or partner impacting, [StatusPage Process - Managing an Outage]({% link _articles/statuspage-process.md %}#managing-an-outage) followed to publish notice
* Checked [Incident Response Runbooks](https://github.com/18F/identity-devops/wiki/Incident-Response-Runbooks) for relevant runbooks to execute
* If secure shared notepad is needed, Google Doc opened and shared <https://drive.google.com/drive/folders/1TWTMp_w55niNuqC7vTPDEe5vkxaiP4P0> (Contents should be copied to official issue)

8 changes: 4 additions & 4 deletions _articles/secops-incident-response-guide.md
@@ -88,7 +88,7 @@ Roles proceed as follows:
* Creates the official tracking issue for the incident: [Incident Template](https://github.com/18F/identity-security-private/issues/new?template=incidents.md)
* Creates the Incident Review document by copying [Incident Review Google Doc](https://docs.google.com/document/d/1Yaqnb9QsHRrlaBvlTeO_qHGmuP-0h4z-CCustU8gBdk/copy) and shares a link in #login-situation
* Uses [GSA IR Email Template](https://docs.google.com/document/d/16h4gDq9JeW8JBhBDswSvoGRWx6qQvX_4spyEZVbjlcA) to create and send notice to GSA Incident Response <gsa-ir@gsa.gov>, IT Service Desk <itservicedesk@gsa.gov> (or GSA IT Helpline called), and our [GSA ISSO and ISSM](https://github.com/18F/identity-devops/wiki/On-Call-Guide-Quick-Reference/#emergency-contacts) **within 1 hour** of start of incident
* If incident is an outage (problem impacting users' ability to use Login.gov), SL updates the [Login.gov Statuspage](https://logingov.statuspage.io/) via the [Statuspage Admin Interface](https://manage.statuspage.io/login) ([View Sample Message]({{site.baseurl}}/images/statuspage-sample-message.png){:target="_blank"})
* If incident is an outage (problem impacting users' ability to use Login.gov), SL updates the [Login.gov StatusPage](https://logingov.statuspage.io/) following [StatusPage Process - Managing an Outage]({% link _articles/statuspage-process.md %}#managing-an-outage)
* Checks the incident against the [Incident Response Thresholds for Communications](https://docs.google.com/document/d/19LfFyjlUeM2bbcztaMCswFm68FL5X51zzG1yNMQapz0/edit?skip_itp2_check=true&pli=1) and notifies Login.gov comms before the incident reaches 50% of its length of time limit


@@ -152,7 +152,7 @@ At this phase, communications should follow these steps (and any additional step
* Real-time chat should happen in [#login-situation](https://gsa-tts.slack.com/messages/login-situation/).
* Create an issue in the [identity-security-private](https://github.com/18F/identity-security-private/issues/new?template=incidents.md) GitHub repository.
* Create a Google Doc
* If incident is an outage SL updates the [Login.gov Statuspage](https://logingov.statuspage.io/) via the [Statuspage Admin Interface](https://manage.statuspage.io/login) ([View Sample Message]({{site.baseurl}}/images/statuspage-sample-message.png){:target="_blank"})
* If incident is an outage SL updates the [Login.gov StatusPage](https://logingov.statuspage.io/) following [StatusPage Process - Managing an Outage]({% link _articles/statuspage-process.md %}#managing-an-outage)
* Check the incident against the [Incident Response Thresholds for Communications](https://docs.google.com/document/d/19LfFyjlUeM2bbcztaMCswFm68FL5X51zzG1yNMQapz0/edit?skip_itp2_check=true&pli=1) and notify Login.gov comms before the incident reaches 50% of its length of time limit
* Login.gov Agency Partners: send out an incident summary to LOGIN-PARTNERS@listserv.gsa.gov. Partner list: https://drive.google.com/drive/u/0/folders/0B4yIa0Upv1JJSkJOSmdsLWVOVmM

@@ -199,7 +199,7 @@ This sitrep should be:

#### Comms at the Assess phase

Updates and real-time chat should continue as above (updates on the GitHub issue, chat in Slack or Google Hangouts, and update to open Statuspage incident if applicable).
Updates and real-time chat should continue as above (updates on the GitHub issue, chat in Slack or Google Hangouts, and update to open StatusPage incident if applicable).

### Remediate Phase

@@ -239,7 +239,7 @@ Comms at the Remediate phase

* The SL should continue to post updated sitreps on a regular cadence (the section on severities, below, suggests cadences for each level). These sitreps should be sent to Slack, to GSA-IT and US-CERT via email, and to any other stakeholders identified throughout the process (e.g. clients).

* For user impacting incidents, users must be kept up to date via the [Login.gov Statuspage](https://logingov.statuspage.io/) ([Statuspage Admin Interface](https://manage.statuspage.io/login))
* For user impacting incidents, users must be kept up to date via the [Login.gov StatusPage](https://logingov.statuspage.io/) following [StatusPage Process - Managing an Outage]({% link _articles/statuspage-process.md %}#managing-an-outage)

### Retrospective Phase

223 changes: 223 additions & 0 deletions _articles/statuspage-process.md
@@ -0,0 +1,223 @@
---
title: "StatusPage Update Process"
description: "Publishing outage and maintenance information to StatusPage"
layout: article
category: Team
subcategory: Guides
---

Our public facing status page is: [status.login.gov](https://status.login.gov)

A high-level overview of StatusPage management follows. See
[support.atlassian.com/statuspage/resources/](https://support.atlassian.com/statuspage/resources/)
for full product documentation.

## Components

The following components are published to [status.login.gov](https://status.login.gov):

* **Login (secure.login.gov)** - Our production IdP service including authentication and identity verification services
* **Brochure site (login.gov)** - Our informational website
* **Customer Support** (group)
  * **Customer Support Online Form** - Our online customer support request form
  * **Customer Support Phone Line** - Our customer support phone line

Each component can be managed individually to share information to the public
and partners.

## StatusPage Admins

The list of StatusPage Admins is
available in the [Handbook Appendix](https://docs.google.com/document/d/1ZMpi7Gj-Og1dn-qUBfQHqLc1Im7rUzDmIxKn11DPJzk/edit#heading=h.1c3ohc5eqn5r).

You can ask for help from a StatusPage admin by using the Slack group `@login-statuspagers`.

The remaining content is for StatusPage admins using the [StatusPage Manager][statuspage-manager].

## What to Share and What Not to Share

StatusPage is a public resource. It is important to provide transparency without
oversharing. Using [templates](#template-management) is advised to avoid having
to create language under duress.

Do:
* Use plain language
* Explain how our users (the public) and our agency partners are impacted
* Highlight what works and what does not
* Focus on functionality and availability

Do Not:
* Share security details
* Share the name of any vendor or service provider
* Promise a time to recover service

## Managing an Outage

### Start Outage Incident

Log in to the [StatusPage Manager][statuspage-manager], then:
* Ensure the **Login.gov** page is selected
* Under **Incidents** click **Create incident**
* Use the **Apply template** dropdown on the top right and select an appropriate
template from the **OUTAGE** list
* Refine the **Incident name** as needed
* Set the **Incident status** to the option that best describes where we are in the IR process
* Refine the **Message** as needed
* Ensure the affected component(s) are checked
* Change the status from **Operational** to the current status
  * Degraded Performance - Slow response or intermittent errors
  * Partial Outage - Some functionality unavailable
  * Major Outage - All or most functionality unavailable
* Ensure **Send notifications** is checked
* **PROOFREAD THE INCIDENT NAME AND MESSAGE** - You are about to send a notification
to thousands of people!
* Click **Create** to post the incident to StatusPage and send notifications

### Update

The incident should be updated when:
* The incident status changes (e.g., moving from Investigating to Identified when the cause
of the outage has been identified)
* The operational status of the service(s) changes (e.g., moving from Partial Outage
to Degraded Performance)
* Every 30 minutes for a Major or Partial Outage, even if it is just to say
"Login.gov is continuing to work to restore service"

To update the incident:
* If you are not already viewing the incident, navigate to **Incidents** and click on it
* Change the **Incident status** if appropriate
* Enter the **Message**
* Change the component availability if appropriate
* **PROOFREAD**
* Click **Update** to post and send the update
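
The 30-minute cadence for Major and Partial Outages can be tracked with a small helper. The following is a minimal sketch, not part of any StatusPage tooling: the function names and the idea of passing the component status as a plain string are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Cadence from the guidance above: Major and Partial Outages get an update
# at least every 30 minutes. Other statuses have no fixed cadence here.
UPDATE_INTERVAL = timedelta(minutes=30)
CADENCE_STATUSES = {"Major Outage", "Partial Outage"}


def next_update_due(last_update: datetime, component_status: str) -> Optional[datetime]:
    """Return when the next update is due, or None if no fixed cadence applies."""
    if component_status not in CADENCE_STATUSES:
        return None
    return last_update + UPDATE_INTERVAL


def is_update_overdue(last_update: datetime, component_status: str, now: datetime) -> bool:
    """True when a cadence applies and the due time has been reached."""
    due = next_update_due(last_update, component_status)
    return due is not None and now >= due


if __name__ == "__main__":
    posted = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(next_update_due(posted, "Major Outage"))
```

Status or availability changes still call for an immediate update regardless of what this timer says.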

### End

Status should be changed to **Monitoring** with an availability of **Operational**
for at least the following minimum times before closing an incident:

* **Major Outage** or outage where things "mysteriously fixed themselves": 30 minutes
* **Partial Outage** or **Degraded Performance**: 15 minutes

Once the appropriate time has passed with no issues you can close the incident.

* Change the **Incident status** to **Resolved**
* Enter a message like "Service has been functioning normally for over X minutes. We consider this issue resolved."
* **PROOFREAD**
* Click **Update** to close the incident and send notification
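
The minimum monitoring times above reduce to a lookup plus one special case. This is a hedged sketch for illustration only: `earliest_resolve_time` and the `self_healed` flag are hypothetical names, and the status strings are assumed to match the component status labels used when the incident was opened.

```python
from datetime import datetime, timedelta, timezone

# Minimum time to stay in Monitoring/Operational before resolving,
# keyed by the worst component status reached during the incident.
MIN_MONITORING = {
    "Major Outage": timedelta(minutes=30),
    "Partial Outage": timedelta(minutes=15),
    "Degraded Performance": timedelta(minutes=15),
}

# Outages that "mysteriously fixed themselves" get the longer wait.
SELF_HEALED_MINIMUM = timedelta(minutes=30)


def earliest_resolve_time(
    monitoring_started: datetime,
    worst_status: str,
    self_healed: bool = False,
) -> datetime:
    """Return the earliest time the incident may be set to Resolved."""
    minimum = MIN_MONITORING[worst_status]
    if self_healed:
        minimum = max(minimum, SELF_HEALED_MINIMUM)
    return monitoring_started + minimum


if __name__ == "__main__":
    started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(earliest_resolve_time(started, "Partial Outage"))
```

These are minimums. Waiting longer is always acceptable if there is any doubt that service is stable.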

## Managing a Maintenance Window

Planned maintenance can range from work that is anticipated to be
non-disruptive to a complete outage window.

### Scheduling Maintenance

14 calendar days of advance notice should be provided
prior to maintenance. Work with the Partnerships team to ensure additional
partner communication if maintenance must be performed with less than 14 days'
notice.
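
The 14-day rule is simple date arithmetic. The following is a minimal sketch with hypothetical function names, and it assumes notice is counted in whole calendar days from the date the maintenance is posted.

```python
from datetime import date, timedelta

# Advance notice called for above, in calendar days.
REQUIRED_NOTICE = timedelta(days=14)


def earliest_maintenance_date(announce_on: date) -> date:
    """Earliest maintenance date that still gives the full notice period."""
    return announce_on + REQUIRED_NOTICE


def needs_partnerships_coordination(announce_on: date, maintenance_on: date) -> bool:
    """True when the notice period is shorter than 14 calendar days."""
    return maintenance_on - announce_on < REQUIRED_NOTICE


if __name__ == "__main__":
    print(earliest_maintenance_date(date(2024, 3, 1)))
```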

Where possible the recommended change window should be used for maintenance.
See [Runbook: Maintenance Window Tasks](https://github.com/18F/identity-devops/wiki/Runbook:-Maintenance-Window-Tasks)
for the suggested time window. It is recommended that you reach out to the
Partnerships team before scheduling maintenance in production, and that you
do the same for our `sandbox` (integration testing) environment.

Once the window has been selected, log in to the [StatusPage Manager][statuspage-manager] and:
* Click "Incidents" on the left menu and then select the "Maintenances" tab in the center top list
* Click "Schedule maintenance"
* Click the "Apply template" pull down and look for an applicable maintenance type
* Make sure the "Maintenance name" starts with the text `[Planned Maintenance]`
and accurately represents what users will experience
* Enter the maintenance window start date and time in **Scheduled Time**, minding
the listed timezone (Eastern Time)
* Select the duration of the window using the **for** hours and minutes input
* Update the message section:
* Include a "Maintenance Window" section that has the correct start and end dates listed for common timezones - You can use one of these templates:
~~~
# Standard Time template
Maintenance Window:
UTC: YYYY-MM-DD 06:00 to 09:30
Eastern: YYYY-MM-DD 1:00AM to 04:30AM
Central: YYYY-MM-DD 12:00AM to 03:30AM
Mountain: YYYY-MM-DD-1 11:00PM to YYYY-MM-DD 02:30AM
Pacific: YYYY-MM-DD-1 10:00PM to YYYY-MM-DD 01:30AM

# Daylight Saving Time template
Maintenance Window:
UTC: YYYY-MM-DD 05:00 to 08:30
Eastern: YYYY-MM-DD 1:00AM to 04:30AM
Central: YYYY-MM-DD 12:00AM to 03:30AM
Mountain: YYYY-MM-DD-1 11:00PM to YYYY-MM-DD 02:30AM
Pacific: YYYY-MM-DD-1 10:00PM to YYYY-MM-DD 01:30AM
~~~
* Ensure only the Component affected is selected: "Login (secure.login.gov)" for our main IdP
* Leave notification check boxes as is
* **BEFORE CLICKING SCHEDULE NOW**:
  * **PROOFREAD** - Are you sure everything reads correctly?
  * Double check the schedule date/time and ensure it aligns with the **Maintenance Window** text
    in the **Message** box
* Click "Schedule now" to post the maintenance on the status page and send notifications
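
Writing the "Maintenance Window" text by hand invites time zone mistakes, so it can help to generate it. The sketch below is one possible approach, not an official tool: the function name `maintenance_window_text` and the choice of IANA zone names are assumptions, and it relies on Python 3.9+ `zoneinfo` with time zone data available on the system. Because it uses real zone rules, it covers both the Standard Time and Daylight Saving Time cases from the templates.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Labels follow the message template; the IANA zone names are this sketch's choice.
US_ZONES = [
    ("Eastern", "America/New_York"),
    ("Central", "America/Chicago"),
    ("Mountain", "America/Denver"),
    ("Pacific", "America/Los_Angeles"),
]


def _clock24(moment: datetime) -> str:
    return f"{moment.hour:02d}:{moment.minute:02d}"


def _clock12(moment: datetime) -> str:
    hour = moment.hour % 12 or 12
    suffix = "AM" if moment.hour < 12 else "PM"
    return f"{hour:02d}:{moment.minute:02d}{suffix}"


def _span(start: datetime, end: datetime, clock) -> str:
    """Format 'date time to time', repeating the date only if it changes."""
    start_text = f"{start.date().isoformat()} {clock(start)}"
    if start.date() == end.date():
        return f"{start_text} to {clock(end)}"
    return f"{start_text} to {end.date().isoformat()} {clock(end)}"


def maintenance_window_text(start: datetime, duration: timedelta) -> str:
    """Build the Maintenance Window block for a timezone-aware start time."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    end_utc = start_utc + duration
    lines = ["Maintenance Window:", f"UTC: {_span(start_utc, end_utc, _clock24)}"]
    for label, zone_name in US_ZONES:
        zone = ZoneInfo(zone_name)
        local_span = _span(start_utc.astimezone(zone), end_utc.astimezone(zone), _clock12)
        lines.append(f"{label}: {local_span}")
    return "\n".join(lines)


if __name__ == "__main__":
    window_start = datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc)
    print(maintenance_window_text(window_start, timedelta(hours=3, minutes=30)))
```

The generated text should still be compared against the **Scheduled Time** entered in StatusPage, which is shown in Eastern Time.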

### Start

StatusPage will automatically post the scheduled maintenance to the page
and send notifications at the start of the maintenance window.

### Exceeding Window

Note that StatusPage will auto-close the incident
once the window has run its defined duration.

If maintenance is not going to plan and you need to exceed the window,
log in to the [StatusPage Manager][statuspage-manager] and:
* Under **Incidents** click on the open maintenance incident
* Select the **Schedule & Automation** tab
* Uncheck **Set status to completed** under **At the end of time for this maintenance**
* Click **Update**

Remember that you will need to manually close the incident once maintenance is
complete.

### End

Once work is complete and service has been fully restored you can close
the maintenance incident before the end of the window. This is always
recommended to ensure the public knows they can resume using Login.gov.

Log in to the [StatusPage Manager][statuspage-manager] and:
* Under **Incidents** click on the open maintenance incident
* Change the status to **Completed**
* In **Message** enter **Maintenance has been completed and all systems are functioning normally.**
* Click **Update** to close the incident, mark services as Operational, and send notifications

## Template Management

Templates should be used wherever possible for incidents and maintenance.
When developing a new template, reach out to Login.gov communications for help
refining and streamlining messaging.

See [StatusPage - Incident template](https://support.atlassian.com/statuspage/docs/create-an-incident-template/)
for more on templates.


## Correcting Uptime Reporting

StatusPage is integrated with NewRelic to provide request, latency, and uptime
information automatically. At times the NewRelic Synthetics monitor used to
determine uptime of `secure.login.gov` and `login.gov` may produce a false
positive alarm and mark us as down.

In the case of a false positive we can update StatusPage to reflect accurate
uptime.

* Verify that traffic levels and availability were normal during the time in
question
* Confirm your findings with platform or engineering leadership
* Follow instructions in [Changing component status outside of an incident](https://support.atlassian.com/statuspage/docs/what-is-a-component/#Changing-component-status-outside-of-an-incident)
to update the specific time frame to accurately represent availability

Always err on the side of caution with any availability publishing adjustment.

[statuspage-manager]: https://manage.statuspage.io