Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
65 changes: 65 additions & 0 deletions INCIDENT_RESPONSE_PLAN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
# Incident Response Plan (IRP)

## Scope

This IRP covers incidents affecting Node.js web properties and supporting services operated by the **@nodejs/web** team.

For a list of covered services and repositories, refer to [PERMISSIONS.md](./PERMISSIONS.md).

## IC & Escalation

* **Incident Commander (IC):** Any `@nodejs/web` member who first takes charge.

**Escalation:**
IC → `@nodejs/web-infra` → `@nodejs/web-admins` → `@nodejs/build` (Cloudflare account/zone-critical) and/or `@nodejs/security-wg` (security incidents) -> `@nodejs/tsc`.

## Severity Levels & SLAs

* **P0 – Critical user impact** (global outage/defacement/security breach):

* Acknowledge: TBD

* **P1 – Major degradation** (partial outage, broken downloads/docs on a locale/route):

* Acknowledge: TBD

* **P2 – Minor** (noncritical errors, single integration down):

* Acknowledge: TBD

When in doubt, start at higher severity and downgrade later.

## Canonical Response Workflow

1. **Declare** severity; assign IC and Comms Lead.

2. **Stabilize users first:**
* Roll back to last good deploy
* If needed, enable Cloudflare “Under Attack/WAF rules” and emergency caching on critical paths.

3. **Communicate:** post an initial status summary and known impact; repeat per SLA. (Use blog/announcements or org channel as appropriate; precedent: public [post-mortem for March 17 incident](https://nodejs.org/en/blog/announcements/node-js-march-17-incident).

4. **Contain & eradicate:** revoke keys/tokens, disable compromised deploy hooks, patch, and purge caches safely.

5. **Recover:** redeploy clean artifact, validate, then progressively relax mitigations.

6. **Review:** draft a blameless post-mortem, impact, root cause, and follow-up engineering actions \+ process fixes

## Common Incidents — What Happens & What They Cause

| Incident | Likely Cause | What users see | Immediate actions | Primary owner |
| ----------------------------------- | ------------------------------------------- | ------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | ---------------------------- |
| **Token/secret leak** | Accidental commit or exposed CI logs. | Subsequent unauthorized changes/deploys. | Invalidate in provider; rotate in 1Password; hunt for usage in audit logs; force redeploy clean. | Service owner + Web-Admins. |
| **Expired TLS/SSL certificate** | Missed renewal or misconfigured auto-renew. | Browser warnings (“Connection not secure”), failed API calls. | Renew/redeploy certificate; validate chain; confirm monitoring alerts. | Infra + Build. |
| **Outage due to misconfigured DNS** | Incorrect DNS update or provider outage. | Users can’t reach service; domain not resolving. | Roll back DNS change; verify propagation; coordinate with DNS provider. | Infra + Build. |
| **Compromised admin account** | Phishing or weak MFA. | Unauthorized changes in systems. | Disable account; rotate credentials; audit changes; notify security. | Security WG + Account owner. |

## Communications

**Internal (private):** `@nodejs/web` or `@nodejs/web-infra` channel/thread; if Cloudflare account action is required, loop in `@nodejs/build`.

**Public (as needed):** short status updates; if user impact was material, publish a brief blog post or addendum to an incident page (example precedent exists).

### Notes on authority & ownership

* Cloudflare account-level actions (e.g., role changes) are coordinated with **@nodejs/build**; Web-Infra holds write/admin depending on team (`web-infra` vs `web-admins`). Keep this in mind when planning mitigations that require account scope.
Loading