Write and run API k6 load tests during staging API deployments #5000
Labels
💻 aspect: code
Concerns the software code in the repository
🌟 goal: addition
Addition of new feature
🟨 priority: medium
Not blocking but should be addressed soon
🧱 stack: infra
Related to the Terraform config and other infrastructure
🔒 staff only
Restricted to staff members
Problem
We rely on staging to verify that new code deployments are zero-downtime safe. Zero-downtime safety necessarily requires the service to actually handle and respond to requests. Some migrations that are not zero-downtime safe would never cause problems in a service that is not handling requests. That is because zero-downtime safety is precisely about whether the service can handle and respond to requests while two versions of the application are running at the same time. Nothing is verified if the versions do not handle requests during that period.
Because the staging API sees virtually no traffic, and certainly no consistent traffic, we cannot rely on staging deployments to verify zero-downtime safety from this perspective.
Description
Write k6 load tests that can run on a timer and exercise all non-deprecated media requests: search, thumbnail, waveform, related, single result. We will rely on HMAC request signing to bypass caching and rate limiting.
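As a rough sketch of what "exercise all non-deprecated media requests" could look like, the helper below builds one request path per request kind for a given media type. The route templates (`/v1/<media type>/...`), the `buildMediaRequests` name, and the choice to limit waveforms to audio are all assumptions made for illustration, not something this issue specifies.

```typescript
// Hypothetical helper: the route templates below are assumptions, not confirmed paths.
type MediaType = "images" | "audio";

interface MediaRequest {
  label: string; // which kind of request this exercises
  path: string; // path relative to the API origin
}

function buildMediaRequests(
  mediaType: MediaType,
  id: string,
  query: string,
): MediaRequest[] {
  const base = `/v1/${mediaType}`;
  const requests: MediaRequest[] = [
    { label: "search", path: `${base}/?q=${encodeURIComponent(query)}` },
    { label: "single result", path: `${base}/${id}/` },
    { label: "related", path: `${base}/${id}/related/` },
    { label: "thumbnail", path: `${base}/${id}/thumb/` },
  ];
  // Assumption: waveforms only exist for audio results.
  if (mediaType === "audio") {
    requests.push({ label: "waveform", path: `${base}/${id}/waveform/` });
  }
  return requests;
}
```

Each virtual user iteration could then loop over the returned list and issue every request through the signing wrapper, so that all request kinds see traffic throughout the deployment.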
Ideally, the tests would also register new OAuth applications and make authenticated requests. However, there is currently no way to register and verify an application programmatically. To work around this, enable a new option in the staging API that auto-verifies OAuth applications if the request has a valid HMAC signature. With that in place, we can add tests that exercise the authentication workflow and make authenticated requests. (This will require adding the HMAC signing secret as an environment variable to the staging API.)
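The auto-verification decision might be sketched as follows. This only illustrates the logic: the real check would live in the API codebase, and the signing scheme used here (a hex-encoded HMAC-SHA256 over some request payload), along with the `hasValidSignature` and `shouldAutoVerify` names, is an assumption rather than the existing implementation.

```typescript
import { createHmac, timingSafeEqual } from "crypto";

// Assumption: the signature is a hex-encoded HMAC-SHA256 of a request payload.
function hasValidSignature(
  payload: string,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(payload).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return (
    received.length === expected.length && timingSafeEqual(received, expected)
  );
}

// Hypothetical gate: auto-verify only when the option is on, a secret is
// configured, and the request carries a valid signature.
function shouldAutoVerify(opts: {
  autoVerifyEnabled: boolean;
  payload: string;
  signatureHex?: string;
  secret?: string;
}): boolean {
  if (!opts.autoVerifyEnabled || !opts.secret || !opts.signatureHex) {
    return false;
  }
  return hasValidSignature(opts.payload, opts.signatureHex, opts.secret);
}
```

Keeping the option off by default and requiring the secret to be present means production behaviour would be unchanged unless both are explicitly configured.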
The tests must be able to run using one of the constant timed k6 executors (probably `constant-vus` but maybe `ramping-vus` if the staging API needs to warm up for request handling before the deployment rather than jumping straight to the peak traffic level of the test).
Like the frontend k6 local tests, they should be executed against the local API in test during CI on pull requests.
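To make the executor choice concrete, here is a minimal sketch of a helper that returns a `constant-vus` scenario when no warm-up is wanted and a `ramping-vus` scenario otherwise. The executor names and their fields (`vus`, `duration`, `startVUs`, `stages`) are k6's; the `buildScenario` helper, the scenario name, and the numbers (10 VUs, 15 minutes, 2-minute warm-up) are placeholders chosen for illustration.

```typescript
type ExecutorScenario =
  | { executor: "constant-vus"; vus: number; duration: string }
  | {
      executor: "ramping-vus";
      startVUs: number;
      stages: { duration: string; target: number }[];
    };

// Hypothetical helper: picks the executor based on whether a warm-up is requested.
function buildScenario(
  peakVus: number,
  totalMinutes: number,
  warmUpMinutes: number,
): ExecutorScenario {
  if (warmUpMinutes >= totalMinutes) {
    throw new Error("warm-up must be shorter than the total run");
  }
  if (warmUpMinutes <= 0) {
    // Jump straight to peak traffic and hold it for the whole run.
    return { executor: "constant-vus", vus: peakVus, duration: `${totalMinutes}m` };
  }
  return {
    executor: "ramping-vus",
    startVUs: 0,
    stages: [
      // Ramp up to the peak, then hold it for the remainder of the run.
      { duration: `${warmUpMinutes}m`, target: peakVus },
      { duration: `${totalMinutes - warmUpMinutes}m`, target: peakVus },
    ],
  };
}

// In a k6 script, an object like this would be exported as `options`.
const loadTestOptions = {
  scenarios: { api_during_deploy: buildScenario(10, 15, 2) },
};
```

Both shapes run for a fixed wall-clock duration, which is what allows the test window to be lined up against the deployment.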
Unlike the frontend k6 staging tests, which execute post deployment, the API tests will execute during deployment. Initiate the k6 tests as a parallel task to dispatching the staging deployment workflow. The staging API typically takes 8–10 minutes to deploy, so the k6 tests should run long enough before and after the deployment to give a head and a tail of peak traffic around the deployment period. For example, k6 could be started at least 2 minutes before triggering the staging deployment and allowed to run for 15 minutes total, resulting in a 2-minute head (+/- the time it takes the deployment GitHub Workflow to start and get to the point of deploying) and a 3–5 minute tail of traffic relative to the deployment period.
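The relationship between total run time, head, deployment time, and tail can be written down as simple arithmetic. The sketch below is illustrative only: the `RunWindow` shape and `tailMinutes` function are hypothetical names, and the workflow startup delay is modelled as extra head time that eats into the tail.

```typescript
interface RunWindow {
  totalMinutes: number; // how long k6 runs overall
  headMinutes: number; // traffic before the deployment is triggered
  deployMinutes: number; // how long the deployment itself takes
  workflowStartupMinutes?: number; // delay before the workflow starts deploying
}

// Tail = whatever remains of the run after the head, any workflow startup
// delay, and the deployment itself have elapsed.
function tailMinutes(w: RunWindow): number {
  const startup = w.workflowStartupMinutes ?? 0;
  const tail = w.totalMinutes - w.headMinutes - startup - w.deployMinutes;
  if (tail < 0) {
    throw new Error("k6 would stop before the deployment finishes");
  }
  return tail;
}

// Example: check the slowest expected deployment against the planned run.
const slowestCaseTail = tailMinutes({
  totalMinutes: 15,
  headMinutes: 2,
  deployMinutes: 10,
});
```

Checking the slowest expected deployment this way makes it easy to see whether the total duration needs to grow if deployments get slower or the workflow takes longer to start.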
Steps, to be done in separate PRs:
`http.ts` utility. No work needs to be done to enable HMAC signing other than using the custom `http.ts` wrapper utility instead of k6's `http` directly.
Additional context
I've written this issue in response to a recent incident which highlighted the differences between staging and production as a vulnerability to our confidence in staging as a representative environment that we can trust to validate changes to the fullest possible extent.