This repository has been archived by the owner on Jul 10, 2021. It is now read-only.

docs(zombies): REVERT Remove hydrate command, add zombie-specific commands #2163

Open
wants to merge 1 commit into
base: master
74 changes: 68 additions & 6 deletions guides/runbooks/orca-zombie-executions.md
@@ -36,18 +36,80 @@ If you've enabled the zombie check, set an alert on the metric `queue.zombies`,

# Remediation

You can run this command to cancel a zombie execution via the Orca admin API:
## Rehydrate the Queue

If the Execution is a zombie, there are no messages on the work queue for that Execution.
You can attempt to rehydrate the queue (that is, reissue messages onto the work queue based on the last stored state) using an [admin API in Orca](https://github.com/spinnaker/orca/blob/master/orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/admin/web/QueueAdminController.kt#L33), which must be called directly because it is not exposed through Gate.
This command can operate on a single execution or on all executions within a time range.
**This command will dry-run by default.**
To actually rehydrate the queue, pass the query parameter `dryRun=false`.

```bash
$ curl -XPOST \
  "https://localhost:8083/admin/queue/hydrate?executionId=01CS076X85RX6MWBTQ0VGBF8VX&dryRun=false"
```
POST /admin/queue/zombies/{executionId}:kill

This command is **best effort** and may not be able to rehydrate the Execution, especially if the Execution was zombied while running a non-retryable task.

An example response from the endpoint:

```json
{
"dryRun": false,
"executions": {
"01CS076X85RX6MWBTQ0VGBF8VX": {
"startTime": 1538679600852,
"actions": [
{
"description": "Task is running and is retryable",
"message": {
"kind": "runTask",
"executionType": "PIPELINE",
"executionId": "01CS076X85RX6MWBTQ0VGBF8VX",
"application": "myapplication",
"stageId": "01CS076X8501MNAD2ZTJ4ST2TM",
"taskId": "1",
"taskType": "com.netflix.spinnaker.orca.echo.pipeline.ManualJudgmentStage$WaitForManualJudgmentTask",
"attributes": [],
"ackTimeoutMs": 600000
},
"context": {
"stageId": "01CS076X8501MNAD2ZTJ4ST2TM",
"stageType": "manualJudgment",
"stageStartTime": 1538682406227,
"taskId": "1",
"taskType": "waitForJudgment",
"taskStartTime": 1538682406242
}
},
{
"description": "Task is running but is not retryable",
"context": {
"stageId": "01CS076X85ECXHF3FRWZBTQ359",
"stageType": "createProperty",
"stageStartTime": 1538681485559,
"taskId": "3",
"taskType": "monitorProperties",
"taskStartTime": 1538681546116
}
}
],
"canApply": false
}
}
}
```

There is also a blanket kill command, which takes a `minimumActivity` [Duration](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/Duration.html) query parameter (e.g. `PT1H` for 1 hour, the default).
This command should be used with caution, as zombie detection can result in false positives. There is no risk in letting a zombie live, so be safe!
It is not recommended to use a `minimumActivity` value less than 1 hour.
For each Execution, the response provides a final action summary in the `canApply` field.
If any part of an Execution cannot be rehydrated, the entire Execution will be skipped.
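
The dry-run default and the `canApply` summary suggest a cautious workflow: dry-run first, and rehydrate for real only when the dry run reports that it can apply. The following is a minimal sketch of that idea, not part of Orca. The function names (`hydrate_url`, `can_apply`, `rehydrate_if_possible`) and the `ORCA_URL` variable are hypothetical, the base URL is copied from the example above, and the text match on `canApply` assumes that a dry-run response has the same shape as the example response and covers a single execution.

```shell
#!/bin/sh
# Assumption: base URL copied from the curl example above; adjust for your deployment.
ORCA_URL="${ORCA_URL:-https://localhost:8083}"

# Build the hydrate URL for one execution ID ($1).
# dryRun ($2) defaults to true, mirroring the endpoint's own default.
hydrate_url() {
  printf '%s/admin/queue/hydrate?executionId=%s&dryRun=%s\n' \
    "$ORCA_URL" "$1" "${2:-true}"
}

# Succeed when the JSON on stdin contains "canApply": true.
# This is a crude text match, assumed adequate for a single-execution response.
can_apply() {
  grep -q '"canApply"[[:space:]]*:[[:space:]]*true'
}

# Dry-run first; rehydrate for real only if the dry run reports canApply.
rehydrate_if_possible() {
  if curl -fsS -XPOST "$(hydrate_url "$1")" | can_apply; then
    curl -fsS -XPOST "$(hydrate_url "$1" false)"
  else
    echo "Execution $1 cannot be rehydrated; consider canceling it." >&2
    return 1
  fi
}
```

For example, `rehydrate_if_possible 01CS076X85RX6MWBTQ0VGBF8VX` would issue the dry run and, only if it reports `canApply` as true, repeat the call with `dryRun=false`.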

## Cancel the Execution

If the Execution cannot be rehydrated, it will need to be canceled.
You can cancel the Execution via the UI or force cancellation via an Orca admin API:

```
POST /admin/queue/zombies:kill?minimumActivity=PT1H
PUT /admin/forceCancelExecution?executionId=01CS076X85RX6MWBTQ0VGBF8VX&executionType=PIPELINE
```
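
Because this endpoint is also called directly on Orca, the force-cancel request can be issued with `curl` in the same way as the hydrate call. This is a hedged sketch: the `force_cancel_url` and `force_cancel` helpers and the `ORCA_URL` variable are hypothetical, the base URL is copied from the hydrate example, and defaulting `executionType` to `PIPELINE` is an assumption made for convenience.

```shell
#!/bin/sh
# Assumption: base URL copied from the hydrate example; adjust for your deployment.
ORCA_URL="${ORCA_URL:-https://localhost:8083}"

# Build the force-cancel URL for an execution ID ($1) and execution type ($2).
# The PIPELINE default is an assumption of this sketch, not an Orca default.
force_cancel_url() {
  printf '%s/admin/forceCancelExecution?executionId=%s&executionType=%s\n' \
    "$ORCA_URL" "$1" "${2:-PIPELINE}"
}

# Send the PUT request; -f makes curl exit non-zero on an HTTP error status.
force_cancel() {
  curl -fsS -XPUT "$(force_cancel_url "$@")"
}
```

For example, `force_cancel 01CS076X85RX6MWBTQ0VGBF8VX PIPELINE` sends the same request as the line above. The URL is quoted inside the helper so that the shell does not interpret the `&` in the query string.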

## Known Causes