diff --git a/guides/runbooks/orca-zombie-executions.md b/guides/runbooks/orca-zombie-executions.md
index 5de1814be1..ee8391f4f8 100644
--- a/guides/runbooks/orca-zombie-executions.md
+++ b/guides/runbooks/orca-zombie-executions.md
@@ -36,18 +36,80 @@ If you've enabled the zombie check, set an alert on the metric `queue.zombies`,
 
 # Remediation
 
-You can run this command to cancel a zombie execution via the Orca admin API:
+## Rehydrate the Queue
 
+If the Execution is a zombie, there are no messages on the work queue for that Execution.
+You can attempt to rehydrate the queue (reissue messages onto the work queue based on the last stored state) using an [admin API in Orca](https://github.com/spinnaker/orca/blob/master/orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/admin/web/QueueAdminController.kt#L33), which must be called directly because it is not exposed through Gate.
+This command can operate on either a single execution or all executions within a time range.
+**This command will dry-run by default.**
+To actually rehydrate the queue, pass the query parameter `dryRun=false`.
+
+```bash
+$ curl -XPOST \
+  'https://localhost:8083/admin/queue/hydrate?executionId=01CS076X85RX6MWBTQ0VGBF8VX&dryRun=false'
 ```
-POST /admin/queue/zombies/{executionId}:kill
+
+This command is **best effort** and may not be able to rehydrate the Execution, especially if the Execution became a zombie while running a non-retryable task.
+
+An example response from the endpoint:
+
+```json
+{
+  "dryRun": false,
+  "executions": {
+    "01CS076X85RX6MWBTQ0VGBF8VX": {
+      "startTime": 1538679600852,
+      "actions": [
+        {
+          "description": "Task is running and is retryable",
+          "message": {
+            "kind": "runTask",
+            "executionType": "PIPELINE",
+            "executionId": "01CS076X85RX6MWBTQ0VGBF8VX",
+            "application": "myapplication",
+            "stageId": "01CS076X8501MNAD2ZTJ4ST2TM",
+            "taskId": "1",
+            "taskType": "com.netflix.spinnaker.orca.echo.pipeline.ManualJudgmentStage$WaitForManualJudgmentTask",
+            "attributes": [],
+            "ackTimeoutMs": 600000
+          },
+          "context": {
+            "stageId": "01CS076X8501MNAD2ZTJ4ST2TM",
+            "stageType": "manualJudgment",
+            "stageStartTime": 1538682406227,
+            "taskId": "1",
+            "taskType": "waitForJudgment",
+            "taskStartTime": 1538682406242
+          }
+        },
+        {
+          "description": "Task is running but is not retryable",
+          "context": {
+            "stageId": "01CS076X85ECXHF3FRWZBTQ359",
+            "stageType": "createProperty",
+            "stageStartTime": 1538681485559,
+            "taskId": "3",
+            "taskType": "monitorProperties",
+            "taskStartTime": 1538681546116
+          }
+        }
+      ],
+      "canApply": false
+    }
+  }
+}
 ```
 
-There is also a blanket kill command, which takes a `minimumActivity` [Duration](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/Duration.html) query parameter (e.g. `PT1H` for 1 hour, the default).
-This command should be used with caution, as zombie detection can result in false positives. There is no risk in letting a zombie live, so be safe!
-It is not recommended to use a `minimumActivity` value less than 1 hour.
+For each Execution, a final summary field, `canApply`, indicates whether the listed actions can be applied.
+If any part of an Execution cannot be rehydrated, the entire Execution will be skipped.
+
+## Cancel the Execution
+
+If the Execution cannot be rehydrated, it will need to be canceled.
+You can cancel the Execution via the UI or force cancellation via an Orca admin API:
 
 ```
-POST /admin/queue/zombies:kill?minimumActivity=PT1H
+PUT /admin/forceCancelExecution?executionId=01CS076X85RX6MWBTQ0VGBF8VX&executionType=PIPELINE
 ```
 
 ## Known Causes
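
The remediation flow in the hunk above (dry-run the hydrate endpoint, inspect `canApply`, cancel whatever cannot be rehydrated) could be triaged with a small script. The following is a minimal sketch, not part of the patch. It assumes the response has the shape of the example JSON, and it further assumes that an action without a `message` field is one that cannot be reissued, which is inferred from that example rather than confirmed by Orca's documentation. The function name `triage_hydrate_response` and the trimmed `SAMPLE` payload are hypothetical.

```python
import json


def triage_hydrate_response(response: dict) -> dict:
    """Split executions from a hydrate dry-run into two groups.

    Assumes the response shape shown in the runbook example:
    {"executions": {<id>: {"actions": [...], "canApply": bool}}}.
    """
    result = {"rehydratable": [], "needs_cancel": {}}
    for execution_id, summary in response.get("executions", {}).items():
        if summary.get("canApply"):
            result["rehydratable"].append(execution_id)
            continue
        # Assumption: an action with no "message" has nothing to reissue,
        # so its description explains why the execution is skipped.
        blockers = [
            action.get("description", "")
            for action in summary.get("actions", [])
            if "message" not in action
        ]
        result["needs_cancel"][execution_id] = blockers
    return result


# Trimmed copy of the example response from the runbook (hypothetical sample).
SAMPLE = json.loads("""
{
  "dryRun": true,
  "executions": {
    "01CS076X85RX6MWBTQ0VGBF8VX": {
      "actions": [
        {"description": "Task is running and is retryable",
         "message": {"kind": "runTask"}},
        {"description": "Task is running but is not retryable"}
      ],
      "canApply": false
    }
  }
}
""")

if __name__ == "__main__":
    print(json.dumps(triage_hydrate_response(SAMPLE), indent=2))
```

Executions listed under `needs_cancel` are the candidates for the cancel step. Running the triage against a dry-run response first keeps the check read-only until you decide to pass `dryRun=false`.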