
implement a back-off for re-attempts at a failed update #677

Closed
EronWright opened this issue Sep 20, 2024 · 1 comment
Labels: kind/bug (Some behavior is incorrect or out of spec), resolution/fixed (This issue was fixed)
Assignee: blampe
Milestone: 0.111

EronWright (Contributor) commented Sep 20, 2024

The stack controller seems to retry failed updates without using a backoff. The built-in reconciliation backoff kicks in only when reconcile returns an error, which isn't the case here.

Consider looking at lastUpdate.endTime to implement a backoff strategy when lastUpdate.status is failed. Presumably the stack would stay marked as Reconciling (rather than Stalled).
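
A minimal sketch of the suggested approach, assuming a controller-runtime style reconciler: when the last update failed, compare lastUpdate.endTime against a cooldown and requeue only after the remaining time has passed (e.g. via ctrl.Result{RequeueAfter: ...}), leaving the Reconciling condition in place. The type and field names below are illustrative, not the operator's actual API.

```go
// Hypothetical sketch: delay re-attempts of a failed update based on when
// the previous attempt ended. Names (lastUpdate, requeueAfter) are
// illustrative and not taken from the operator's code.
package main

import (
	"fmt"
	"time"
)

type lastUpdate struct {
	State   string    // e.g. "succeeded" or "failed"
	EndTime time.Time // when the previous update attempt finished
}

// requeueAfter returns how long the reconciler should wait before retrying.
// A zero duration means "retry now".
func requeueAfter(lu *lastUpdate, cooldown time.Duration, now time.Time) time.Duration {
	if lu == nil || lu.State != "failed" {
		return 0
	}
	if elapsed := now.Sub(lu.EndTime); elapsed < cooldown {
		return cooldown - elapsed
	}
	return 0
}

func main() {
	lu := &lastUpdate{State: "failed", EndTime: time.Now().Add(-2 * time.Minute)}
	// With a 5-minute cooldown and 2 minutes elapsed, wait roughly 3 more minutes.
	fmt.Println(requeueAfter(lu, 5*time.Minute, time.Now()))
}
```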

EronWright converted this from a draft issue Sep 20, 2024
cleverguy25 commented:

Added to epic #586

pulumi-bot added the needs-triage (Needs attention from the triage team) label Sep 20, 2024
EronWright added kind/bug (Some behavior is incorrect or out of spec) and removed needs-triage Sep 20, 2024
blampe self-assigned this Oct 1, 2024
blampe added this to the 0.111 milestone Oct 1, 2024
blampe added a commit that referenced this issue Oct 11, 2024
Currently, if our automation API calls fail, they return non-nil errors
to the operator. In #676 I modified `Update` to translate these errors
into a "failed" status on the Update/Stack, but other operations
(preview etc.) still surface these errors and automatically re-queue.

We'd like to retry these failed updates much less aggressively than we
retry transient network errors, for example. To accomplish this we do a
few things:

* We consolidate the update controller's streaming logic for consistent
error handling across all operations.
* We return errors with known gRPC status codes as-is, but unknown
status codes are translated into failed results for all operations.
* We start tracking the number of times a stack has attempted an update.
This is used to determine how much exponential backoff to apply.
* A failed update is considered synced for a cooldown period before we
retry it. The cooldown period starts at 5 minutes and doubles for every
failed attempt, eventually maxing out at 24 hours.

Fixes #677
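
A rough sketch of the cooldown schedule and error classification described in that commit message, assuming failed attempts are counted per stack and that gRPC codes are inspected with google.golang.org/grpc/status; the constants and helper names below are illustrative, not the operator's actual code.

```go
// Hypothetical sketch of the described policy: a cooldown that starts at
// 5 minutes, doubles per consecutive failed attempt, and caps at 24 hours,
// plus a check for whether an error carries a known gRPC status code.
package main

import (
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

const (
	baseCooldown = 5 * time.Minute
	maxCooldown  = 24 * time.Hour
)

// cooldownFor returns the cooldown applied after the given number of
// consecutive failed update attempts: 5m, 10m, 20m, ... capped at 24h.
func cooldownFor(failedAttempts int) time.Duration {
	if failedAttempts < 1 {
		return 0
	}
	d := baseCooldown
	for i := 1; i < failedAttempts; i++ {
		d *= 2
		if d >= maxCooldown {
			return maxCooldown
		}
	}
	return d
}

// isKnownGRPCError reports whether err carries a recognized gRPC status
// code. In the scheme described above, such errors are returned as-is and
// re-queued by the controller, while anything else is recorded as a failed
// update and subjected to the cooldown.
func isKnownGRPCError(err error) bool {
	st, ok := status.FromError(err)
	return ok && st.Code() != codes.Unknown
}

func main() {
	for attempts := 1; attempts <= 10; attempts++ {
		fmt.Printf("attempt %2d -> cooldown %s\n", attempts, cooldownFor(attempts))
	}
	fmt.Println(isKnownGRPCError(status.Error(codes.Unavailable, "temporarily unreachable"))) // true
}
```

With this doubling schedule the cooldown reaches the 24-hour cap on the tenth consecutive failure.
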
EronWright added the resolution/fixed (This issue was fixed) label Oct 12, 2024