
implement a back-off for re-attempts at a failed update #677

Closed
EronWright opened this issue Sep 20, 2024 · 1 comment
Labels: kind/bug (Some behavior is incorrect or out of spec), resolution/fixed (This issue was fixed)
Assignee: blampe
Milestone: 0.111

EronWright (Contributor) commented Sep 20, 2024

The stack controller seems to retry failed updates without using a backoff. The built-in reconciliation backoff kicks in only when reconcile returns an error, which isn't the case here.

Consider looking at lastUpdate.endTime to implement a backoff strategy when lastUpdate.status is failed. Presumably the stack would stay marked as Reconciling (rather than Stalled).
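
A minimal sketch of the suggested approach, assuming a controller-runtime style reconciler: when the last update failed, compare lastUpdate.endTime against a cooldown and requeue only after the remaining time has passed (e.g. via ctrl.Result{RequeueAfter: ...}), leaving the Reconciling condition in place. The type and field names below are illustrative, not the operator's actual API.

```go
// Hypothetical sketch: delay re-attempts of a failed update based on when
// the previous attempt ended. Names (lastUpdate, requeueAfter) are
// illustrative and not taken from the operator's code.
package main

import (
	"fmt"
	"time"
)

type lastUpdate struct {
	State   string    // e.g. "succeeded" or "failed"
	EndTime time.Time // when the previous update attempt finished
}

// requeueAfter returns how long the reconciler should wait before retrying.
// A zero duration means "retry now".
func requeueAfter(lu *lastUpdate, cooldown time.Duration, now time.Time) time.Duration {
	if lu == nil || lu.State != "failed" {
		return 0
	}
	if elapsed := now.Sub(lu.EndTime); elapsed < cooldown {
		return cooldown - elapsed
	}
	return 0
}

func main() {
	lu := &lastUpdate{State: "failed", EndTime: time.Now().Add(-2 * time.Minute)}
	// With a 5-minute cooldown and 2 minutes elapsed, wait roughly 3 more minutes.
	fmt.Println(requeueAfter(lu, 5*time.Minute, time.Now()))
}
```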

EronWright converted this from a draft issue Sep 20, 2024
cleverguy25 commented:

Added to epic #586

pulumi-bot added the needs-triage (Needs attention from the triage team) label Sep 20, 2024
EronWright added kind/bug (Some behavior is incorrect or out of spec) and removed needs-triage Sep 20, 2024
blampe self-assigned this Oct 1, 2024
blampe added this to the 0.111 milestone Oct 1, 2024
blampe added a commit that referenced this issue Oct 11, 2024
Currently, if our automation API calls fail, they return non-nil errors
to the operator. In #676 I modified `Update` to translate these errors
into a "failed" status on the Update/Stack, but other operations
(preview etc.) still surface these errors and automatically re-queue.

We'd like to retry these failed updates much less aggressively than we
retry transient network errors, for example. To accomplish this we do a
few things:

* We consolidate the update controller's streaming logic for consistent
error handling across all operations.
* We return errors with known gRPC status codes as-is, but unknown
status codes are translated into failed results for all operations.
* We start tracking the number of times a stack has attempted an update.
This is used to determine how much exponential backoff to apply.
* A failed update is considered synced for a cooldown period before we
retry it. The cooldown period starts at 5 minutes and doubles for every
failed attempt, eventually maxing out at 24 hours.

Fixes #677
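
A rough sketch of the cooldown schedule and error classification described in that commit message, assuming failed attempts are counted per stack and that gRPC codes are inspected with google.golang.org/grpc/status; the constants and helper names below are illustrative, not the operator's actual code.

```go
// Hypothetical sketch of the described policy: a cooldown that starts at
// 5 minutes, doubles per consecutive failed attempt, and caps at 24 hours,
// plus a check for whether an error carries a known gRPC status code.
package main

import (
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

const (
	baseCooldown = 5 * time.Minute
	maxCooldown  = 24 * time.Hour
)

// cooldownFor returns the cooldown applied after the given number of
// consecutive failed update attempts: 5m, 10m, 20m, ... capped at 24h.
func cooldownFor(failedAttempts int) time.Duration {
	if failedAttempts < 1 {
		return 0
	}
	d := baseCooldown
	for i := 1; i < failedAttempts; i++ {
		d *= 2
		if d >= maxCooldown {
			return maxCooldown
		}
	}
	return d
}

// isKnownGRPCError reports whether err carries a recognized gRPC status
// code. In the scheme described above, such errors are returned as-is and
// re-queued by the controller, while anything else is recorded as a failed
// update and subjected to the cooldown.
func isKnownGRPCError(err error) bool {
	st, ok := status.FromError(err)
	return ok && st.Code() != codes.Unknown
}

func main() {
	for attempts := 1; attempts <= 10; attempts++ {
		fmt.Printf("attempt %2d -> cooldown %s\n", attempts, cooldownFor(attempts))
	}
	fmt.Println(isKnownGRPCError(status.Error(codes.Unavailable, "temporarily unreachable"))) // true
}
```

With this doubling schedule the cooldown reaches the 24-hour cap on the tenth consecutive failure.
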
EronWright added the resolution/fixed (This issue was fixed) label Oct 12, 2024