Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[v2] Retry failed updates with exponential backoff (#709)
Currently, if our automation APIs call fail they return non-nil errors to the operator. In #676 I modified `Update` to translate these errors into a "failed" status on the Update/Stack, but other operations (preview etc.) still surface these errors and automatically re-queue. We'd like to retry these failed updates much less aggressively than we retry transient network errors, for example. To accomplish this we do a few things: * We consolidate the update controller's streaming logic for consistent error handling across all operations. * We return errors with known gRPC status codes as-is, but unknown status codes are translated into failed results for all operations. * We start tracking the number of times a stack has attempted an update. This is used to determine how much exponential backoff to apply. * A failed update is considered synced for a cooldown period before we retry it. The cooldown period starts at 5 minutes and doubles for every failed attempt, eventually maxing out at 24 hours. Fixes #677
- Loading branch information