[v2] Retry failed updates with exponential backoff (#709)

Currently, if our automation API calls fail they return non-nil errors to the operator. In #676 I modified `Update` to translate these errors into a "failed" status on the Update/Stack, but other operations (preview etc.) still surface these errors and automatically re-queue.

We'd like to retry these failed updates much less aggressively than we retry, for example, transient network errors. To accomplish this we do a few things:

* We consolidate the update controller's streaming logic for consistent error handling across all operations.
* We return errors with known gRPC status codes as-is, but unknown status codes are translated into failed results for all operations.
* We start tracking the number of times a stack has attempted an update. This is used to determine how much exponential backoff to apply.
* A failed update is considered synced for a cooldown period before we retry it. The cooldown period starts at 5 minutes and doubles for every failed attempt, eventually maxing out at 24 hours.

Fixes #677
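
For illustration, the cooldown schedule described above can be sketched as follows. This is a minimal sketch, not the code from this commit: the names `cooldown`, `baseCooldown`, and `maxCooldown` are hypothetical, and it assumes the first failure maps to the 5-minute cooldown and that a stack with no failures has no cooldown.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical constants mirroring the values in the commit message.
const (
	baseCooldown = 5 * time.Minute
	maxCooldown  = 24 * time.Hour
)

// cooldown returns how long a stack with the given number of failed
// attempts would be treated as synced before the next retry.
// Assumption: the first failure yields baseCooldown, and each further
// failure doubles it until maxCooldown is reached.
func cooldown(failures int64) time.Duration {
	if failures <= 0 {
		return 0
	}
	d := baseCooldown
	for i := int64(1); i < failures; i++ {
		d *= 2
		// Cap early so large failure counts cannot overflow the duration.
		if d >= maxCooldown {
			return maxCooldown
		}
	}
	return d
}

func main() {
	for _, n := range []int64{1, 2, 3, 10} {
		fmt.Printf("failures=%d cooldown=%s\n", n, cooldown(n))
	}
}
```

Under these assumptions the cap is reached on the tenth consecutive failure, since nine failures give 5 × 2^8 = 1280 minutes (about 21 hours) and the next doubling would exceed 24 hours.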
pulumi · Oct 11, 2024 · a4c8810
1 parent 83f8438
Showing 9 changed files with 802 additions and 363 deletions.
diff --git a/operator/api/pulumi/shared/stack_types.go b/operator/api/pulumi/shared/stack_types.go
@@ -388,6 +388,10 @@ type StackUpdateState struct {
 	Permalink Permalink `json:"permalink,omitempty"`
 	// LastResyncTime contains a timestamp for the last time a resync of the stack took place.
 	LastResyncTime metav1.Time `json:"lastResyncTime,omitempty"`
+	// Failures records how many times the update has been attempted and
+	// failed. Failed updates are periodically retried with exponential backoff
+	// in case the failure was due to transient conditions.
+	Failures int64 `json:"failures"`
 }
 
 // StackUpdateStatus is the status code for the result of a Stack Update run.

diff --git a/operator/config/crd/bases/pulumi.com_stacks.yaml b/operator/config/crd/bases/pulumi.com_stacks.yaml
@@ -9506,6 +9506,13 @@ spec:
                 description: LastUpdate contains details of the status of the last
                   update.
                 properties:
+                  failures:
+                    description: |-
+                      Failures records how many times the update has been attempted and
+                      failed. Failed updates are periodically retried with exponential backoff
+                      in case the failure was due to transient conditions.
+                    format: int64
+                    type: integer
                   generation:
                     description: Generation is the stack generation associated with
                       the update.
@@ -9536,6 +9543,8 @@ spec:
                   type:
                     description: Type is the type of update.
                     type: string
+                required:
+                - failures
                 type: object
               observedGeneration:
                 description: ObservedGeneration records the value of .meta.generation
@@ -18962,6 +18971,13 @@ spec:
                 description: LastUpdate contains details of the status of the last
                   update.
                 properties:
+                  failures:
+                    description: |-
+                      Failures records how many times the update has been attempted and
+                      failed. Failed updates are periodically retried with exponential backoff
+                      in case the failure was due to transient conditions.
+                    format: int64
+                    type: integer
                   generation:
                     description: Generation is the stack generation associated with
                       the update.
@@ -18992,6 +19008,8 @@ spec:
                   type:
                     description: Type is the type of update.
                     type: string
+                required:
+                - failures
                 type: object
               outputs:
                 additionalProperties: