Kubernetes uses optimistic concurrency, which can lead to rejected operations if an object becomes stale. This is a feature of k8s, not a bug.
Working with GETs and UPDATEs for CustomResources like the Stack means that we will occasionally hit stale data during operations. Here are some examples of outdated objects being hit while the finalizer is added and executed. When this happens, we requeue the request if and only if the step is required for the run; setting a finalizer is one of these steps.
2020-07-22T18:38:00.022Z INFO controller_stack Adding Finalizer for the Stack {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-commit-change-mmit4b", "Stack.Name": "stack-test-aws-s3-commit-change-mmit4b"}
2020-07-22T18:38:00.846Z ERROR controller-runtime.controller Reconciler error {"controller": "stack-controller", "request": "default/stack-test-aws-s3-commit-change-mmit4b",
"error": "Operation cannot be fulfilled on stacks.pulumi.com \"stack-test-aws-s3-commit-change-mmit4b\": the object has been modified; please apply your changes to the latest version and try again"}
Failed to add Pulumi finalizer {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-6itteb", "Stack.Name": "metral/s3-op-project/dev-zvei3i",
"error": "Operation cannot be fulfilled on stacks.pulumi.com \"stack-test-aws-s3-6itteb\": the object has been modified; please apply your changes to the latest version and try again"}
2020-07-22T18:39:44.171Z ERROR controller_stack Failed to run Pulumi finalizer {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-commit-change-mmit4b", "Stack.Name": "metral/s3-op-project/dev-commit-change-autkr6", "error": "destroying resources for stack 'metral/s3-op-project/dev-commit-change-autkr6': exit status 255", "error
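To make the conflict in the logs above concrete, here is a minimal, standard-library-only sketch of optimistic concurrency. It is not the operator's code: `Store`, the trimmed-down `Stack`, `AddFinalizer`, and the finalizer name are all hypothetical stand-ins. It assumes the API server rejects any write based on an outdated resource version, and that a requeued loop re-reads the object and skips the write if the finalizer is already set.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict mimics the API server's "the object has been modified" response.
var ErrConflict = errors.New("conflict: the object has been modified")

// Stack is a hypothetical, trimmed-down stand-in for the Stack custom resource.
type Stack struct {
	Name            string
	ResourceVersion int
	Finalizers      []string
}

// Store is a toy in-memory API server that enforces optimistic concurrency.
type Store struct {
	objects map[string]Stack
}

func NewStore() *Store { return &Store{objects: map[string]Stack{}} }

// Create stores a new object at resource version 1.
func (s *Store) Create(name string) {
	s.objects[name] = Stack{Name: name, ResourceVersion: 1}
}

// Get returns a copy, the way a client reads a snapshot of the object.
func (s *Store) Get(name string) Stack {
	obj := s.objects[name]
	obj.Finalizers = append([]string(nil), obj.Finalizers...)
	return obj
}

// Update succeeds only if the caller's copy is based on the latest version.
func (s *Store) Update(obj Stack) error {
	if obj.ResourceVersion != s.objects[obj.Name].ResourceVersion {
		return ErrConflict
	}
	obj.ResourceVersion++
	s.objects[obj.Name] = obj
	return nil
}

func hasFinalizer(obj Stack, finalizer string) bool {
	for _, f := range obj.Finalizers {
		if f == finalizer {
			return true
		}
	}
	return false
}

// AddFinalizer re-reads the object on every attempt so a stale copy is never
// written back, and skips the write when the finalizer is already present.
func AddFinalizer(s *Store, name, finalizer string, maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		obj := s.Get(name)
		if hasFinalizer(obj, finalizer) {
			return nil
		}
		obj.Finalizers = append(obj.Finalizers, finalizer)
		err := s.Update(obj)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return ErrConflict
}

func main() {
	const finalizer = "example.com/finalizer" // hypothetical name
	s := NewStore()
	s.Create("stack-test")

	// Two reconcile loops read the same snapshot before either one writes.
	a, b := s.Get("stack-test"), s.Get("stack-test")
	a.Finalizers = append(a.Finalizers, finalizer)
	b.Finalizers = append(b.Finalizers, finalizer)

	fmt.Println("loop A:", s.Update(a))
	fmt.Println("loop B:", s.Update(b)) // stale copy, rejected
	fmt.Println("requeued loop B:", AddFinalizer(s, "stack-test", finalizer, 3))
}
```

In this sketch the losing loop's error is harmless: once requeued, it sees the finalizer that the winning loop already wrote and does nothing.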
We can also see how requeued requests may fail if another loop got further along, e.g. update conflicts or destroy conflicts. We mitigate update conflicts by default by not using the RetryOnUpdateConflict option in the StackSpec, which dismisses conflicted update loops. Destroys (running the finalizer) are left as-is, since repeating them is harmless once the intent to destroy has been registered.
2020-07-22T22:12:38.273Z INFO controller_stack Conflict with another concurrent update -- NOT retrying {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-g37qr3", "Stack.Name": "metral/s3-op-project/dev-la4p4f",
"Err:": "exit status 255"}
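The "NOT retrying" behavior in the log above can be sketched as a small decision function. This is an illustration only: `RetryOnUpdateConflict` is the option named in this issue, but `ErrUpdateConflict`, `Result`, and `handleUpdateError` are hypothetical names, and the real `StackSpec` has more fields than shown.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUpdateConflict is a hypothetical sentinel for "another update is in progress".
var ErrUpdateConflict = errors.New("conflict with another concurrent update")

// StackSpec is trimmed to the single option discussed here.
type StackSpec struct {
	RetryOnUpdateConflict bool // zero value (false) means conflicts are dismissed
}

// Result is a stand-in for a reconcile result.
type Result struct {
	Requeue bool
}

// handleUpdateError decides what a reconcile loop does after an update attempt.
func handleUpdateError(spec StackSpec, err error) (Result, error) {
	switch {
	case err == nil:
		return Result{}, nil
	case errors.Is(err, ErrUpdateConflict) && spec.RetryOnUpdateConflict:
		// Opt-in: try again later.
		return Result{Requeue: true}, nil
	case errors.Is(err, ErrUpdateConflict):
		// Default: another loop got further along, so drop this one quietly.
		return Result{}, nil
	default:
		// Any other failure is surfaced to the caller.
		return Result{}, err
	}
}

func main() {
	res, err := handleUpdateError(StackSpec{}, ErrUpdateConflict)
	fmt.Println("default:", res.Requeue, err)

	res, err = handleUpdateError(StackSpec{RetryOnUpdateConflict: true}, ErrUpdateConflict)
	fmt.Println("with retry option:", res.Requeue, err)
}
```

Dismissing the conflict by default keeps a losing loop from queueing behind the update that already won.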
Extensive testing, retries on API server conflicts, and hardening of the reconcile loop have turned these extra loop errors mostly into warnings, which in most cases can be ignored.
Suggestions for a fix
Identify and elide the extra AddFinalizer invocation. We only invoke it if the finalizer is not set, but some unidentified event is leading to 2 finalizer registration attempts per test. Fortunately, only one loop ever succeeds.
Permutations of predicates have not proved effective beyond resourceGeneration, which ignores events for an Update if the generation number of the API object does not change. The generation stays unchanged only for updates to status and metadata. Disabling predicates can create extra reconcile loops and inconsistent stack update activity, so turning them off is not a path forward. However, identifying anything else that can be done here to lower the total number of reconciliation loops would be beneficial.
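For reference, the generation-based filtering described above boils down to a single comparison. The sketch below uses only the standard library and hypothetical types (`ObjectMeta`, `UpdateEvent`, `generationChanged`, `countReconciles`) rather than the operator's actual predicate. It assumes the generation is bumped only when the spec changes.

```go
package main

import "fmt"

// ObjectMeta is a hypothetical, trimmed-down stand-in for Kubernetes object metadata.
type ObjectMeta struct {
	Generation int64
	Labels     map[string]string
}

// UpdateEvent pairs the object state before and after an update.
type UpdateEvent struct {
	Old ObjectMeta
	New ObjectMeta
}

// generationChanged reports whether an update event should trigger a reconcile.
func generationChanged(e UpdateEvent) bool {
	return e.Old.Generation != e.New.Generation
}

// countReconciles counts the events that pass the filter; a nil filter passes all.
func countReconciles(events []UpdateEvent, filter func(UpdateEvent) bool) int {
	n := 0
	for _, e := range events {
		if filter == nil || filter(e) {
			n++
		}
	}
	return n
}

func main() {
	events := []UpdateEvent{
		// Metadata-only change: generation stays at 1.
		{Old: ObjectMeta{Generation: 1}, New: ObjectMeta{Generation: 1, Labels: map[string]string{"a": "b"}}},
		// Spec change: generation moves from 1 to 2.
		{Old: ObjectMeta{Generation: 1}, New: ObjectMeta{Generation: 2}},
		// Status-only change: generation stays at 2.
		{Old: ObjectMeta{Generation: 2}, New: ObjectMeta{Generation: 2}},
	}
	fmt.Println("without predicate:", countReconciles(events, nil))
	fmt.Println("with generation predicate:", countReconciles(events, generationChanged))
}
```

Only the spec change gets through the filter here, which is why removing the predicate multiplies reconcile loops.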
metral changed the title from "Extra reconciliation loops can error due to stale objects" to "Extra reconciliation loops can cause harmless errors due to stale objects" on Jul 23, 2020