Extra reconciliation loops can cause harmless errors due to stale objects #30

metral · 2020-07-23T00:05:09Z

Issue

Kubernetes uses optimistic concurrency, which can lead to rejected operations if an object becomes stale. This is a feature of k8s, not a bug.

Working with GETs and UPDATEs for custom resources like the Stack means that we will occasionally hit stale data during operations. Below are some examples of hitting outdated objects while the finalizer is being added and executed. When this happens, we requeue the request if and only if the step is required for the run; setting a finalizer is one of these steps.

2020-07-22T18:38:00.022Z        INFO    controller_stack        Adding Finalizer for the Stack  {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-commit-change-mmit4b", "Stack.Name": "stack-test-aws-s3-commit-change-mmit4b"}
2020-07-22T18:38:00.846Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "stack-controller", "request": "default/stack-test-aws-s3-commit-change-mmit4b", 
"error": "Operation cannot be fulfilled on stacks.pulumi.com \"stack-test-aws-s3-commit-change-mmit4b\": the object has been modified; please apply your changes to the latest version and try again"}

Failed to add Pulumi finalizer  {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-6itteb", "Stack.Name": "metral/s3-op-project/dev-zvei3i",
"error": "Operation cannot be fulfilled on stacks.pulumi.com \"stack-test-aws-s3-6itteb\": the object has been modified; please apply your changes to the latest version and try again"}

2020-07-22T18:39:44.171Z        ERROR   controller_stack        Failed to run Pulumi finalizer  {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-commit-change-mmit4b", "Stack.Name": "metral/s3-op-project/dev-commit-change-autkr6", "error": "destroying resources for stack 'metral/s3-op-project/dev-commit-change-autkr6': exit status 255", "error

We can also see how requeued requests may fail if another loop got further along, e.g. with update conflicts or destroy conflicts. By default we mitigate update conflicts by leaving the RetryOnUpdateConflict option in the StackSpec unset, which dismisses conflicted update loops instead of retrying them. Destroys (running the finalizer) are left as-is, since repeating them is not harmful once the intent to destroy has been registered.

2020-07-22T22:12:38.273Z        INFO    controller_stack        Conflict with another concurrent update -- NOT retrying {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-g37qr3", "Stack.Name": "metral/s3-op-project/dev-la4p4f", 
"Err:": "exit status 255"}

Extensive testing, retries on APIserver conflicts, and hardening of the reconcile loop have turned these extra-loop errors mostly into warnings, which in most cases can be ignored.

Suggestions for a fix

Identify and elide the extra AddFinalizer invocation. We only invoke it if the finalizer is not set, but some unidentified event is leading to 2 finalizer registration attempts per test. Fortunately, only one loop ever succeeds.
Permutations of predicates have not proved effective beyond the resourceGeneration predicate, which ignores Update events if the generation number of the API object does not change; the generation stays unchanged only for updates to status and metadata. Disabling predicates can create extra reconcile loops and inconsistent stack update activity, so turning them off is not a path forward. However, it would be beneficial to identify anything else that can be done here to lower the total number of reconciliation loops.
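As a rough illustration of the two suggestions, the sketch below models a generation-based update filter and an add-if-absent finalizer check without any Kubernetes dependencies. `Meta`, `generationChanged`, `ensureFinalizer`, and the finalizer name are hypothetical; the real predicate lives in the controller framework and works on full API objects.

```go
package main

import "fmt"

// Meta is a hypothetical subset of Kubernetes object metadata.
type Meta struct {
	Generation int64
	Finalizers []string
}

// generationChanged mirrors the idea behind a generation-based update
// predicate: only updates that change the generation trigger a reconcile.
func generationChanged(oldObj, newObj Meta) bool {
	return oldObj.Generation != newObj.Generation
}

// ensureFinalizer adds name only if it is absent and reports whether
// the object was modified, i.e. whether a write is needed at all.
func ensureFinalizer(m *Meta, name string) bool {
	for _, f := range m.Finalizers {
		if f == name {
			return false
		}
	}
	m.Finalizers = append(m.Finalizers, name)
	return true
}

func main() {
	before := Meta{Generation: 1}
	after := before

	// Adding a finalizer is a metadata-only change: generation is untouched,
	// so the resulting Update event would be filtered out.
	fmt.Println("write needed:", ensureFinalizer(&after, "example-finalizer"))
	fmt.Println("reconcile on finalizer update:", generationChanged(before, after))

	// A second loop working from a fresh copy has nothing left to write.
	fmt.Println("write needed again:", ensureFinalizer(&after, "example-finalizer"))

	// A spec change bumps the generation and passes the filter.
	specChanged := after
	specChanged.Generation++
	fmt.Println("reconcile on spec update:", generationChanged(after, specChanged))
}
```

The add-if-absent check only helps a loop that reads a fresh copy. A loop holding a stale copy still attempts the write and hits the conflict shown in the logs above.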


EronWright · 2024-10-30T00:12:17Z

This long-standing issue was addressed here:
#717

metral mentioned this issue Jul 23, 2020

Harden the reconciliation loop to work best with the APIserver #29

Merged

metral changed the title ~~Extra reconciliation loops can error due to stale objects~~ Extra reconciliation loops can cause harmless errors due to stale objects Jul 23, 2020

leezen added the enhancement label Nov 13, 2020

infin8x added kind/enhancement Improvements or new features and removed enhancement labels Jul 10, 2021

EronWright self-assigned this Oct 30, 2024

EronWright added the resolution/fixed This issue was fixed label Oct 30, 2024

EronWright closed this as completed Oct 30, 2024
