Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extra reconciliation loops can cause harmless errors due to stale objects #30

Closed
metral opened this issue Jul 23, 2020 · 1 comment
Closed
Assignees
Labels
kind/enhancement Improvements or new features resolution/fixed This issue was fixed

Comments

@metral
Copy link
Contributor

metral commented Jul 23, 2020

Issue

Kubernetes uses optimistic concurrency, which can lead to invalid operations if an object becomes stale. This a feature in k8s, not a bug.

Working with GETs and UPDATES for CustomResources like the Stack means that we will occasionally hit stale data during operations. Here are some examples w.r.t the finalizer being added and executed, and hitting outdated objects. When this happens, we requeue the request iff the step is required for the run -- setting a finalizer is one of these steps.

2020-07-22T18:38:00.022Z        INFO    controller_stack        Adding Finalizer for the Stack  {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-commit-change-mmit4b", "Stack.Name": "stack-test-aws-s3-commit-change-mmit4b"}
2020-07-22T18:38:00.846Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "stack-controller", "request": "default/stack-test-aws-s3-commit-change-mmit4b", 
"error": "Operation cannot be fulfilled on stacks.pulumi.com \"stack-test-aws-s3-commit-change-mmit4b\": the object has been modified; please apply your changes to the latest version and try again"}
Failed to add Pulumi finalizer  {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-6itteb", "Stack.Name": "metral/s3-op-project/dev-zvei3i",
error": "Operation cannot be fulfilled on stacks.pulumi.com \"stack-test-aws-s3-6itteb\": the object has been modified; please apply your changes to the latest version and try again”}
2020-07-22T18:39:44.171Z        ERROR   controller_stack        Failed to run Pulumi finalizer  {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-commit-change-mmit4b", "Stack.Name": "metral/s3-op-project/dev-commit-change-autkr6", "error": "destroying resources for stack 'metral/s3-op-project/dev-commit-change-autkr6': exit status 255", "error

We can also see how requeued requests may fail if another loop got further along, e.g. update conflicts or destroy conflicts. We mitigate update conflicts by default by not using the RetryOnUpdateConflict option in the StackSpec, which dismisses conflicted update loops. Destroys (running the finalizer) are left as-is as these repeating themselves is not harmful if the intent to destroy was registered.

2020-07-22T22:12:38.273Z        INFO    controller_stack        Conflict with another concurrent update -- NOT retrying {"Request.Namespace": "default", "Request.Name": "stack-test-aws-s3-g37qr3", "Stack.Name": "metral/s3-op-project/dev-la4p4f", 
"Err:": "exit status 255"}

Extensive testing, use of retries on APIserver conflicts, and hardening of the reconcile loop has turned these extra loop errors mostly into warnings, and in most cases can be ignored.

Suggestions for a fix

  • Identify and elide extra AddFinalizer invocation. We only invoke if not set, but some unidentified event is leading to 2 finalizer registration attempts per test. Favorably, only one loop ever succeeds.
  • Permutations of predicates have not proved effective beyond the resourceGeneration, which ignores events for an Update if the generation number of the API object does not change -- no generation changes is only true for updates to spec.status and metadata changes. Disabling predicates can create extra reconcile loops and an inconsistent stack update activity, so turning them off is not a path forward, however, identifying if there is anything else that can be done here to lower the total number of reconciliation loops would be beneficial.
@metral metral changed the title Extra reconciliation loops can error due to stale objects Extra reconciliation loops can cause harmless errors due to stale objects Jul 23, 2020
@infin8x infin8x added kind/enhancement Improvements or new features and removed enhancement labels Jul 10, 2021
@EronWright EronWright self-assigned this Oct 30, 2024
@EronWright EronWright added the resolution/fixed This issue was fixed label Oct 30, 2024
@EronWright
Copy link
Contributor

This long-standing issue was addressed here:
#717

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement Improvements or new features resolution/fixed This issue was fixed
Projects
None yet
Development

No branches or pull requests

4 participants