🐛 Return from reconciler after adding finalizer #1464

mdbooth · 2023-02-01T11:31:51Z

This avoids an extremely surprising write-after-read consistency bug when later returning after making subsequent object changes.

The issue stems from the fact that we don't write to the same place we read from. We read from a cache: a shared informer which is populated from a watch running against the apiserver. However we write directly to the apiserver rather than writing through the cache.

If we patch the watched object without returning, while we continue to execute the change will be written to the api server and will propagate to our own shared informer. This change will result in a new reconcile being queued for this version of the object which now has a Finalizer but no other changes.

Taking the case of the machine controller we will then go on to create a server and we will patch the object with a providerID and return. On return we will write this change to the apiserver which will eventually propagate to our shared informer and result in another reconcile. However our previous patch has already propagated, so we will be immediately called with the old version of the object. If we've been careful this will hopefully be no more than inefficient, but can lead to very hard to debug errors.

The safe way to patch objects when reconcile is called from an informer is to always return after patching an object which will directly result in another reconcile. We also must not set Requeue in the returned Result as this will result in being called again with the stale object. When patching an object we must return and wait for the informer to see the change and call the reconciler again.

/hold

netlify · 2023-02-01T11:32:01Z

✅ Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name	Link
🔨 Latest commit	`d71d7aa`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/63da4ebb6778b6000aea4572
😎 Deploy Preview	https://deploy-preview-1464--kubernetes-sigs-cluster-api-openstack.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

k8s-ci-robot · 2023-02-01T11:32:03Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [mdbooth]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This avoids an extremely surprising write-after-read consistency bug when later returning after making subsequent object changes. The issue stems from the fact that we don't write to the same place we read from. We read from a cache: a shared informer which is populated from a watch running against the apiserver. However we write directly to the apiserver rather than writing through the cache. If we patch the watched object without returning, while we continue to execute the change will be written to the api server and will propagate to our own shared informer. This change will result in a new reconcile being queued for this version of the object which now has a Finalizer but no other changes. Taking the case of the machine controller we will then go on to create a server and we will patch the object with a providerID and return. On return we will write this change to the apiserver which will eventually propagate to our shared informer and result in another reconcile. However our previous patch has already propagated, so we will be *immediately* called with the *old version* of the object. If we've been careful this will hopefully be no more than inefficient, but can lead to very hard to debug errors. The safe way to patch objects when reconcile is called from an informer is to always return after patching an object which will directly result in another reconcile. We also *must not* set Requeue in the returned Result as this will result in being called again with the stale object. When patching an object we must return and wait for the informer to see the change and call the reconciler again.

lentzi90

/lgtm

jichenjc · 2023-02-03T06:27:18Z

/hold cancel
/lgtm

pierreprinetti · 2023-02-07T10:42:18Z

Thank you for the fix and even more for the explanation.

lentzi90 · 2023-02-07T11:20:57Z

Should we backport to release-0.7? I think it would be nice to get this in a release before the next minor 🙂

lentzi90 · 2023-02-16T11:11:56Z

/cherry-pick release-0.7

k8s-infra-cherrypick-robot · 2023-02-16T11:12:36Z

@lentzi90: #1464 failed to apply on top of branch "release-0.7":

Applying: Return from reconciler after adding finalizer
Using index info to reconstruct a base tree...
M	controllers/openstackcluster_controller.go
M	controllers/openstackmachine_controller.go
Falling back to patching base and 3-way merge...
Auto-merging controllers/openstackmachine_controller.go
CONFLICT (content): Merge conflict in controllers/openstackmachine_controller.go
Auto-merging controllers/openstackcluster_controller.go
CONFLICT (content): Merge conflict in controllers/openstackcluster_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Return from reconciler after adding finalizer
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherry-pick release-0.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 1, 2023

k8s-ci-robot requested a review from seanschneeweiss February 1, 2023 11:32

k8s-ci-robot requested a review from tobiasgiese February 1, 2023 11:32

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 1, 2023

mdbooth force-pushed the return-after-patch branch from 7387216 to d71d7aa Compare February 1, 2023 11:36

lentzi90 reviewed Feb 2, 2023

View reviewed changes

k8s-ci-robot assigned lentzi90 Feb 2, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2023

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 3, 2023

k8s-ci-robot assigned jichenjc Feb 3, 2023

k8s-ci-robot merged commit 8b7c30f into kubernetes-sigs:main Feb 3, 2023

pierreprinetti deleted the return-after-patch branch February 7, 2023 10:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🐛 Return from reconciler after adding finalizer #1464

🐛 Return from reconciler after adding finalizer #1464

mdbooth commented Feb 1, 2023

Uh oh!

netlify bot commented Feb 1, 2023 •

edited

Loading

Uh oh!

k8s-ci-robot commented Feb 1, 2023

Uh oh!

lentzi90 left a comment

Uh oh!

jichenjc commented Feb 3, 2023

Uh oh!

pierreprinetti commented Feb 7, 2023

Uh oh!

lentzi90 commented Feb 7, 2023

Uh oh!

lentzi90 commented Feb 16, 2023

Uh oh!

k8s-infra-cherrypick-robot commented Feb 16, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

🐛 Return from reconciler after adding finalizer #1464

🐛 Return from reconciler after adding finalizer #1464

Conversation

mdbooth commented Feb 1, 2023

Uh oh!

netlify bot commented Feb 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Uh oh!

k8s-ci-robot commented Feb 1, 2023

Uh oh!

lentzi90 left a comment

Choose a reason for hiding this comment

Uh oh!

jichenjc commented Feb 3, 2023

Uh oh!

pierreprinetti commented Feb 7, 2023

Uh oh!

lentzi90 commented Feb 7, 2023

Uh oh!

lentzi90 commented Feb 16, 2023

Uh oh!

k8s-infra-cherrypick-robot commented Feb 16, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

netlify bot commented Feb 1, 2023 •

edited

Loading