Skip to content

Conversation

@mdbooth
Copy link
Contributor

@mdbooth mdbooth commented Feb 1, 2023

This avoids an extremely surprising write-after-read consistency bug when later returning after making subsequent object changes.

The issue stems from the fact that we don't write to the same place we read from. We read from a cache: a shared informer which is populated from a watch running against the apiserver. However we write directly to the apiserver rather than writing through the cache.

If we patch the watched object without returning, while we continue to execute the change will be written to the api server and will propagate to our own shared informer. This change will result in a new reconcile being queued for this version of the object which now has a Finalizer but no other changes.

Taking the case of the machine controller we will then go on to create a server and we will patch the object with a providerID and return. On return we will write this change to the apiserver which will eventually propagate to our shared informer and result in another reconcile. However our previous patch has already propagated, so we will be immediately called with the old version of the object. If we've been careful this will hopefully be no more than inefficient, but can lead to very hard to debug errors.

The safe way to patch objects when reconcile is called from an informer is to always return after patching an object which will directly result in another reconcile. We also must not set Requeue in the returned Result as this will result in being called again with the stale object. When patching an object we must return and wait for the informer to see the change and call the reconciler again.

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 1, 2023
@netlify
Copy link

netlify bot commented Feb 1, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name Link
🔨 Latest commit d71d7aa
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/63da4ebb6778b6000aea4572
😎 Deploy Preview https://deploy-preview-1464--kubernetes-sigs-cluster-api-openstack.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 1, 2023
This avoids an extremely surprising write-after-read consistency bug
when later returning after making subsequent object changes.

The issue stems from the fact that we don't write to the same place we
read from. We read from a cache: a shared informer which is populated
from a watch running against the apiserver. However we write directly to
the apiserver rather than writing through the cache.

If we patch the watched object without returning, while we continue to
execute the change will be written to the api server and will propagate
to our own shared informer. This change will result in a new reconcile
being queued for this version of the object which now has a Finalizer
but no other changes.

Taking the case of the machine controller we will then go on to create a
server and we will patch the object with a providerID and return. On
return we will write this change to the apiserver which will eventually
propagate to our shared informer and result in another reconcile.
However our previous patch has already propagated, so we will be
*immediately* called with the *old version* of the object. If we've been
careful this will hopefully be no more than inefficient, but can lead to
very hard to debug errors.

The safe way to patch objects when reconcile is called from an informer
is to always return after patching an object which will directly result
in another reconcile. We also *must not* set Requeue in the returned
Result as this will result in being called again with the stale object.
When patching an object we must return and wait for the informer to see
the change and call the reconciler again.
@mdbooth mdbooth force-pushed the return-after-patch branch from 7387216 to d71d7aa Compare February 1, 2023 11:36
Copy link
Contributor

@lentzi90 lentzi90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2023
@jichenjc
Copy link
Contributor

jichenjc commented Feb 3, 2023

/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 3, 2023
@k8s-ci-robot k8s-ci-robot merged commit 8b7c30f into kubernetes-sigs:main Feb 3, 2023
@pierreprinetti pierreprinetti deleted the return-after-patch branch February 7, 2023 10:38
@pierreprinetti
Copy link
Contributor

Thank you for the fix and even more for the explanation.

@lentzi90
Copy link
Contributor

lentzi90 commented Feb 7, 2023

Should we backport to release-0.7? I think it would be nice to get this in a release before the next minor 🙂

@lentzi90
Copy link
Contributor

/cherry-pick release-0.7

@k8s-infra-cherrypick-robot

@lentzi90: #1464 failed to apply on top of branch "release-0.7":

Applying: Return from reconciler after adding finalizer
Using index info to reconstruct a base tree...
M	controllers/openstackcluster_controller.go
M	controllers/openstackmachine_controller.go
Falling back to patching base and 3-way merge...
Auto-merging controllers/openstackmachine_controller.go
CONFLICT (content): Merge conflict in controllers/openstackmachine_controller.go
Auto-merging controllers/openstackcluster_controller.go
CONFLICT (content): Merge conflict in controllers/openstackcluster_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Return from reconciler after adding finalizer
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherry-pick release-0.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants