handling describe instances consistency issue #801

vdhanan · 2021-03-18T17:32:01Z

Is this a bug fix or adding new feature?
fixes #389

What is this PR about? / Why do we need it?
The DescribeInstances API follows an eventual consistency model. The CSI driver should handle the fact that it may get inconsistent (stale) responses from this API.

What testing is done?
Manual testing, trying to reproduce the issue.

k8s-ci-robot · 2021-03-18T17:32:04Z

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.

If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]


k8s-ci-robot · 2021-03-18T17:32:09Z

Hi @vdhanan. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


coveralls · 2021-03-18T17:50:51Z

Pull Request Test Coverage Report for Build 1851

73 of 83 (87.95%) changed or added relevant lines in 1 file are covered.
12 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.1%) to 82.078%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/cloud/cloud.go	73	83	87.95%

Files with Coverage Reduction	New Missed Lines	%
pkg/cloud/cloud.go	12	82.57%

Totals
Change from base Build 1848:	0.1%
Covered Lines:	1896
Relevant Lines:	2310

💛 - Coveralls

vdhanan · 2021-03-29T19:46:25Z

/unlabel do-not-merge/work-in-progress


AndyXiangLi · 2021-04-02T23:50:52Z

/ok-to-test


wongma7 · 2021-04-07T23:19:59Z

can you describe in detail the exact scenario this is fixing?

If a volume is in detaching state, and we try to attach it back to the same node, what's the problem?

BTW, I have been advocating for us to move to / adopt the in-tree cloudprovider code instead of trying to roll our own and reinvent the wheel.

vdhanan · 2021-04-09T17:17:56Z


@wongma7 If a pod with a volume attached dies, that volume is detached, since we don't know where the pod will be recreated. The CSI driver confirms the detach by calling DescribeVolumes. If the pod is recreated on the same node, the CSI driver attempts to reattach the volume. However, if the volume still appears in the DescribeInstances response during the attach workflow (the API is only eventually consistent, so it can report stale info), the driver assumes the volume is already assigned and doesn't try to attach it. By erroring out when we see a volume in the detaching state, we ensure that the attach workflow retries, hopefully once DescribeInstances is reporting accurately. (Apologies if I used any incorrect terminology here.)
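To make the idea in the comment above concrete, here is a minimal Go sketch of "error out when the mapping is in detaching state". It is not the driver's code: `blockDevice`, `errVolumeDetaching`, and `findAssignedDevice` are hypothetical names, and the struct is a simplified stand-in for the block device mappings that DescribeInstances returns.

```go
package main

import "errors"

// blockDevice is a hypothetical, trimmed-down view of one block device
// mapping as reported by DescribeInstances (not the AWS SDK type).
type blockDevice struct {
	VolumeID   string
	DeviceName string
	Status     string // "attaching", "attached", "detaching", or "detached"
}

// errVolumeDetaching marks a retryable condition: the instance data may be
// stale, so the caller should retry the attach workflow later.
var errVolumeDetaching = errors.New("volume is still detaching; retry attach later")

// findAssignedDevice reports whether volumeID is already mapped on the
// instance. A mapping in "detaching" state is not trusted as "already
// assigned"; an error is returned instead so the attach is retried.
func findAssignedDevice(mappings []blockDevice, volumeID string) (device string, assigned bool, err error) {
	for _, m := range mappings {
		if m.VolumeID != volumeID {
			continue
		}
		if m.Status == "detaching" {
			return "", false, errVolumeDetaching
		}
		return m.DeviceName, true, nil
	}
	// No mapping found: the caller is free to attach the volume.
	return "", false, nil
}
```

The point of returning an error rather than `assigned == true` is that the caller never waits on an attachment that is actually going away.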

AndyXiangLi · 2021-04-09T18:07:37Z


If the volume appears in the instance's attachment list, then regardless of the volume's status we treat it as attached to the instance: the driver does not call the AttachVolume API and just waits for the volume to become attached. https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/cloud/cloud.go#L384
That wait takes a long time to time out (2800s, per the comment).

wongma7 · 2021-04-09T18:30:10Z

What if the volume gets detached, and then during our reattach attempt DescribeInstances is so stale that it still reports the volume as attached? Then we will proceed, and the volume will unexpectedly get detached, no?

Notice that the cloud provider code checks the state in each polling attempt and breaks out if the volume unexpectedly becomes detached: https://github.com/kubernetes/kubernetes/blob/a55bd631728590045b51a4f65bba31aed1415571/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L2205 This is a better solution.

I would like to see either
a) A full design doc that accounts for all possibilities and justifies why we need to invent our own solution. We are dealing with a race condition that could cause users to wait ~45 mins or cause volumes to unexpectedly get detached/attached, so the solution has to be thorough.
or
b) A copy of the existing cloud provider solution, which has already gone through its own review, been tested over multiple Kubernetes versions, etc. Especially if we care about migration compatibility, this is IMO the best option; I just never got around to doing it. #393

https://github.com/kubernetes/kubernetes/blob/a55bd631728590045b51a4f65bba31aed1415571/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L2178
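As a rough illustration of the polling behavior described above (re-check the state on every attempt and bail out if the volume unexpectedly shows up as detached), here is a simplified Go sketch. It is not a copy of the in-tree function: `attachment`, `describeFunc`, `waitForState`, and `simulate` are hypothetical names, the DescribeVolumes call is replaced by an injected function so the sketch runs without AWS, and the fixed polling interval is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// attachment is a simplified, hypothetical view of a volume attachment.
type attachment struct {
	State      string // "attaching", "attached", "detaching", or "detached"
	InstanceID string
}

// describeFunc abstracts the DescribeVolumes call.
type describeFunc func(ctx context.Context) (attachment, error)

var errUnexpectedDetach = errors.New("expected volume to be attached but it was detached")

// waitForState polls until the volume reaches expectedState. The state is
// re-checked on every attempt; when waiting for "attached", observing
// "detached" fails immediately instead of waiting for the timeout.
func waitForState(ctx context.Context, describe describeFunc, expectedState, expectedInstance string, interval time.Duration, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		a, err := describe(ctx)
		if err != nil {
			return fmt.Errorf("describing volume: %w", err)
		}
		if expectedState == "attached" && a.State == "detached" {
			return errUnexpectedDetach
		}
		if a.State == expectedState {
			if expectedState == "attached" && a.InstanceID != expectedInstance {
				return fmt.Errorf("volume attached to %q, expected %q", a.InstanceID, expectedInstance)
			}
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return fmt.Errorf("timed out waiting for state %q", expectedState)
}

// simulate replays a fixed sequence of observed states through waitForState
// while waiting for "attached" on instance "i-1234".
func simulate(states ...string) error {
	i := 0
	describe := func(ctx context.Context) (attachment, error) {
		s := states[i]
		if i < len(states)-1 {
			i++
		}
		return attachment{State: s, InstanceID: "i-1234"}, nil
	}
	return waitForState(context.Background(), describe, "attached", "i-1234", time.Millisecond, len(states))
}
```

In this sketch, `simulate("attaching", "attached")` succeeds on the second poll, while `simulate("attaching", "detached")` returns `errUnexpectedDetach` right away rather than polling until the timeout.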

AndyXiangLi · 2021-04-09T22:02:58Z


@vdhanan In your testing, did you see the detached volume go back to "attached" status? I agree with porting the in-tree waitForAttachmentStatus to the CSI driver, as it is more robust.

vdhanan · 2021-04-14T16:25:02Z

I ported most of the waitForAttachmentStatus function from in-tree. I think after GA we should just consume the vendor code directly like Matthew mentioned.

AndyXiangLi · 2021-04-14T17:03:15Z


Can you add some unit tests for the updated function?

wongma7 · 2021-04-15T23:38:29Z

/lgtm

Thanks. I am much more confident if we just copy the code. I know it's not so glamorous, but since this issue is so tricky to debug and test (the race condition is hard to reproduce), I think it's the best option!

wongma7 · 2021-04-15T23:44:07Z

pkg/cloud/cloud_test.go

+			ctx := context.Background()
+
+			switch tc.name {
+			case "success: detached":


(Just a style alternative:) if you want to avoid depending on the test case name, you could make these anonymous functions, something like this: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/driver/node_test.go#L69

vdhanan · 2021-04-19

That's definitely cleaner. I'll use this style next time.
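For readers unfamiliar with the style being suggested, here is a minimal Go sketch of a test table whose cases carry their own assertion closures instead of switching on `tc.name`. Everything in it is hypothetical: `fakeWait` stands in for the function under test, and the small `reporter` interface (which `*testing.T` satisfies) together with `recorder` exists only so the sketch can run outside `go test`.

```go
package main

import "fmt"

// reporter is the subset of *testing.T this sketch needs.
type reporter interface {
	Errorf(format string, args ...interface{})
}

// recorder collects failures; it stands in for *testing.T outside `go test`.
type recorder struct{ failures []string }

func (r *recorder) Errorf(format string, args ...interface{}) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}

// fakeWait is a hypothetical stand-in for the function under test: it
// returns an error when the observed state differs from the expected one.
func fakeWait(expected, observed string) error {
	if expected != observed {
		return fmt.Errorf("expected state %q, got %q", expected, observed)
	}
	return nil
}

// testCase pairs a name with a closure holding the case's setup and
// assertions, so the test loop never inspects the name.
type testCase struct {
	name     string
	testFunc func(r reporter)
}

// cases builds the table; each entry is self-contained.
func cases() []testCase {
	return []testCase{
		{
			name: "success: detached",
			testFunc: func(r reporter) {
				if err := fakeWait("detached", "detached"); err != nil {
					r.Errorf("unexpected error: %v", err)
				}
			},
		},
		{
			name: "failure: wrong state",
			testFunc: func(r reporter) {
				if err := fakeWait("attached", "detached"); err == nil {
					r.Errorf("expected an error, got nil")
				}
			},
		},
	}
}
```

In a real test file the loop would typically be `for _, tc := range cases() { t.Run(tc.name, func(t *testing.T) { tc.testFunc(t) }) }`, so renaming a case can never silently change which assertions run.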

wongma7 · 2021-04-15T23:46:51Z

pkg/cloud/cloud_test.go

+			name:             "failure: already assigned but wrong state",
+			volumeID:         "vol-test-1234",
+			expectedState:    volumeAttachedState,
+			expectedInstance: "1235",


Should this say 1234? Otherwise, is this test giving us a false positive?

I think it's giving a false positive at the moment because we should be testing this case: if we set alreadyAttached to true and DescribeVolumes returns that the volume is detached, we want to error. Correct me if I'm wrong.

vdhanan · 2021-04-19

Yup, you're right, it should be 1234.

AndyXiangLi · 2021-04-19T17:44:28Z

/lgtm
/approve
Thanks!

k8s-ci-robot · 2021-04-19T17:44:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndyXiangLi, vdhanan


Needs approval from an approver in each of these files:

~~OWNERS~~ [AndyXiangLi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: no label Mar 18, 2021

k8s-ci-robot added the needs-ok-to-test label Mar 18, 2021

k8s-ci-robot requested review from ddebroy and wongma7 March 18, 2021 17:32

k8s-ci-robot added the size/XS label Mar 18, 2021

vdhanan marked this pull request as draft March 18, 2021 17:32

k8s-ci-robot added the do-not-merge/work-in-progress label Mar 18, 2021

vdhanan force-pushed the describeInstances branch from dc55f33 to 51f33fa March 18, 2021 17:33

k8s-ci-robot added cncf-cla: yes and removed cncf-cla: no labels Mar 18, 2021

AndyXiangLi mentioned this pull request Mar 19, 2021

Failure to AttachVolume, RequestCanceled: context deadline exceeded #795

Closed

vdhanan marked this pull request as ready for review April 2, 2021 17:39

k8s-ci-robot removed the do-not-merge/work-in-progress label Apr 2, 2021

k8s-ci-robot added cncf-cla: no and removed cncf-cla: yes labels Apr 2, 2021

vdhanan closed this Apr 2, 2021

vdhanan reopened this Apr 2, 2021

vdhanan force-pushed the describeInstances branch 2 times, most recently from b53aa12 to 1833d73 April 2, 2021 23:30

k8s-ci-robot added cncf-cla: yes and removed cncf-cla: no labels Apr 2, 2021

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Apr 2, 2021

vdhanan force-pushed the describeInstances branch from 1833d73 to 35e3261 April 2, 2021 23:58

k8s-ci-robot added the size/M label Apr 7, 2021

AndyXiangLi reviewed Apr 7, 2021

pkg/cloud/devicemanager/manager_test.go (outdated, resolved)

pkg/cloud/devicemanager/manager.go (outdated, resolved)

vdhanan force-pushed the describeInstances branch from b456309 to 21a67f7 April 7, 2021 22:46

vdhanan force-pushed the describeInstances branch from 21a67f7 to 2b9acda April 14, 2021 16:22

k8s-ci-robot added size/L and removed size/M labels Apr 14, 2021

vdhanan force-pushed the describeInstances branch from 2b9acda to 870829a April 14, 2021 23:49

k8s-ci-robot assigned wongma7 Apr 15, 2021

k8s-ci-robot added the lgtm label Apr 15, 2021

wongma7 reviewed Apr 15, 2021

Commit 307ed14: handle describeInstances eventual consistency

vdhanan force-pushed the describeInstances branch from 870829a to 307ed14 April 19, 2021 16:30

k8s-ci-robot removed the lgtm label Apr 19, 2021

k8s-ci-robot assigned AndyXiangLi Apr 19, 2021

k8s-ci-robot added the lgtm label Apr 19, 2021

k8s-ci-robot added the approved label Apr 19, 2021

k8s-ci-robot merged commit 61d72fc into kubernetes-sigs:master Apr 19, 2021

vdhanan deleted the describeInstances branch April 27, 2021 22:43

vdhanan mentioned this pull request Apr 29, 2021

REQUEST: New membership for vdhanan kubernetes/org#2680

Closed
