Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handling describe instances consistency issue #801

Merged
merged 1 commit into from
Apr 19, 2021

Conversation

vdhanan
Copy link
Contributor

@vdhanan vdhanan commented Mar 18, 2021

Is this a bug fix or adding new feature?
fixes #389

What is this PR about? / Why do we need it?
the describeInstances API follows an eventual consistency model. the csi driver should handle the fact that it may get inconsistent responses from this API

What testing is done?
manual testing trying to reproduce

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Mar 18, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @vdhanan. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 18, 2021
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 18, 2021
@vdhanan vdhanan marked this pull request as draft March 18, 2021 17:32
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 18, 2021
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 18, 2021
@coveralls
Copy link

coveralls commented Mar 18, 2021

Pull Request Test Coverage Report for Build 1851

  • 73 of 83 (87.95%) changed or added relevant lines in 1 file are covered.
  • 12 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 82.078%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/cloud/cloud.go 73 83 87.95%
Files with Coverage Reduction New Missed Lines %
pkg/cloud/cloud.go 12 82.57%
Totals Coverage Status
Change from base Build 1848: 0.1%
Covered Lines: 1896
Relevant Lines: 2310

💛 - Coveralls

@vdhanan
Copy link
Contributor Author

vdhanan commented Mar 29, 2021

/unlabel do-not-merge/work-in-progress

@vdhanan vdhanan marked this pull request as ready for review April 2, 2021 17:39
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 2, 2021
@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 2, 2021
@vdhanan vdhanan closed this Apr 2, 2021
@vdhanan vdhanan reopened this Apr 2, 2021
@vdhanan vdhanan force-pushed the describeInstances branch 2 times, most recently from b53aa12 to 1833d73 Compare April 2, 2021 23:30
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 2, 2021
@AndyXiangLi
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 2, 2021
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 7, 2021
pkg/cloud/devicemanager/manager_test.go Outdated Show resolved Hide resolved
pkg/cloud/devicemanager/manager.go Outdated Show resolved Hide resolved
@wongma7
Copy link
Contributor

wongma7 commented Apr 7, 2021

can you describe in detail the exact scenario this is fixing?

If a volume is in detaching state, and we try to attach it back to the same node, what's the problem?

BTW, I have been advocating for us to move to / adopt the in-tree cloudprovider code instead of trying to roll our own and reinvent the wheel.

@vdhanan
Copy link
Contributor Author

vdhanan commented Apr 9, 2021

can you describe in detail the exact scenario this is fixing?

If a volume is in detaching state, and we try to attach it back to the same node, what's the problem?

BTW, I have been advocating for us to move to / adopt the in-tree cloudprovider code instead of trying to roll our own and reinvent the wheel.

@wongma7 if a pod with a volume attached dies, that volume will be detached, since we don't know where the pod will be recreated. the csi driver checks that the volume has been detached by calling DescribeVolumes. if the pod is recreated on the same node, the csi driver will attempt to reattach the volume. however, if the volume appears in the DescribeInstances call (which is only eventually consistent, meaning it can report stale info) during the attach workflow, the driver assumes it's already assigned and doesn't bother trying to attach it. by erroring out if we see a volume in detaching state, we ensure that the attach workflow will retry, hopefully when DescribeInstances is reporting accurately. (apologies if i used any incorrect terminology here)

@AndyXiangLi
Copy link
Contributor

can you describe in detail the exact scenario this is fixing?
If a volume is in detaching state, and we try to attach it back to the same node, what's the problem?
BTW, I have been advocating for us to move to / adopt the in-tree cloudprovider code instead of trying to roll our own and reinvent the wheel.

@wongma7 if a pod with a volume attached dies, that volume will be detached, since we don't know where the pod will be recreated. the csi driver checks that the volume has been detached by calling DescribeVolumes. if the pod is recreated on the same node, the csi driver will attempt to reattach the volume. however, if the volume appears in the DescribeInstances call (which is only eventually consistent, meaning it can report stale info) during the attach workflow, the driver assumes it's already assigned and doesn't bother trying to attach it. by erroring out if we see a volume in detaching state, we ensure that the attach workflow will retry, hopefully when DescribeInstances is reporting accurately. (apologies if i used any incorrect terminology here)

if the volume is appearing in the instance Attachment list regardless the volume status, we thought this volume is attached to the instance, driver will not try to call attachVolume api and wait for volume to be attached https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/cloud/cloud.go#L384
This will take long time to timeout (2800s from comment)

@wongma7
Copy link
Contributor

wongma7 commented Apr 9, 2021

What if the volume gets detached and then in our reattach attempt the DescribeInstances is so stale that it returns the volume is still in state attached? Then we will proceed and then the volume will get unexpectedly detached no?

Notice in the cloud provider code they check the state in each polling attempt and break out if the volume unexpectedly becomes detached. https://github.com/kubernetes/kubernetes/blob/a55bd631728590045b51a4f65bba31aed1415571/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L2205. This is a better solution.

I would like to see either
a) Full design doc that accounts for all possibility and justification why we need to invent our own solution. We are dealing with race condition that could cause user to wait for ~45 mins or cause volumes to unexpectedly get detached/attached, solution has to be thorough.
or
b) Copy the existing cloud provider solution which has already gone through its own review, been tested over multiple kubernetes versions, etc. Especially if we care about migration compatibility, this is IMO the best option, I just never got around to doing it. #393

https://github.com/kubernetes/kubernetes/blob/a55bd631728590045b51a4f65bba31aed1415571/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L2178

@AndyXiangLi
Copy link
Contributor

What if the volume gets detached and then in our reattach attempt the DescribeInstances is so stale that it returns the volume is still in state attached? Then we will proceed and then the volume will get unexpectedly detached no?

Notice in the cloud provider code they check the state in each polling attempt and break out if the volume unexpectedly becomes detached. https://github.com/kubernetes/kubernetes/blob/a55bd631728590045b51a4f65bba31aed1415571/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L2205. This is a better solution.

I would like to see either
a) Full design doc that accounts for all possibility and justification why we need to invent our own solution. We are dealing with race condition that could cause user to wait for ~45 mins or cause volumes to unexpectedly get detached/attached, solution has to be thorough.
or
b) Copy the existing cloud provider solution which has already gone through its own review, been tested over multiple kubernetes versions, etc. Especially if we care about migration compatibility, this is IMO the best option, I just never got around to doing it. #393

https://github.com/kubernetes/kubernetes/blob/a55bd631728590045b51a4f65bba31aed1415571/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L2178

@vdhanan In your testing, did you see the detached volume back to "attached" status? I agree to port in-tree waitForAttachmentStatus to CSI as it is more robust

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 14, 2021
@vdhanan
Copy link
Contributor Author

vdhanan commented Apr 14, 2021

I ported most of the waitForAttachmentStatus function from in-tree. I think after GA we should just consume the vendor code directly like Matthew mentioned.

@AndyXiangLi
Copy link
Contributor

I ported most of the waitForAttachmentStatus function from in-tree. I think after GA we should just consume the vendor code directly like Matthew mentioned.

Can you add some unit test for the updated function?

@wongma7
Copy link
Contributor

wongma7 commented Apr 15, 2021

/lgtm

thanks, I am really more confident if we just copy the code, I know it's not so glamorous to be doing that but since this issue is so tricky to debug and test (hard to reproduce race condition) I think it's best option!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 15, 2021
ctx := context.Background()

switch tc.name {
case "success: detached":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just a style alternative): if u want to avoid depending on test case name you could make these anonymous functions, something like this https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/driver/node_test.go#L69

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's definitely cleaner. i'll use this style next time

name: "failure: already assigned but wrong state",
volumeID: "vol-test-1234",
expectedState: volumeAttachedState,
expectedInstance: "1235",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this say 1234? otherwise this test is giving us a false positive?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's giving false postiive at the moment cuz, we should be testing the case where: if we set alreadyAttached to true, and DescribeVolumes returns that the volume is detached, we want to error. Correct me if wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup you're right, it should be 1234

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2021
@AndyXiangLi
Copy link
Contributor

/lgtm
/approve
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndyXiangLi, vdhanan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 19, 2021
@k8s-ci-robot k8s-ci-robot merged commit 61d72fc into kubernetes-sigs:master Apr 19, 2021
@vdhanan vdhanan deleted the describeInstances branch April 27, 2021 22:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle DescribeInstances eventual consistency
5 participants