
Failure to AttachVolume, RequestCanceled: context deadline exceeded #795

Closed
wmgroot opened this issue Mar 13, 2021 · 9 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

wmgroot commented Mar 13, 2021

/kind bug

What happened?

E0312 23:34:38.526557       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0210b73b6a589c69b" to node "i-0deb566b98a30ac3b": RequestCanceled: request context canceled
caused by: context deadline exceeded
E0312 23:34:41.831258       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0b6db8050fef1f0b1" to node "i-03dd65827d3721aa9": RequestCanceled: request context canceled
caused by: context deadline exceeded
E0312 23:34:41.934187       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0fe7ae500d43277de" to node "i-0deb566b98a30ac3b": RequestCanceled: request context canceled
caused by: context canceled
E0312 23:34:41.995988       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0c53048e8a615b38c" to node "i-0deb566b98a30ac3b": RequestCanceled: request context canceled
caused by: context canceled
E0312 23:34:42.056027       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0cbb38b96747240f5" to node "i-0deb566b98a30ac3b": RequestCanceled: request context canceled
caused by: context canceled
E0312 23:34:42.085334       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-091861f4d11553da1" to node "i-0deb566b98a30ac3b": RequestCanceled: request context canceled
caused by: context deadline exceeded

What you expected to happen?
I expect EBS Volumes to be attached successfully to a node when a PVC is created.

How to reproduce it (as minimally and precisely as possible)?
This is occurring non-deterministically for us. It seems to happen more frequently with higher volumes of PVC requests.
We are running a few workloads with high Pod/PVC turnover, and we occasionally run into large spikes of this Volume Attachment error.

I was able to reproduce this issue on a fresh node with the StatefulSet below:

  1. Apply the StatefulSet, waiting for all pods to become ready.
  2. Delete the StatefulSet, waiting for all pods to terminate.
  3. Repeat until a Pod fails to create due to a volume mount error (a scripted sketch of this loop follows the manifest).

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-stateful-set
spec:
  selector:
    matchLabels:
      app: my-stateful-set
  replicas: 50
  serviceName: my-stateful-set-service
  template:
    metadata:
      labels:
        app: my-stateful-set
    spec:
      nodeSelector:
        kubernetes.io/hostname: ip-10-118-4-66.us-east-2.compute.internal  # force all pods/volumes onto the same node
      containers:
        - name: my-stateful-set
          image: nginxinc/nginx-unprivileged:1.16-alpine
          ports:
            - name: web
              containerPort: 8080
              protocol: TCP
          volumeMounts:
            - mountPath: /tmp
              name: tmp-volume
            - mountPath: /usr/share/nginx/html
              name: nginx-volume
      volumes:
        - name: tmp-volume
          emptyDir: {}
  volumeClaimTemplates:
  - metadata:
      name: nginx-volume
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      storageClassName: block-v2  # ebs-csi gp3 volume, spec below
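
A scripted version of the repro loop above might look like the following. This is only a sketch, assuming the manifest is saved as my-stateful-set.yaml and the nodeSelector hostname has been edited to match a real node in the cluster; it is not part of the original report.

# Hypothetical repro loop; the file name, label selector, and timeouts are assumptions.
while true; do
  kubectl apply -f my-stateful-set.yaml
  kubectl rollout status statefulset/my-stateful-set --timeout=15m

  # Stop as soon as any pod in the set reports a FailedAttachVolume event.
  if kubectl get events --field-selector reason=FailedAttachVolume | grep -q my-stateful-set; then
    echo "Reproduced: FailedAttachVolume observed"
    break
  fi

  kubectl delete -f my-stateful-set.yaml
  kubectl wait --for=delete pod -l app=my-stateful-set --timeout=15m
done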

Pod Events

Events:
  Type     Reason              Age                From                     Message
  ----     ------              ----               ----                     -------
  Normal   Scheduled           63s                default-scheduler        Successfully assigned default/my-stateful-set-24 to ip-10-118-4-66.us-east-2.compute.internal
  Warning  FailedAttachVolume  15s (x7 over 48s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-e21b1eab-fe7f-40e9-bac8-3d5096a53f46" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

StorageClass

Name:            block-v2
IsDefaultClass:  Yes
Annotations:     storageclass.kubernetes.io/is-default-class=true
Provisioner:           ebs.csi.aws.com
Parameters:            encrypted=true,type=gp3
AllowVolumeExpansion:  True
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     WaitForFirstConsumer
Events:                <none>
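
For reference, a StorageClass manifest matching the describe output above would look roughly like the following. This is a reconstruction from the fields shown, not the manifest actually applied to our cluster.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: block-v2
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  encrypted: "true"
  type: gp3
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer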

Anything else we need to know?:
I had thought this might be an AWS API request rate limit issue, but the errors appear as ClientError metrics, not RequestLimitExceeded, in CloudWatch.

If I examine the Volume in the AWS Console, it shows as "in-use", but it eventually displays a message claiming "Volume stuck in attaching since 1 hour and 38 minutes ago."

This could be an issue entirely on AWS's side of the house.

Additional information from the AWS support technician who was helping us troubleshoot this issue.

    Thank you for contacting AWS premium support. It was a pleasure chatting with you today.
    I wanted to follow up with a summarizing email.

    You opened the chat with us today because some of your EBS volumes were stuck in attaching state.

    During our chat, I verified Instance ID(s): i-072a20eb714eb6779 and informed you that Volume ID(s): vol-091861f4d11553da1 was stuck in attaching state but it was reflecting as attached to the instance in the backend.

    After investigating thoroughly, I could identify that Volume ID(s): vol-091861f4d11553da1 was being attached with an older device name that had been used for different volumes.

    It is possible that the older volumes were not unmounted properly before detaching, so the OS's block device driver did not release the name /dev/xvdbk.

    /dev/xvdbk was used for:

    [+] vol-00269a209dfe16ddb
    [+] vol-0096953dda8a10fd6
    [+] vol-00c75d8acfe0b8e72
    [+] vol-025e9c11fc19af987
    [+] vol-02dfcc6c96c84d6b9

    Now that you are trying to attach another volume with the same device name, the OS thinks it already has a volume with this name.

    A reboot will clear up the entries that are no longer available in the device driver, so you will be able to attach this volume.

    Unfortunately, if a user has initiated a forced detach of an Amazon EBS volume, the block device driver of the Amazon EC2 instance might not immediately release the device name for reuse. Attempting to use that device name when attaching a volume causes the volume to be stuck in the attaching state. You must either choose a different device name or reboot the instance.

    During the chat, after a reboot of your instance (possibly performed by one of your organization members), you were able to attach the volume without any issue.
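
Based on the support explanation above, one way to check for this condition is to compare the device names EC2 reports for the instance with what the node OS actually exposes. The commands below are only a sketch of that check; the region and instance ID are taken from the logs earlier in this issue and may need to be adjusted.

# On the affected node: list the block devices the OS currently knows about.
lsblk
ls -l /dev/xvd* /dev/nvme* 2>/dev/null

# From a machine with AWS credentials: list the volumes EC2 reports as attached
# (or stuck attaching) to the instance, along with their device names.
aws ec2 describe-volumes \
  --region us-east-2 \
  --filters Name=attachment.instance-id,Values=i-0deb566b98a30ac3b \
  --query 'Volumes[].Attachments[].[VolumeId,Device,State]' \
  --output table

A device name that shows up in the EC2 output but has no corresponding block device on the node (or vice versa) would be consistent with the stale /dev/xvdbk entry described by the support technician.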

Environment

  • Kubernetes version (use kubectl version):
kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:00:47Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version:
kubectl describe pod -n kube-system ebs-csi-controller-5b9d676d4f-47dvv
Containers:
  ebs-plugin:
    Image:         k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v0.9.0
@k8s-ci-robot added the kind/bug label Mar 13, 2021
wmgroot commented Mar 15, 2021

Judging from the information given by the AWS support tech, it seems possible that the controller is detaching volumes in a way that leaves Ubuntu's block device driver holding the old device name, or that a device name already in use is being chosen for new attachments.

We do not see this issue when using the in-tree kubernetes.io/aws-ebs provisioner, but that provisioner does not support gp3 volumes.

$ kubectl describe sc block-v1
Name:            block-v1
IsDefaultClass:  No
Annotations:     storageclass.kubernetes.io/is-default-class=false
Provisioner:           kubernetes.io/aws-ebs
Parameters:            type=gp2
AllowVolumeExpansion:  True
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     WaitForFirstConsumer
Events:                <none>

wmgroot commented Mar 15, 2021

Updated with a more specific set of repro steps that should hopefully allow others to observe this behavior.

@AndyXiangLi (Contributor)

Thank you @wmgroot, we will look into this issue.

@AndyXiangLi (Contributor)

I found this issue may be related to #389, and we have a PR to fix it in #801.
Will keep you updated once the PR gets merged.

wmgroot commented Mar 23, 2021

Awesome, appreciate the update.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 21, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 21, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
