
Failure to AttachVolume, RequestCanceled: context deadline exceeded #795

Closed
wmgroot opened this issue Mar 13, 2021 · 9 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

wmgroot commented Mar 13, 2021

/kind bug

What happened?

E0312 23:34:38.526557       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0210b73b6a589c69b" to node "i-0deb566b98a30ac3b": RequestCanceled: request context canceled
caused by: context deadline exceeded
E0312 23:34:41.831258       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0b6db8050fef1f0b1" to node "i-03dd65827d3721aa9": RequestCanceled: request context canceled
caused by: context deadline exceeded
E0312 23:34:41.934187       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0fe7ae500d43277de" to node "i-0deb566b98a30ac3b": RequestCanceled: request context canceled
caused by: context canceled
E0312 23:34:41.995988       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0c53048e8a615b38c" to node "i-0deb566b98a30ac3b": RequestCanceled: request context canceled
caused by: context canceled
E0312 23:34:42.056027       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-0cbb38b96747240f5" to node "i-0deb566b98a30ac3b": RequestCanceled: request context canceled
caused by: context canceled
E0312 23:34:42.085334       1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-091861f4d11553da1" to node "i-0deb566b98a30ac3b": RequestCanceled: request context canceled
caused by: context deadline exceeded

What you expected to happen?
I expect EBS Volumes to be attached successfully to a node when a PVC is created.

How to reproduce it (as minimally and precisely as possible)?
This is occurring non-deterministically for us. It seems to happen more frequently with higher volumes of PVC requests.
We are running a few workloads with high Pod/PVC turnover, and we occasionally run into large spikes of this Volume Attachment error.

I was able to reproduce this issue on a fresh node with the StatefulSet below:

  1. Apply the StatefulSet, waiting for all pods to become ready.
  2. Delete the StatefulSet, waiting for all pods to terminate.
  3. Repeat until a Pod fails to create due to a volume mount error (a scripted sketch of this loop follows the manifest).

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-stateful-set
spec:
  selector:
    matchLabels:
      app: my-stateful-set
  replicas: 50
  serviceName: my-stateful-set-service
  template:
    metadata:
      labels:
        app: my-stateful-set
    spec:
      nodeSelector:
        kubernetes.io/hostname: ip-10-118-4-66.us-east-2.compute.internal  # force all pods/volumes onto the same node
      containers:
        - name: my-stateful-set
          image: nginxinc/nginx-unprivileged:1.16-alpine
          ports:
            - name: web
              containerPort: 8080
              protocol: TCP
          volumeMounts:
            - mountPath: /tmp
              name: tmp-volume
            - mountPath: /usr/share/nginx/html
              name: nginx-volume
      volumes:
        - name: tmp-volume
          emptyDir: {}
  volumeClaimTemplates:
  - metadata:
      name: nginx-volume
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      storageClassName: block-v2  # ebs-csi gp3 volume, spec below
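
A scripted version of the repro loop above might look like the following. This is only a sketch, assuming the manifest is saved as my-stateful-set.yaml and the nodeSelector hostname has been edited to match a real node in the cluster; it is not part of the original report.

# Hypothetical repro loop; the file name, label selector, and timeouts are assumptions.
while true; do
  kubectl apply -f my-stateful-set.yaml
  kubectl rollout status statefulset/my-stateful-set --timeout=15m

  # Stop as soon as any pod in the set reports a FailedAttachVolume event.
  if kubectl get events --field-selector reason=FailedAttachVolume | grep -q my-stateful-set; then
    echo "Reproduced: FailedAttachVolume observed"
    break
  fi

  kubectl delete -f my-stateful-set.yaml
  kubectl wait --for=delete pod -l app=my-stateful-set --timeout=15m
done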

Pod Events

Events:
  Type     Reason              Age                From                     Message
  ----     ------              ----               ----                     -------
  Normal   Scheduled           63s                default-scheduler        Successfully assigned default/my-stateful-set-24 to ip-10-118-4-66.us-east-2.compute.internal
  Warning  FailedAttachVolume  15s (x7 over 48s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-e21b1eab-fe7f-40e9-bac8-3d5096a53f46" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

StorageClass

Name:            block-v2
IsDefaultClass:  Yes
Annotations:     storageclass.kubernetes.io/is-default-class=true
Provisioner:           ebs.csi.aws.com
Parameters:            encrypted=true,type=gp3
AllowVolumeExpansion:  True
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     WaitForFirstConsumer
Events:                <none>
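
For reference, a StorageClass manifest matching the describe output above would look roughly like the following. This is a reconstruction from the fields shown, not the manifest actually applied to our cluster.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: block-v2
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  encrypted: "true"
  type: gp3
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer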

Anything else we need to know?:
I had thought this might be an AWS API request rate limit issue, but the errors appear as ClientError metrics, not RequestLimitExceeded, in CloudWatch.

If I examine the Volume in the AWS Console, it shows as "in-use", but it eventually displays a message claiming "Volume stuck in attaching since 1 hour and 38 minutes ago."

This could be an issue entirely on AWS's side of the house.

Additional information from the AWS support technician who was helping us troubleshoot this issue.

    Thank you for contacting AWS premium support. It was a pleasure chatting with you today.
    I wanted to follow up with a summarizing email.

    You opened the chat with us today because some of your EBS volumes were stuck in attaching state.

    During our chat, I verified Instance ID(s): i-072a20eb714eb6779 and informed you that Volume ID(s): vol-091861f4d11553da1 was stuck in attaching state but it was reflecting as attached to the instance in the backend.

    After investigating thoroughly, I could identify that Volume ID(s): vol-091861f4d11553da1 was being attached with an older device name that had been used for different volumes.

    It is possible that the older volumes were not unmounted properly before detaching, so the OS's block device driver did not release the name /dev/xvdbk.

    /dev/xvdbk was used for:

    [+] vol-00269a209dfe16ddb
    [+] vol-0096953dda8a10fd6
    [+] vol-00c75d8acfe0b8e72
    [+] vol-025e9c11fc19af987
    [+] vol-02dfcc6c96c84d6b9

    Now that you are trying to attach another volume with the same device name, the OS thinks it already has a volume with this name.

    A reboot will clear up the entries that are no longer available in the device driver, so you will be able to attach this volume.

    Unfortunately, if a user has initiated a forced detach of an Amazon EBS volume, the block device driver of the Amazon EC2 instance might not immediately release the device name for reuse. Attempting to use that device name when attaching a volume causes the volume to be stuck in the attaching state. You must either choose a different device name or reboot the instance.

    During the chat, after a reboot of your instance (possibly performed by one of your organization members), you were able to attach the volume without any issue.
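
Based on the support explanation above, one way to check for this condition is to compare the device names EC2 reports for the instance with what the node OS actually exposes. The commands below are only a sketch of that check; the region and instance ID are taken from the logs earlier in this issue and may need to be adjusted.

# On the affected node: list the block devices the OS currently knows about.
lsblk
ls -l /dev/xvd* /dev/nvme* 2>/dev/null

# From a machine with AWS credentials: list the volumes EC2 reports as attached
# (or stuck attaching) to the instance, along with their device names.
aws ec2 describe-volumes \
  --region us-east-2 \
  --filters Name=attachment.instance-id,Values=i-0deb566b98a30ac3b \
  --query 'Volumes[].Attachments[].[VolumeId,Device,State]' \
  --output table

A device name that shows up in the EC2 output but has no corresponding block device on the node (or vice versa) would be consistent with the stale /dev/xvdbk entry described by the support technician.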

Environment

  • Kubernetes version (use kubectl version):
kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:00:47Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version:
kubectl describe pod -n kube-system ebs-csi-controller-5b9d676d4f-47dvv
Containers:
  ebs-plugin:
    Image:         k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v0.9.0
@k8s-ci-robot added the kind/bug label Mar 13, 2021
wmgroot commented Mar 15, 2021

Judging from the information given by the AWS support tech, it seems possible that the controller is detaching volumes in a way that leaves Ubuntu's block device driver holding the old device name, or that a device name already in use is being chosen for new attachments.

We do not see this issue when using the in-tree kubernetes.io/aws-ebs provisioner, but that provisioner does not support gp3 volumes.

$ kubectl describe sc block-v1
Name:            block-v1
IsDefaultClass:  No
Annotations:     storageclass.kubernetes.io/is-default-class=false
Provisioner:           kubernetes.io/aws-ebs
Parameters:            type=gp2
AllowVolumeExpansion:  True
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     WaitForFirstConsumer
Events:                <none>

wmgroot commented Mar 15, 2021

Updated with a more specific set of repro steps that should hopefully allow others to observe this behavior.

@AndyXiangLi (Contributor)

Thank you @wmgroot, we will look into this issue.

@AndyXiangLi (Contributor)

I found this issue may be related to #389, and we have a PR to fix it in #801.
Will keep you updated once the PR gets merged.

wmgroot commented Mar 23, 2021

Awesome, appreciate the update.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 21, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 21, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
