
Volume mount fail after some time #103

Closed
Vedrillan opened this issue Nov 26, 2019 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Vedrillan

/kind bug

What happened?
I have pods set up to use an EFS volume via PV/PVC, and it works as expected for the most part. But usually after a few days, pods created for new releases or cronjobs start failing to mount the volume. Here is the event log of a failing pod as an example:

Events:
  Type     Reason       Age                   From                                              Message
  ----     ------       ----                  ----                                              -------
  Warning  FailedMount  7m48s (x86 over 21h)  kubelet, ip-10-4-0-98.eu-west-1.compute.internal  MountVolume.SetUp failed for volume "myproject-web-pv" : rpc error: code = Internal desc = Could not mount "fs-1234abcd:/" at "/var/lib/kubelet/pods/5e30c96b-0f9c-11ea-934f-02efb215d860/volumes/kubernetes.io~csi/myproject-web-pv/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o noac,tls fs-1234abcd:/ /var/lib/kubelet/pods/5e30c96b-0f9c-11ea-934f-02efb215d860/volumes/kubernetes.io~csi/myproject-web-pv/mount
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
  Warning  FailedMount  97s (x584 over 22h)  kubelet, ip-10-4-0-98.eu-west-1.compute.internal  Unable to mount volumes for pod "myproject-web-deployment-8c899c749-95mz6_myproject(5e30c96b-0f9c-11ea-934f-02efb215d860)": timeout expired waiting for volumes to attach or mount for pod "myproject"/"myproject-web-deployment-8c899c749-95mz6". list of unmounted volumes=[tmp-files]. list of unattached volumes=[tmp-files default-token-xb47z]

This error happens for every pod on the same node, so at this point the quickest workaround I found is to simply drain and remove the faulty node, so that all pods are scheduled on another (or new) node where the EFS mount works correctly.
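
Roughly what I run for that, for reference (node name taken from the events above; adjust for your cluster):

# Evict everything from the faulty node, then remove it so replacement pods
# land on a node whose EFS mounts still work.
kubectl drain ip-10-4-0-98.eu-west-1.compute.internal --ignore-daemonsets
kubectl delete node ip-10-4-0-98.eu-west-1.compute.internal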

Environment

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.8-eks-b7174d", GitCommit:"b7174db5ee0e30c94a0b9899c20ac980c0850fc8", GitTreeState:"clean", BuildDate:"2019-10-18T17:56:01Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: image v0.2.0 + stable channel manifests
@k8s-ci-robot added the kind/bug label on Nov 26, 2019
@leakingtapan (Contributor) commented Nov 26, 2019

Here is a similar error: https://forums.aws.amazon.com/thread.jspa?messageID=867294#867294
Also seems related to: aws/efs-utils#23

For the error:

Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"

That is because there is no proper init system present in the container, which causes the efs-utils stunnel watchdog to fail to start here. Although this looks like a problem, it doesn't seem to be the cause of this issue, since the watchdog never starts even for the initial successful mount:

bash-4.2# cat /var/log/amazon/efs/mount.log
2019-11-26 20:49:30,695 - INFO - version=1.9 options={'tls': None, 'rw': None}
2019-11-26 20:49:30,700 - WARNING - Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
2019-11-26 20:49:30,737 - INFO - Starting TLS tunnel: "stunnel /var/run/efs/stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.390d7c5f-108e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv.mount.20388"
2019-11-26 20:49:30,768 - INFO - Started TLS tunnel, pid: 8083
2019-11-26 20:49:30,769 - INFO - Executing: "/sbin/mount.nfs4 127.0.0.1:/ /var/lib/kubelet/pods/390d7c5f-108e-11ea-84e4-02e886441bde/volumes/kubernetes.io~csi/efs-pv/mount -o rw,noresvport,nfsvers=4.1,retrans=2,hard,wsize=1048576,timeo=600,rsize=1048576,port=20388"
2019-11-26 20:49:31,089 - INFO - Successfully mounted fs-e8a95a42.efs.us-west-2.amazonaws.com at /var/lib/kubelet/pods/390d7c5f-108e-11ea-84e4-02e886441bde/volumes/kubernetes.io~csi/efs-pv/mount

Still need to investigate why this is happening.

@leakingtapan (Contributor) commented Nov 26, 2019

The second error is suspicious (see source code here):

Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

So the question is: were there still any available ports in the [20049, 20449] range by the time this issue happened to you?

@Vedrillan

@Vedrillan (Author)

I will check the used ports next time I have this issue on a node.

However, it is very likely that only the EFS driver is using this port range (20049-20449). Would that mean that for some reason it opens a stunnel on each port but (maybe crashes and) never closes them properly, until the whole range is used up and this issue appears?
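
Roughly how I plan to check it (assuming netstat is available on the node; the port bounds come from the error message above):

# Count listening ports in the efs-utils TLS range [20049, 20449]; each tls
# mount gets its own stunnel listener, so a count near 400 means the range
# is nearly exhausted.
netstat -nltp | awk '{split($4, a, ":"); p = a[length(a)] + 0; if (p >= 20049 && p <= 20449) n++} END {print n, "ports in use"}'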

@leakingtapan (Contributor)

Do you have simplified steps to reproduce the issue? This could be the reason; I would like to reproduce it first in order to confirm.

@Vedrillan (Author)

So the issue just happened again. I checked the node, and indeed the entire port range is used by stunnel processes. Here is the output of netstat -nltp on the node: faulty_node_netstat.txt

After a little bit of investigation I found a way to reproduce the issue on my end. When a pod is created, it mounts the EFS volume and creates a new stunnel in the range 20049-20449, but when the pod is deleted, the stunnel is not closed. I could confirm that by counting the number of stunnel connections after recreating a pod on the same node. So if you recreate a pod (with an EFS volume mount) enough times on the same node, the issue will appear again.
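
Something along these lines should reproduce it (efs-test and pod.yaml are placeholders for a pod manifest that mounts the EFS-backed PVC and is pinned to the node under test):

# Create and delete the same pod repeatedly on one node, then count the
# stunnel processes left on that node; a count that grows by one per
# iteration means the tunnels are being leaked.
for i in $(seq 1 20); do
  kubectl apply -f pod.yaml
  kubectl wait --for=condition=Ready pod/efs-test --timeout=120s
  kubectl delete -f pod.yaml --wait
done

# On the node:
pgrep -c stunnel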

Also, here is the log from when I delete a pod with an EFS mount; maybe some information in it can help you as well: pod_deletion_log.txt

@leakingtapan (Contributor) commented Nov 29, 2019

Yep. Thx for the information. I am able to confirm that stunnel is not killed after unmount.

The fix will be to kill the stunnel process during NodeUnpublishVolume for the given target path.

The pid can be found in the efs mount helper state file under /var/run/efs/:

>> ls -al /var/run/efs/
-rw-r--r-- 1 root root  336 Nov 29 00:22 fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083
-rw-r--r-- 1 root root  337 Nov 29 00:22 fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20429
-rw-r--r-- 1 root root  336 Nov 29 00:07 fs-e8a95a42.var.lib.kubelet.pods.4bf0b8f8-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20428
-rw-r--r-- 1 root root  336 Nov 29 00:07 fs-e8a95a42.var.lib.kubelet.pods.4bf0b8f8-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20362
-rw-r--r-- 1 root root  336 Nov 29 00:18 fs-e8a95a42.var.lib.kubelet.pods.c836f47c-123d-11ea-bb3d-0a95942502dc.volumes.kubernetes.io~csi.efs-pv1.mount.20077
-rw-r--r-- 1 root root  336 Nov 29 00:18 fs-e8a95a42.var.lib.kubelet.pods.c836f47c-123d-11ea-bb3d-0a95942502dc.volumes.kubernetes.io~csi.efs-pv2.mount.20152
-rw-r--r-- 1 root root  336 Nov 29 00:11 fs-e8a95a42.var.lib.kubelet.pods.cc6f0b25-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20270
-rw-r--r-- 1 root root  336 Nov 29 00:11 fs-e8a95a42.var.lib.kubelet.pods.cc6f0b25-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20344
-rw-r--r-- 1 root root  384 Nov 29 00:22 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083
-rw-r--r-- 1 root root  384 Nov 29 00:22 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20429
-rw-r--r-- 1 root root  384 Nov 29 00:07 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.4bf0b8f8-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20428
-rw-r--r-- 1 root root  384 Nov 29 00:07 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.4bf0b8f8-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20362
-rw-r--r-- 1 root root  384 Nov 29 00:18 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.c836f47c-123d-11ea-bb3d-0a95942502dc.volumes.kubernetes.io~csi.efs-pv1.mount.20077
-rw-r--r-- 1 root root  384 Nov 29 00:18 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.c836f47c-123d-11ea-bb3d-0a95942502dc.volumes.kubernetes.io~csi.efs-pv2.mount.20152
-rw-r--r-- 1 root root  384 Nov 29 00:11 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.cc6f0b25-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20270
-rw-r--r-- 1 root root  384 Nov 29 00:11 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.cc6f0b25-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20344
>> cat /var/run/efs/stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083
foreground = yes
fips = no
socket = l:SO_REUSEADDR=yes
socket = a:SO_BINDTODEVICE=lo
[efs]
sslVersion = TLSv1.2
checkHost = fs-e8a95a42.efs.us-west-2.amazonaws.com
verify = 2
accept = 127.0.0.1:20083
TIMEOUTclose = 0
delay = yes
TIMEOUTbusy = 20
client = yes
connect = fs-e8a95a42.efs.us-west-2.amazonaws.com:2049
renegotiation = no
libwrap = no
>> cat /var/run/efs/fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083
{"files": ["/var/run/efs/stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083"], "cmd": ["stunnel", "/var/run/efs/stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083"], "pid": 96}

It's not ideal to mess around with the efs mount helper state files, but that's the easiest workaround I can think of for now.
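
For illustration, a hand-run sketch of that cleanup (assumes jq is available on the node; the target-path fragment is the one from the listing above, and in the driver it would come from the NodeUnpublishVolume request):

# Find the state file(s) whose name contains the target path being unmounted,
# kill the stunnel pid they record, and remove the files the mount helper wrote.
target="var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount"
for state_file in /var/run/efs/fs-*."${target}".*; do
  [ -e "$state_file" ] || continue
  kill "$(jq -r '.pid' "$state_file")" 2>/dev/null || true   # stop the leaked TLS tunnel
  jq -r '.files[]' "$state_file" | xargs -r rm -f            # remove the stunnel config it references
  rm -f "$state_file"
done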

@leakingtapan (Contributor)

Created issue for efs mount helper: aws/efs-utils#35

@leakingtapan (Contributor)

Confirmed from above. This issue will be fixed by: #104

@Vedrillan (Author)

Great news @leakingtapan, do you have any estimates on when the fix could be available?

@leakingtapan (Contributor) commented Dec 9, 2019

@Vedrillan Please see the other thread about progress on implementing the fix

@ironmike-au

I'm still seeing this problem on the latest dev release.

It looks like the docker image for the dev overlay (chengpan/aws-efs-csi-driver) hasn't been updated with the very latest commit. Based on the time of the last change of the image, I believe it's running 4385428 (Dec 30), when it should be running at least 019e989 (Dec 31).

Is it possible to verify the image so that I can test this? Thanks!

@leakingtapan (Contributor)

Please try amazon/aws-efs-csi-driver:latest; it contains the latest build.
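
If it helps, one way to switch the node DaemonSet over (the DaemonSet and container names here are assumed from the stable manifests; adjust to your deployment):

# Point the driver DaemonSet at the latest image and wait for it to roll out.
kubectl -n kube-system set image daemonset/efs-csi-node efs-plugin=amazon/aws-efs-csi-driver:latest
kubectl -n kube-system rollout status daemonset/efs-csi-node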

@ironmike-au

Just confirming that the updated image has fixed the problems I've seen with terminated pods not having their EFS mounts disconnected. Thanks @leakingtapan!

@leakingtapan (Contributor)

Thx for confirming.
/close

@k8s-ci-robot (Contributor)

@leakingtapan: Closing this issue.

In response to this:

Thx for confirming.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
