
Volume mount fail after some time #103

Closed
Vedrillan opened this issue Nov 26, 2019 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Vedrillan

/kind bug

What happened?
I have pods set up to use an EFS volume via PV/PVC, and it works as expected for the most part. But usually after a few days, pods created for new releases or cronjobs start failing to mount the volume. Here is the event log of a failing pod as an example:

Events:
  Type     Reason       Age                   From                                              Message
  ----     ------       ----                  ----                                              -------
  Warning  FailedMount  7m48s (x86 over 21h)  kubelet, ip-10-4-0-98.eu-west-1.compute.internal  MountVolume.SetUp failed for volume "myproject-web-pv" : rpc error: code = Internal desc = Could not mount "fs-1234abcd:/" at "/var/lib/kubelet/pods/5e30c96b-0f9c-11ea-934f-02efb215d860/volumes/kubernetes.io~csi/myproject-web-pv/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o noac,tls fs-1234abcd:/ /var/lib/kubelet/pods/5e30c96b-0f9c-11ea-934f-02efb215d860/volumes/kubernetes.io~csi/myproject-web-pv/mount
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
  Warning  FailedMount  97s (x584 over 22h)  kubelet, ip-10-4-0-98.eu-west-1.compute.internal  Unable to mount volumes for pod "myproject-web-deployment-8c899c749-95mz6_myproject(5e30c96b-0f9c-11ea-934f-02efb215d860)": timeout expired waiting for volumes to attach or mount for pod "myproject"/"myproject-web-deployment-8c899c749-95mz6". list of unmounted volumes=[tmp-files]. list of unattached volumes=[tmp-files default-token-xb47z]

This error happens for every pod on the same node, so at this point the quickest workaround I found is to simply drain and remove the faulty node, so that all pods are scheduled on another (or new) node where the EFS mount works correctly.
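
Roughly what I run for that, for reference (node name taken from the events above; adjust for your cluster):

# Evict everything from the faulty node, then remove it so replacement pods
# land on a node whose EFS mounts still work.
kubectl drain ip-10-4-0-98.eu-west-1.compute.internal --ignore-daemonsets
kubectl delete node ip-10-4-0-98.eu-west-1.compute.internal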

Environment

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.8-eks-b7174d", GitCommit:"b7174db5ee0e30c94a0b9899c20ac980c0850fc8", GitTreeState:"clean", BuildDate:"2019-10-18T17:56:01Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: image v0.2.0 + stable channel manifests
@k8s-ci-robot added the kind/bug label on Nov 26, 2019
@leakingtapan (Contributor) commented Nov 26, 2019

Here is a similar error: https://forums.aws.amazon.com/thread.jspa?messageID=867294#867294
Also seems related to: aws/efs-utils#23

For the error:

Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"

That is because there is no proper init system present in the container, which causes the efs-utils stunnel watchdog to fail to start here. Although this looks like a problem, it doesn't seem to be the cause of this issue, since the watchdog never starts even for the initial successful mount:

bash-4.2# cat /var/log/amazon/efs/mount.log
2019-11-26 20:49:30,695 - INFO - version=1.9 options={'tls': None, 'rw': None}
2019-11-26 20:49:30,700 - WARNING - Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
2019-11-26 20:49:30,737 - INFO - Starting TLS tunnel: "stunnel /var/run/efs/stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.390d7c5f-108e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv.mount.20388"
2019-11-26 20:49:30,768 - INFO - Started TLS tunnel, pid: 8083
2019-11-26 20:49:30,769 - INFO - Executing: "/sbin/mount.nfs4 127.0.0.1:/ /var/lib/kubelet/pods/390d7c5f-108e-11ea-84e4-02e886441bde/volumes/kubernetes.io~csi/efs-pv/mount -o rw,noresvport,nfsvers=4.1,retrans=2,hard,wsize=1048576,timeo=600,rsize=1048576,port=20388"
2019-11-26 20:49:31,089 - INFO - Successfully mounted fs-e8a95a42.efs.us-west-2.amazonaws.com at /var/lib/kubelet/pods/390d7c5f-108e-11ea-84e4-02e886441bde/volumes/kubernetes.io~csi/efs-pv/mount

Still need to investigate why this is happening.

@leakingtapan (Contributor) commented Nov 26, 2019

The second error is suspicious (see source code here):

Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

So the question is: were there still any available ports in the [20049, 20449] range by the time this issue happened to you?

@Vedrillan

@Vedrillan (Author)

I will check the used ports next time I have this issue on a node.

However, it is very likely that only the EFS driver is using this port range (20049-20449). Would that mean that for some reason it opens a stunnel on each port but (maybe crashes and) never closes them properly, until the whole range is used up and this issue appears?
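
Roughly how I plan to check it (assuming netstat is available on the node; the port bounds come from the error message above):

# Count listening ports in the efs-utils TLS range [20049, 20449]; each tls
# mount gets its own stunnel listener, so a count near 400 means the range
# is nearly exhausted.
netstat -nltp | awk '{split($4, a, ":"); p = a[length(a)] + 0; if (p >= 20049 && p <= 20449) n++} END {print n, "ports in use"}'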

@leakingtapan (Contributor)

Do you have simplified steps to reproduce the issue? This could be the reason; I would like to reproduce it first in order to confirm.

@Vedrillan (Author)

So the issue just happened again. I checked the node, and indeed the entire port range is used by stunnel processes. Here is the output of netstat -nltp on the node: faulty_node_netstat.txt

After a little bit of investigation I found a way to reproduce the issue on my end. When a pod is created, it mounts the EFS volume and creates a new stunnel in the range 20049-20449, but when the pod is deleted, the stunnel is not closed. I could confirm that by counting the number of stunnel connections after recreating a pod on the same node. So if you recreate a pod (with an EFS volume mount) enough times on the same node, the issue will appear again.
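
Something along these lines should reproduce it (efs-test and pod.yaml are placeholders for a pod manifest that mounts the EFS-backed PVC and is pinned to the node under test):

# Create and delete the same pod repeatedly on one node, then count the
# stunnel processes left on that node; a count that grows by one per
# iteration means the tunnels are being leaked.
for i in $(seq 1 20); do
  kubectl apply -f pod.yaml
  kubectl wait --for=condition=Ready pod/efs-test --timeout=120s
  kubectl delete -f pod.yaml --wait
done

# On the node:
pgrep -c stunnel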

Also, here is the log from when I delete a pod with an EFS mount; maybe some information in it can help you as well: pod_deletion_log.txt

@leakingtapan (Contributor) commented Nov 29, 2019

Yep. Thx for the information. I am able to confirm that stunnel is not killed after unmount.

The fix will be to kill the stunnel process during NodeUnpublishVolume for the given target path.

The pid can be found in the efs mount helper state file under /var/run/efs/:

>> ls -al /var/run/efs/
-rw-r--r-- 1 root root  336 Nov 29 00:22 fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083
-rw-r--r-- 1 root root  337 Nov 29 00:22 fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20429
-rw-r--r-- 1 root root  336 Nov 29 00:07 fs-e8a95a42.var.lib.kubelet.pods.4bf0b8f8-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20428
-rw-r--r-- 1 root root  336 Nov 29 00:07 fs-e8a95a42.var.lib.kubelet.pods.4bf0b8f8-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20362
-rw-r--r-- 1 root root  336 Nov 29 00:18 fs-e8a95a42.var.lib.kubelet.pods.c836f47c-123d-11ea-bb3d-0a95942502dc.volumes.kubernetes.io~csi.efs-pv1.mount.20077
-rw-r--r-- 1 root root  336 Nov 29 00:18 fs-e8a95a42.var.lib.kubelet.pods.c836f47c-123d-11ea-bb3d-0a95942502dc.volumes.kubernetes.io~csi.efs-pv2.mount.20152
-rw-r--r-- 1 root root  336 Nov 29 00:11 fs-e8a95a42.var.lib.kubelet.pods.cc6f0b25-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20270
-rw-r--r-- 1 root root  336 Nov 29 00:11 fs-e8a95a42.var.lib.kubelet.pods.cc6f0b25-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20344
-rw-r--r-- 1 root root  384 Nov 29 00:22 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083
-rw-r--r-- 1 root root  384 Nov 29 00:22 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20429
-rw-r--r-- 1 root root  384 Nov 29 00:07 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.4bf0b8f8-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20428
-rw-r--r-- 1 root root  384 Nov 29 00:07 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.4bf0b8f8-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20362
-rw-r--r-- 1 root root  384 Nov 29 00:18 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.c836f47c-123d-11ea-bb3d-0a95942502dc.volumes.kubernetes.io~csi.efs-pv1.mount.20077
-rw-r--r-- 1 root root  384 Nov 29 00:18 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.c836f47c-123d-11ea-bb3d-0a95942502dc.volumes.kubernetes.io~csi.efs-pv2.mount.20152
-rw-r--r-- 1 root root  384 Nov 29 00:11 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.cc6f0b25-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20270
-rw-r--r-- 1 root root  384 Nov 29 00:11 stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.cc6f0b25-123c-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv2.mount.20344
>> cat /var/run/efs/stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083
foreground = yes
fips = no
socket = l:SO_REUSEADDR=yes
socket = a:SO_BINDTODEVICE=lo
[efs]
sslVersion = TLSv1.2
checkHost = fs-e8a95a42.efs.us-west-2.amazonaws.com
verify = 2
accept = 127.0.0.1:20083
TIMEOUTclose = 0
delay = yes
TIMEOUTbusy = 20
client = yes
connect = fs-e8a95a42.efs.us-west-2.amazonaws.com:2049
renegotiation = no
libwrap = no
>> cat /var/run/efs/fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083
{"files": ["/var/run/efs/stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083"], "cmd": ["stunnel", "/var/run/efs/stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount.20083"], "pid": 96}

It's not ideal to mess around with the efs mount helper state files, but that's the easiest workaround I can think of for now.
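
For illustration, a hand-run sketch of that cleanup (assumes jq is available on the node; the target-path fragment is the one from the listing above, and in the driver it would come from the NodeUnpublishVolume request):

# Find the state file(s) whose name contains the target path being unmounted,
# kill the stunnel pid they record, and remove the files the mount helper wrote.
target="var.lib.kubelet.pods.49697c3c-123e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv1.mount"
for state_file in /var/run/efs/fs-*."${target}".*; do
  [ -e "$state_file" ] || continue
  kill "$(jq -r '.pid' "$state_file")" 2>/dev/null || true   # stop the leaked TLS tunnel
  jq -r '.files[]' "$state_file" | xargs -r rm -f            # remove the stunnel config it references
  rm -f "$state_file"
done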

@leakingtapan (Contributor)

Created issue for efs mount helper: aws/efs-utils#35

@leakingtapan (Contributor)

Confirmed from above. This issue will be fixed by: #104

@Vedrillan (Author)

Great news @leakingtapan, do you have any estimates on when the fix could be available?

@leakingtapan (Contributor) commented Dec 9, 2019

@Vedrillan Please see the other thread about progress on implementing the fix

@ironmike-au

I'm still seeing this problem on the latest dev release.

It looks like the docker image for the dev overlay (chengpan/aws-efs-csi-driver) hasn't been updated with the very latest commit. Based on the time of the last change of the image, I believe it's running 4385428 (Dec 30), when it should be running at least 019e989 (Dec 31).

Is it possible to verify the image so that I can test this? Thanks!

@leakingtapan (Contributor)

Please try amazon/aws-efs-csi-driver:latest; it contains the latest build.
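
If it helps, one way to switch the node DaemonSet over (the DaemonSet and container names here are assumed from the stable manifests; adjust to your deployment):

# Point the driver DaemonSet at the latest image and wait for it to roll out.
kubectl -n kube-system set image daemonset/efs-csi-node efs-plugin=amazon/aws-efs-csi-driver:latest
kubectl -n kube-system rollout status daemonset/efs-csi-node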

@ironmike-au

Just confirming that the updated image has fixed the problems I've seen with terminated pods not having their EFS mounts disconnected. Thanks @leakingtapan!

@leakingtapan (Contributor)

Thx for confirming.
/close

@k8s-ci-robot (Contributor)

@leakingtapan: Closing this issue.

In response to this:

Thx for confirming.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
