Volume mount fail after some time #103
Comments
Here is a similar error: https://forums.aws.amazon.com/thread.jspa?messageID=867294#867294

For the error: that is because there is no proper init system present in the container, which causes the EFS stunnel watchdog to fail to start here. Although that looks like a problem, it does not seem to be the cause of this issue, since the watchdog never starts even during the initial successful mount:

bash-4.2# cat /var/log/amazon/efs/mount.log
2019-11-26 20:49:30,695 - INFO - version=1.9 options={'tls': None, 'rw': None}
2019-11-26 20:49:30,700 - WARNING - Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
2019-11-26 20:49:30,737 - INFO - Starting TLS tunnel: "stunnel /var/run/efs/stunnel-config.fs-e8a95a42.var.lib.kubelet.pods.390d7c5f-108e-11ea-84e4-02e886441bde.volumes.kubernetes.io~csi.efs-pv.mount.20388"
2019-11-26 20:49:30,768 - INFO - Started TLS tunnel, pid: 8083
2019-11-26 20:49:30,769 - INFO - Executing: "/sbin/mount.nfs4 127.0.0.1:/ /var/lib/kubelet/pods/390d7c5f-108e-11ea-84e4-02e886441bde/volumes/kubernetes.io~csi/efs-pv/mount -o rw,noresvport,nfsvers=4.1,retrans=2,hard,wsize=1048576,timeo=600,rsize=1048576,port=20388"
2019-11-26 20:49:31,089 - INFO - Successfully mounted fs-e8a95a42.efs.us-west-2.amazonaws.com at /var/lib/kubelet/pods/390d7c5f-108e-11ea-84e4-02e886441bde/volumes/kubernetes.io~csi/efs-pv/mount

Still need to investigate why it is happening.
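As a side note on the "unrecognized init system" warning: the mount helper appears to detect the init system by reading /proc/1/comm, which the kernel truncates to 15 characters, which would explain why the driver container shows up as "aws-efs-csi-dri". Below is a minimal sketch of that kind of detection logic, just to illustrate the assumption; it is not the actual efs-utils code, and the set of recognized init systems is assumed.

```python
# Sketch of an init-system check based on /proc/1/comm, illustrating why the
# warning shows a truncated name inside a container. Not the efs-utils code.

def get_init_system(comm_file="/proc/1/comm"):
    try:
        with open(comm_file) as f:
            # /proc/<pid>/comm is truncated by the kernel to 15 characters,
            # so "aws-efs-csi-driver" would appear as "aws-efs-csi-dri".
            return f.read().strip()
    except OSError:
        return "unknown"

init = get_init_system()
if init not in ("init", "systemd"):  # recognized set is an assumption here
    print('Could not start amazon-efs-mount-watchdog, '
          'unrecognized init system "%s"' % init)
```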
The second error is suspicious (see source code here):
So the question is: did you have any port available in the [20049, 20449] range at the time this issue happened to you?
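If it helps, one quick way to check on a node is to try binding each port in that range on localhost and count how many are still free. This is just a diagnostic sketch; the range [20049, 20449] is taken from the discussion above, and this is not what the mount helper itself does.

```python
import socket

# Diagnostic sketch: count how many ports in the assumed TLS-tunnel range
# [20049, 20449] are still free on 127.0.0.1.
def free_ports(start=20049, end=20449, host="127.0.0.1"):
    free = []
    for port in range(start, end + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
            free.append(port)
        except OSError:
            pass  # port already in use, possibly by a leaked stunnel
        finally:
            s.close()
    return free

if __name__ == "__main__":
    print("free ports in range:", len(free_ports()))
```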
I will check the used ports next time I have this issue on a node. However, it is very likely that only the EFS driver is using this port range (20049-20449), so would that mean that for some reason it is opening a stunnel on each port but (maybe crashing and) never closing them properly, until the entire range is used and this issue appears?
Do you have simplified steps to reproduce the issue? This could be the reason; I would like to reproduce it first in order to confirm.
So the issue just happened again. I checked the node and indeed the entire port range is used by stunnel processes. Here is the result of

After a little bit of investigation I could find a way to reproduce the issue on my end. When a pod is created, it mounts the EFS volume, creating a new stunnel on a port in the range 20049-20449, but when the pod is deleted, the stunnel is not closed. I could confirm that by counting the number of stunnel processes after a pod recreation on the same node. So if you recreate a pod (with an EFS volume mount) enough times on the same node, the issue will appear again.

Also, here is the log from when I delete a pod with an EFS mount; maybe some information in it can help you as well: pod_deletion_log.txt
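For reference, this is roughly how the stunnel processes can be counted on the node. A small sketch; the command-line filter is an assumption based on the "Starting TLS tunnel" line in mount.log above.

```python
import os

# Sketch: count stunnel processes started for EFS TLS mounts by scanning /proc.
# The "stunnel .../var/run/efs/stunnel-config" pattern is assumed from the
# mount.log excerpt earlier in this thread.
def count_efs_stunnels():
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited while scanning
        if "stunnel" in cmdline and "/var/run/efs/stunnel-config" in cmdline:
            count += 1
    return count

print("EFS stunnel processes on this node:", count_efs_stunnels())
```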
Yep, thx for the information. I am able to confirm that. The fix will be killing the stunnel process during unmount. The pid could be found using the efs mount helper state file (under the same /var/run/efs directory as the stunnel config shown in the log above).

It's not ideal to mess around with the efs mount helper state file, but that is the easy workaround I can think of for now.
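A rough sketch of that workaround idea follows, assuming the state file is a JSON document that records the stunnel pid for each mount; the file-naming convention and schema here are assumptions for illustration, and the real fix (#104) lives in the driver itself rather than in a script like this.

```python
import json
import os
import signal

STATE_DIR = "/var/run/efs"  # same directory as the stunnel config in the log above

# Sketch of the proposed workaround: on unmount, find the mount-helper state
# file belonging to the target mount point and terminate its stunnel process.
# The "pid" field and the dotted-path file naming are assumptions.
def kill_stunnel_for(mount_point):
    token = mount_point.strip("/").replace("/", ".")
    for name in os.listdir(STATE_DIR):
        if token not in name or name.startswith("stunnel-config"):
            continue
        with open(os.path.join(STATE_DIR, name)) as f:
            state = json.load(f)
        pid = state.get("pid")
        if pid:
            try:
                os.kill(pid, signal.SIGTERM)  # stop the leaked TLS tunnel
            except ProcessLookupError:
                pass  # stunnel already exited
```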
Created issue for efs mount helper: aws/efs-utils#35
Confirmed from the above. This issue will be fixed by #104.
Great news @leakingtapan, do you have any estimate of when the fix could be available?
@Vedrillan Please see the other thread about progress on implementing the fix.
I'm still seeing this problem on the latest dev release. It looks like the docker image for the dev overlay (chengpan/aws-efs-csi-driver) hasn't been updated with the very latest commit. Based on the time of the last change of the image, I believe it's running 4385428 (Dec 30), when it should be running at least 019e989 (Dec 31). Is it possible to verify the image so that I can test this? Thanks!
Please try amazon/aws-efs-csi-driver:latest; it contains the latest build.
Just confirming that the updated image has fixed the problems I've seen with terminated pods not having their EFS mounts disconnected. Thanks @leakingtapan!
Thx for confirming.
@leakingtapan: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind bug
What happened?
I have pods set up to use an EFS volume via PV/PVC, and it works as expected for the most part. But usually after a few days, pods created for new releases or cronjobs start to fail to mount the volume. Here is an event log of a failing pod as an example:
This error happens for every pod on the same node, so at this point the quickest workaround I found is to simply drain and remove the faulty node, so that all pods are scheduled on another (or new) node where the EFS mount works correctly.
Environment
Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.8-eks-b7174d", GitCommit:"b7174db5ee0e30c94a0b9899c20ac980c0850fc8", GitTreeState:"clean", BuildDate:"2019-10-18T17:56:01Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}