Timeouts binding to volume on same node #115

Closed
stefansedich opened this issue Dec 28, 2019 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@stefansedich

stefansedich commented Dec 28, 2019

/kind bug

What happened?

I have started seeing the timeout expired waiting for volumes to attach or mount for pod error when a pod is restarted and lands on the same node, after I let things sit for some time. This is only recent and appears to have started when I updated to the latest EKS AMI, but I cannot prove that for sure at the moment (still digging deeper).

I can have a pod binding /foo using one PVC, and even if I start another pod with a totally separate PVC mounting, say, /bah, I am still seeing timeouts. If that pod moves to a node without any binds at all, everything works.

My current fix is to terminate my nodes; once they come back up, binds will work for some time. For some reason everything works fine at first and I can restart pods happily, but if I come back a number of hours later the problem appears again.

One more thing: looking at the efs-plugin container, I see no more NodePublishVolume calls coming through since the initial binds when I last terminated my nodes to make things happy, which is also odd. I attempted to restart the CSI pods to no avail; the only fix appears to be bringing up new nodes.

What you expected to happen?

For no mount timeouts to occur.

How to reproduce it (as minimally and precisely as possible)?

Create two PVCs for the same EFS volume, each for a separate folder in the root, and create the pods on the same node; the second pod will fail to start due to timeouts. A sketch of the setup follows.
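Roughly, the setup looks like the sketch below (names are hypothetical and fs-xxx stands in for the real filesystem ID; this assumes static provisioning with the efs.csi.aws.com driver and the tls mount option):

# Two static PVs point at the same EFS filesystem; each PVC binds to one of them
# and the consuming pods are scheduled onto the same node.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-foo
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  mountOptions:
    - tls
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxx
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim-foo
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  volumeName: efs-pv-foo
  resources:
    requests:
      storage: 5Gi
EOF
# A second PV/PVC pair (e.g. efs-pv-bah / efs-claim-bah) mirrors the above with the
# same volumeHandle; each pod then mounts its own claim to a separate folder.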

Anything else we need to know?:

Running in EKS using latest AMI.

Environment

  • Kubernetes version (use kubectl version): v1.14.9-eks-c0eccc
  • Driver version: 0.2.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 28, 2019
@stefansedich stefansedich changed the title Recent timeouts binding to volume on same node Timeouts binding to volume on same node Dec 28, 2019
@stefansedich
Author

stefansedich commented Dec 28, 2019

I checked the node in question and saw some dangling NFS mounts, which I unmounted, but even after that no mounts were working.

I then restarted the CSI container on that node and attempted to mount from EFS again, and it does not even show any attempted bind in the logs, just silence.

If I terminate the node and attempt to bind on a fresh node, it works as expected.
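For reference, this is roughly what I ran on the node to find and clear the dangling mounts (the mount path below is an example with placeholders, not the exact one from my node):

# List the NFS mounts kubelet has set up for CSI volumes on this node.
mount -t nfs4 | grep /var/lib/kubelet/pods

# Cross-check against the kernel messages about the unresponsive server.
dmesg | grep -i 'not responding' | tail

# Lazy-unmount a dangling mount point (substitute the real path from the list above).
umount -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount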

@stefansedich
Author

stefansedich commented Dec 28, 2019

Looking at the node where things are failing, I am seeing the following in dmesg too. If I exec into the CSI pod on the node and run mount -t efs fs-xxx:/ /mnt/foo, it mounts fine.

[76828.295688] nfs: server 127.0.0.1 not responding, timed out
[76828.299542] nfs: server 127.0.0.1 not responding, timed out
[77018.755771] nfs: server 127.0.0.1 not responding, timed out

@leakingtapan this is somewhat similar to #103, however in my case I am not overusing the stunnel ports; it is a single pod that runs on these nodes, and it works at first and then stops after a while. I have been using the csi-driver in another environment since it was released without any of these issues; it appears to have reared its head with the last AMI rollout I did. Any more tips for debugging what could be happening?

@leakingtapan
Contributor

leakingtapan commented Dec 29, 2019

From what you said, are you using TLS for encryption in transit? And do you see the stunnel process in the efs-plugin container of the efs-csi-node- pod? If the stunnel process, which the NFS client connects to under TLS mode, has died, that could be a cause of such an issue.
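Something along these lines can be used to check, assuming the driver is deployed as the efs-csi-node DaemonSet in kube-system as in the default manifests:

# Find the efs-csi-node pod running on the affected node.
kubectl get pods -n kube-system -o wide | grep efs-csi-node

# Look for stunnel inside the efs-plugin container; under tls there should be one
# stunnel process per mounted volume (assumes ps is available in the image).
kubectl exec -n kube-system <efs-csi-node-pod> -c efs-plugin -- ps aux | grep '[s]tunnel'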

@stefansedich
Author

stefansedich commented Dec 29, 2019

@leakingtapan yes to TLS. I see no stunnels at all. If I manually jump into the CSI pod on that node and run mount -t efs -o tls fs-xxx:/ /mnt/foo, it mounts without issue; however, using a pod with a PVC I see nothing happen at all apart from a timeout.

I would expect that after restarting the pod I could mount volumes again? What else could be broken here? I am seeing the same behavior on all our new clusters now, and my old one is humming along without issue, which is strange.

I can run my pod with a selector to pin it to the second node in this cluster, and I can see the NodePublishVolume call come through and everything works.

However, if I pin it to the other node I get nothing, not a single line shown in the efs-plugin container logs apart from the startup logs. Everything worked on both nodes last night; I went to bed, woke up, and one node had stopped working. I have witnessed this 3 times now: I replace all nodes, give it a few hours, and one or both of the nodes just stop working.

@stefansedich
Author

BTW this is the only output I see from a freshly restarted efs-plugin container on the node with issues:

I1228 19:13:07.681410       1 mount_linux.go:174] Cannot run systemd-run, assuming non-systemd OS
I1228 19:13:07.681478       1 mount_linux.go:175] systemd-run failed with: exit status 1
I1228 19:13:07.681489       1 mount_linux.go:176] systemd-run output: Failed to create bus connection: No such file or directory
I1228 19:13:07.681897       1 driver.go:83] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I1228 19:13:08.683574       1 node.go:159] NodeGetInfo: called with args

@leakingtapan
Contributor

What's the kubelet version of both worker nodes (the working one and the non-working one)? Since the mount operation is called from kubelet, do you have kubelet logs for both cases?

@stefansedich
Author

stefansedich commented Dec 29, 2019

@leakingtapan both are running v1.14.8-eks-b8860f. I have logs, just let me know what I need to provide; I looked at the API logs and could not find anything initially odd.

Here are the kubelet logs when I restart the CSI pod on the troublesome node. I initially thought the error was something to look deeper into, but after doing some digging it appears as though it could be "normal" to see that.

I see no useful logs around the time I create the pod apart from the usual pod creation and then the timeout; I see nothing related to the expected NodePublishVolume or any errors related to it.

Dec 29 16:35:48 ip-10-234-129-100 kubelet: E1229 16:35:48.956674    4028 plugin_watcher.go:120] error failed to get plugin info using RPC GetInfo at socket /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock, err: rpc error: code = Unimplemented desc = unknown service pluginregistration.Registration when handling create event: "/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock": CREATE
Dec 29 16:35:49 ip-10-234-129-100 containerd: time="2019-12-29T16:35:49.018594055Z" level=info 
Dec 29 16:35:49 ip-10-234-129-100 kubelet: I1229 16:35:49.185928    4028 csi_plugin.go:110] kubernetes.io/csi: Trying to validate a new CSI Driver with name: efs.csi.aws.com endpoint: /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock versions: 1.0.0, foundInDeprecatedDir: false
Dec 29 16:35:49 ip-10-234-129-100 kubelet: I1229 16:35:49.185960    4028 csi_plugin.go:131] kubernetes.io/csi: Register new plugin with name: efs.csi.aws.com at endpoint: /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock
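In case it helps, this is roughly how I am watching kubelet on the node while recreating the pod (the grep patterns are just a guess at the relevant log strings, which can vary by kubelet version):

# Follow kubelet and watch for CSI mount activity for the EFS driver.
journalctl -u kubelet -f | grep -Ei 'efs.csi.aws.com|MountVolume|NodePublishVolume'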

@leakingtapan
Contributor

Ah, I might know the reason why. This seems related to #100, which is the issue you reported a while back. The fix is only in the master branch for now.

What are your PV specs that use the same EFS volume for two PVCs?

@stefansedich
Author

stefansedich commented Dec 29, 2019 via email

@stefansedich
Author

stefansedich commented Dec 29, 2019 via email

@stefansedich
Author

@leakingtapan I figured I would fire up the latest dev build on one of the bad nodes to see if that helps, but it made no difference. Still no NodePublishVolume being called on the plugin. I take it there is some bad state in kubelet somewhere causing this not to be called, but so far I cannot find what it could be.

@leakingtapan
Contributor

I suspect it is something to do with kubelet. But you also mentioned both are using v1.14.8-eks-b8860f, which is surprising. What's the AMI you are using? Have you found any differences between the two AMIs?

@stefansedich
Author

@leakingtapan all nodes are running the same AMI, which is the latest ami-0c13bb9cbfd007e56 in us-west-2. It is also surprising to me. Today I had to keep moving forward, so I brought up new nodes in all our clusters, but it is likely to stop working again within a day or so.

I will look at setting more verbose logging for kubelet to see if that gives me any more clues, as right now I am all out of ideas.
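My rough plan for that is below; on the EKS AMI kubelet runs under systemd, but the exact flag wiring differs between AMI versions, so treat this as a sketch (and note that it overrides any KUBELET_EXTRA_ARGS the bootstrap script already set, which would need to be merged in):

# Raise kubelet log verbosity via a systemd drop-in, then restart kubelet.
sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/90-verbose.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--v=4"
EOF
sudo systemctl daemon-reload
sudo systemctl restart kubelet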

@stefansedich
Author

stefansedich commented Jan 7, 2020

@leakingtapan I can't explain it fully, but things seem stable now. Our clusters were running 1 replica of opa, and ever since I reconfigured it to run multiple instances across AZs I have yet to see this issue; the cluster has been solid with no EFS issues since the last time I brought up new nodes.

I am closing this off for now and will monitor things going forward.
