Timeouts binding to volume on same node #115

Closed
stefansedich opened this issue Dec 28, 2019 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@stefansedich

stefansedich commented Dec 28, 2019

/kind bug

What happened?

I have started seeing the timeout expired waiting for volumes to attach or mount for pod error when a pod is restarted and lands on the same node, after I let things sit for some time. This is only recent and appears to have started when I updated to the latest EKS AMI, but I cannot prove that for sure at the moment (still digging deeper).

I can have a pod binding /foo using one PVC, and even if I start another pod with a totally separate PVC mounting, say, /bah, I am still seeing timeouts. If that pod moves to a node without any binds at all, everything works.

My current fix is to terminate my nodes; once they come back up, binds will work for some time. For some reason everything works fine at first and I can restart pods happily, but if I come back a number of hours later the problem appears again.

One more thing: looking at the efs-plugin container, I see no more NodePublishVolume calls coming through since the initial binds when I last terminated my nodes to make things happy, which is also odd. I attempted to restart the CSI pods to no avail; the only fix appears to be bringing up new nodes.

What you expected to happen?

For no mount timeouts to occur.

How to reproduce it (as minimally and precisely as possible)?

Create two PVCs for the same EFS volume, each for a separate folder in the root, and create the pods on the same node; the second pod will fail to start due to timeouts. A sketch of the setup follows.
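Roughly, the setup looks like the sketch below (names are hypothetical and fs-xxx stands in for the real filesystem ID; this assumes static provisioning with the efs.csi.aws.com driver and the tls mount option):

# Two static PVs point at the same EFS filesystem; each PVC binds to one of them
# and the consuming pods are scheduled onto the same node.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-foo
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  mountOptions:
    - tls
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxx
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim-foo
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  volumeName: efs-pv-foo
  resources:
    requests:
      storage: 5Gi
EOF
# A second PV/PVC pair (e.g. efs-pv-bah / efs-claim-bah) mirrors the above with the
# same volumeHandle; each pod then mounts its own claim to a separate folder.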

Anything else we need to know?:

Running in EKS using latest AMI.

Environment

  • Kubernetes version (use kubectl version): v1.14.9-eks-c0eccc
  • Driver version: 0.2.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 28, 2019
@stefansedich stefansedich changed the title Recent timeouts binding to volume on same node Timeouts binding to volume on same node Dec 28, 2019
@stefansedich
Author

stefansedich commented Dec 28, 2019

I checked the node in question and saw some dangling NFS mounts, which I unmounted, but even after that no mounts were working.

I then restarted the CSI container on that node and attempted to mount from EFS again, and it does not even show any attempted bind in the logs, just silence.

If I terminate the node and attempt to bind on a fresh node, it works as expected.
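For reference, this is roughly what I ran on the node to find and clear the dangling mounts (the mount path below is an example with placeholders, not the exact one from my node):

# List the NFS mounts kubelet has set up for CSI volumes on this node.
mount -t nfs4 | grep /var/lib/kubelet/pods

# Cross-check against the kernel messages about the unresponsive server.
dmesg | grep -i 'not responding' | tail

# Lazy-unmount a dangling mount point (substitute the real path from the list above).
umount -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount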

@stefansedich
Author

stefansedich commented Dec 28, 2019

Looking at the node where things are failing, I am seeing the following in dmesg too. If I exec into the CSI pod on the node and run mount -t efs fs-xxx:/ /mnt/foo, it mounts fine.

[76828.295688] nfs: server 127.0.0.1 not responding, timed out
[76828.299542] nfs: server 127.0.0.1 not responding, timed out
[77018.755771] nfs: server 127.0.0.1 not responding, timed out

@leakingtapan this is somewhat similar to #103, however in my case I am not overusing the stunnel ports; it is a single pod that runs on these nodes, and it works at first and then stops after a while. I have been using the csi-driver in another environment since it was released without any of these issues; it appears to have reared its head with the last AMI rollout I did. Any more tips for debugging what could be happening?

@leakingtapan
Contributor

leakingtapan commented Dec 29, 2019

From what you said, are you using TLS for encryption in transit? And do you see the stunnel process in the efs-plugin container of the efs-csi-node- pod? If the stunnel process, which the NFS client connects to under TLS mode, has died, that could be a cause of such an issue.
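Something along these lines can be used to check, assuming the driver is deployed as the efs-csi-node DaemonSet in kube-system as in the default manifests:

# Find the efs-csi-node pod running on the affected node.
kubectl get pods -n kube-system -o wide | grep efs-csi-node

# Look for stunnel inside the efs-plugin container; under tls there should be one
# stunnel process per mounted volume (assumes ps is available in the image).
kubectl exec -n kube-system <efs-csi-node-pod> -c efs-plugin -- ps aux | grep '[s]tunnel'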

@stefansedich
Author

stefansedich commented Dec 29, 2019

@leakingtapan yes to TLS. I see no stunnels at all. If I manually jump into the CSI pod on that node and run mount -t efs -o tls fs-xxx:/ /mnt/foo, it mounts without issue; however, using a pod with a PVC I see nothing happen at all apart from a timeout.

I would expect that after restarting the pod I could mount volumes again? What else could be broken here? I am seeing the same behavior on all our new clusters now, and my old one is humming along without issue, which is strange.

I can run my pod with a selector to pin it to the second node in this cluster, and I can see the NodePublishVolume call come through and everything works.

However, if I pin it to the other node I get nothing, not a single line shown in the efs-plugin container logs apart from the startup logs. Everything worked on both nodes last night; I went to bed, woke up, and one node had stopped working. I have witnessed this 3 times now: I replace all nodes, give it a few hours, and one or both of the nodes just stop working.

@stefansedich
Author

BTW this is the only output I see from a freshly restarted efs-plugin container on the node with issues:

I1228 19:13:07.681410       1 mount_linux.go:174] Cannot run systemd-run, assuming non-systemd OS
I1228 19:13:07.681478       1 mount_linux.go:175] systemd-run failed with: exit status 1
I1228 19:13:07.681489       1 mount_linux.go:176] systemd-run output: Failed to create bus connection: No such file or directory
I1228 19:13:07.681897       1 driver.go:83] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I1228 19:13:08.683574       1 node.go:159] NodeGetInfo: called with args

@leakingtapan
Contributor

What's the kubelet version of both worker nodes (the working one and the non-working one)? Since the mount operation is called from kubelet, do you have kubelet logs for both cases?

@stefansedich
Author

stefansedich commented Dec 29, 2019

@leakingtapan both are running v1.14.8-eks-b8860f. I have logs, just let me know what I need to provide; I looked at the API logs and could not find anything initially odd.

Here are the kubelet logs when I restart the CSI pod on the troublesome node. I initially thought the error was something to look deeper into, but after doing some digging it appears as though it could be "normal" to see that.

I see no useful logs around the time I create the pod apart from the usual pod creation and then the timeout; I see nothing related to the expected NodePublishVolume or any errors related to it.

Dec 29 16:35:48 ip-10-234-129-100 kubelet: E1229 16:35:48.956674    4028 plugin_watcher.go:120] error failed to get plugin info using RPC GetInfo at socket /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock, err: rpc error: code = Unimplemented desc = unknown service pluginregistration.Registration when handling create event: "/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock": CREATE
Dec 29 16:35:49 ip-10-234-129-100 containerd: time="2019-12-29T16:35:49.018594055Z" level=info 
Dec 29 16:35:49 ip-10-234-129-100 kubelet: I1229 16:35:49.185928    4028 csi_plugin.go:110] kubernetes.io/csi: Trying to validate a new CSI Driver with name: efs.csi.aws.com endpoint: /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock versions: 1.0.0, foundInDeprecatedDir: false
Dec 29 16:35:49 ip-10-234-129-100 kubelet: I1229 16:35:49.185960    4028 csi_plugin.go:131] kubernetes.io/csi: Register new plugin with name: efs.csi.aws.com at endpoint: /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock
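In case it helps, this is roughly how I am watching kubelet on the node while recreating the pod (the grep patterns are just a guess at the relevant log strings, which can vary by kubelet version):

# Follow kubelet and watch for CSI mount activity for the EFS driver.
journalctl -u kubelet -f | grep -Ei 'efs.csi.aws.com|MountVolume|NodePublishVolume'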

@leakingtapan
Contributor

Ah, I might know the reason why. This seems related to #100, which is the issue you reported a while back. The fix is only in the master branch for now.

What are your PV specs that use the same EFS volume for two PVCs?

@stefansedich
Author

stefansedich commented Dec 29, 2019 via email

@stefansedich
Author

stefansedich commented Dec 29, 2019 via email

@stefansedich
Author

@leakingtapan I figured I would fire up the latest dev build on one of the bad nodes to see if that helps, but it made no difference. Still no NodePublishVolume being called on the plugin. I take it there is some bad state in kubelet somewhere causing this not to be called, but so far I cannot find what it could be.

@leakingtapan
Contributor

I suspect it is something to do with kubelet. But you also mentioned both are using v1.14.8-eks-b8860f, which is surprising. What's the AMI you are using? Have you found any differences between the two AMIs?

@stefansedich
Author

@leakingtapan all nodes are running the same AMI, which is the latest ami-0c13bb9cbfd007e56 in us-west-2. It is also surprising to me. Today I had to keep moving forward, so I brought up new nodes in all our clusters, but it is likely to stop working again within a day or so.

I will look at setting more verbose logging for kubelet to see if that gives me any more clues, as right now I am all out of ideas.
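My rough plan for that is below; on the EKS AMI kubelet runs under systemd, but the exact flag wiring differs between AMI versions, so treat this as a sketch (and note that it overrides any KUBELET_EXTRA_ARGS the bootstrap script already set, which would need to be merged in):

# Raise kubelet log verbosity via a systemd drop-in, then restart kubelet.
sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/90-verbose.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--v=4"
EOF
sudo systemctl daemon-reload
sudo systemctl restart kubelet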

@stefansedich
Author

stefansedich commented Jan 7, 2020

@leakingtapan I can't explain it fully, but things seem stable now. Our clusters were running 1 replica of opa, and ever since I reconfigured it to run multiple instances across AZs I have yet to see this issue; the cluster has been solid with no EFS issues since the last time I brought up new nodes.

I am closing this off for now and will monitor things going forward.
