Timeouts binding to volume on same node #115
Checked the node in question and saw some dangling NFS mounts, which I unmounted, but even after that no mounts were working. I then restarted the CSI container on that node and attempted to mount from EFS again, and it does not even show any attempted bind in the logs, just silence. If I terminate the node and attempt to bind on a fresh node, it works as expected.
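For anyone ending up in the same state, a rough sketch of the node-level checks described above; the paths and the stale-mount example are placeholders, not taken from the reporter's environment:

```sh
# List NFS mounts that kubelet set up for pods on this node.
findmnt -t nfs4 | grep /var/lib/kubelet/pods || echo "no pod NFS mounts found"

# Any stunnel processes left over from TLS mounts?
pgrep -af stunnel

# Lazily unmount a stale mount point (substitute a real path from the findmnt output):
# umount -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount
```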
Looking at the node where things are failing, I am seeing the following in dmesg too. If I were to exec into the csi pod on the node and run a
@leakingtapan this is somewhat similar to #103, however in my case I am not overusing the stunnel ports; it is a single pod that runs on these nodes, and it works at first then stops after a while. I have been using the csi-driver in another environment since it was released without any of these issues; it appears to have reared its head with the last AMI rollout I did. Any more tips for debugging what could be happening?
From what you said, are you using TLS?
@leakingtapan yes to TLS. I see no stunnels at all if I manually jump into the csi pod on that node and check. I would expect that after rebooting the pod I could mount volumes again? What else could be broken here? I am seeing the same behavior on all our new clusters now, and my old one is humming along without issue, which is strange.
I can run my pod with a selector to pin it to the second node in this cluster, and I can see the NodePublishVolume come through and everything work. However, if I pin it to the other node I get nothing, not a single line shown in the efs-plugin container logs apart from the startup logs. Everything worked on both nodes last night; I went to bed, woke up, and one node had stopped working. I have witnessed this 3 times now where I have replaced all nodes and, given it a few hours, one or both of the nodes will just stop working.
BTW this is the only output I see from a freshly restarted efs-plugin container on the node with issues:
What's the kubelet version of both worker nodes (the working one and the not-working one)? Since the mount operation is called from kubelet, do you have kubelet logs for both cases?
@leakingtapan both nodes are running the same kubelet version. Here are the kubelet logs from when I restart the csi pod on the troublesome node. I initially thought the error was something to look deeper into, but after some digging it appears it could be "normal" to see that. I see no useful logs around the time I create the pod apart from the usual pod creation and then the timeout; nothing related to the expected NodePublishVolume or any errors about it.
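As a reference for digging on the kubelet side, something along these lines narrows the logs down to mount activity; the grep patterns and pod name are only suggestions:

```sh
# On the affected node: filter kubelet logs for CSI / mount activity.
journalctl -u kubelet --since "2 hours ago" --no-pager \
  | grep -iE "NodePublishVolume|MountVolume|timeout expired|efs"

# From a workstation: confirm which node the stuck pod actually landed on.
kubectl get pod <pod-name> -o wide
```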
Ah, I might know the reason why. This seems related to #100, which is the issue you reported a while back, and the fix is only in the master branch for now. What are the PV specs that use the same EFS volume for two PVCs?
This is definitely odd, as I have no issues in another cluster. In this case there is no binding at all to the folder; I can successfully bind it if I pin my pod to node B, but it never works pinned to node A.
Node A is unable to mount any EFS volume; the efs-plugin never sees a NodePublishVolume no matter what I do. If I terminate the instance and let a new one come up, things will be fine, but after some time it will stop working again at random.
I have tried restarting the csi pod to no avail; the only fix is to terminate the node.
To add to this: it is not a duplicate PVC or path, it is a new test PV/PVC and path that nothing is bound to anywhere.
@leakingtapan figured I would fire up the latest dev build on one of the bad nodes to see if that helps, but it made no difference. Still no NodePublishVolume being called on the plugin; I take it there is some bad state in kubelet somewhere causing this not to be called, but so far I cannot find what it could be.
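One way to check whether kubelet has lost track of the driver registration on the bad node (resource and path names as they exist around Kubernetes 1.14; the node name is a placeholder):

```sh
# Does the API still show the EFS driver registered for this node?
kubectl get csinode <node-name> -o yaml

# On the node itself: is the driver's registration socket still present?
ls /var/lib/kubelet/plugins_registry/
ls /var/lib/kubelet/plugins/efs.csi.aws.com/
```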
I suspect it is something to do with kubelet. But you also mentioned both nodes are using the same AMI?
@leakingtapan all nodes are running the same AMI, which is the latest. I will look at setting more verbose logging for kubelet to see if that gives me any more clues, as right now I am out of ideas.
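For the verbose kubelet logging idea, on the EKS-optimized AMI the verbosity is typically raised through the bootstrap script's extra kubelet args when the node is launched; the cluster name and verbosity level here are just examples:

```sh
# In the node's user data (EKS-optimized AMI): bump kubelet log verbosity.
/etc/eks/bootstrap.sh <cluster-name> --kubelet-extra-args '--v=4'

# On an already-running node: check which flags kubelet was started with.
ps aux | grep '[k]ubelet'
```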
@leakingtapan I can't explain it fully, but things seem stable now. Our clusters were running 1 replica of OPA, and ever since I re-configured it to run multiple instances across AZs I have yet to see this issue; the cluster has been solid with no EFS problems since the last time I brought up new nodes. I am closing this off for now and will keep monitoring things going forward.
/kind bug
What happened?
Have started seeing the "timeout expired waiting for volumes to attach or mount for pod" error when a pod is restarted and lands on the same node after things have been sitting for some time. This is only recent, and it appears to have started with the latest EKS AMI update, but I cannot prove that for sure at the moment (still digging deeper). I can have a pod binding /foo using one PVC, and even if I start another pod with a totally separate PVC mounting, say, /bah, I still see timeouts; if that pod moves to a node without any binds at all, everything works.
My current fix is to terminate my nodes; once they come back up, binds work for some time. For some reason everything works fine at first and I can restart pods happily, but if I come back a number of hours later the problem appears again.
One more thing: looking at the efs-plugin container, I see no more NodePublishVolume calls coming through since the initial binds when I last terminated my nodes to make things happy, which is also odd. I attempted to restart the csi pods to no avail; the only fix appears to be to bring up new nodes.
What you expected to happen?
For no mount timeouts to occur.
How to reproduce it (as minimally and precisely as possible)?
Create two PVCs for the same EFS volume, both for separate folders in the root, and create the pods on the same node; the second pod will fail to start due to timeouts. A rough sketch of such a setup is below.
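A minimal sketch of that reproduction using static provisioning: two PVs backed by the same EFS filesystem, each bound by its own PVC. The filesystem ID is a placeholder, and the separate-directory part (for example a subPath on each pod's volume mount) is left out for brevity; this is not necessarily how the reporter's PVs are written:

```sh
cat <<'EOF' > efs-repro.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-a
spec:
  capacity:
    storage: 5Gi
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-12345678   # placeholder filesystem ID
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-b
spec:
  capacity:
    storage: 5Gi
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-12345678   # same filesystem, second PV
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim-a
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""
  volumeName: efs-pv-a
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim-b
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""
  volumeName: efs-pv-b
  resources:
    requests:
      storage: 5Gi
EOF
kubectl apply -f efs-repro.yaml
# Then schedule one pod per claim onto the same node (e.g. with a nodeSelector)
# and watch for "timeout expired waiting for volumes to attach or mount".
```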
Anything else we need to know?:
Running in EKS using latest AMI.
Environment
Kubernetes version (use kubectl version): v1.14.9-eks-c0eccc