Do something more useful when the weave bridge is DOWN #3133
Yeah, same here.
Affected since ufw installation.
@Bregor do you think it goes DOWN mid-session or could it be just on reboot?
I'm pretty sure it's mid-session. Moreover, a node reboot does not cure this condition. The guaranteed way to fix it for some time is to download
This time right after host restart |
We are also facing the same issue, and it mostly happens on a newly scaled-up node.
Not-working node
Working node
Output of
@bboreham what should we do to fix it?
We just stumbled across this twice in less than a week. Unclear if it was before or after restart. |
@deitch do you have
I can get it. Any particular
To begin with, it'd be good to see a message of the weave bridge being brought down in
@brb I am working with @deitch on this. It looks like after a system reboot the weave interface was never brought back up until I manually did it today.
Here is also the dmesg
Adding to it: when we do bring the |
FWIW:
On underlying host:
In a container:
What would be really helpful:
Bingo. The following iptables entries were not present on the rebooted node. Once added, everything came back to life. Provided by
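For anyone who wants to verify this programmatically, here is a minimal sketch (not part of weave itself) that lists the chains in a table and reports whether weave-related ones are present. It uses the coreos/go-iptables package; the chain names checked ("WEAVE" in nat, "WEAVE-NPC" in filter) are assumptions for illustration, since the exact missing entries above were not captured in this thread.

```go
package main

import (
	"fmt"
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	ipt, err := iptables.New()
	if err != nil {
		log.Fatalf("iptables init: %v", err)
	}

	// Chain names are illustrative assumptions, not an authoritative list of
	// everything weave installs.
	checks := map[string][]string{
		"nat":    {"WEAVE"},
		"filter": {"WEAVE-NPC"},
	}

	for table, chains := range checks {
		existing, err := ipt.ListChains(table)
		if err != nil {
			log.Fatalf("list chains in %s: %v", table, err)
		}
		present := map[string]bool{}
		for _, c := range existing {
			present[c] = true
		}
		for _, want := range chains {
			fmt.Printf("table=%s chain=%s present=%v\n", table, want, present[want])
		}
	}
}
```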
Thanks for the logs and the investigation! The weave bridge should not survive reboots, and it should be created by weave-kube (when it starts). In any case, we should perhaps monitor via netlink subscribe when the bridge goes down and do something more useful.
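For illustration, a minimal sketch of the netlink-subscribe idea mentioned above, using the vishvananda/netlink package. This is not the actual weave implementation, and the reaction to the event here is only a log message; a real fix would presumably re-run the bridge initialization.

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	updates := make(chan netlink.LinkUpdate)
	done := make(chan struct{})
	defer close(done)

	// Subscribe to interface state changes over netlink.
	if err := netlink.LinkSubscribe(updates, done); err != nil {
		log.Fatalf("link subscribe: %v", err)
	}

	for update := range updates {
		attrs := update.Link.Attrs()
		if attrs == nil || attrs.Name != "weave" {
			continue
		}
		// IFF_UP cleared means the bridge has been brought down.
		if update.IfInfomsg.Flags&unix.IFF_UP == 0 {
			log.Printf("weave bridge is DOWN; something more useful should happen here")
		}
	}
}
```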
Quite welcome. I figure the better detail I get, the faster we both get it resolved.
So when the node starts, the
Output of
FWIW, here are the first ~100 lines of the
The current code skips the initialization steps (including setting up the required iptables chains and bringing the bridge interface up) if the bridge exists (see https://github.com/weaveworks/weave/blob/v2.1.3/net/bridge.go#L231), which is a bug. So, if the container
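Roughly, the shape of such a fix is to stop treating an existing bridge as "nothing to do" and instead run the idempotent initialization steps anyway. A hypothetical sketch, not the actual weave code; the bridge name and the helper below are assumptions for illustration.

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

// ensureWeaveBridge is a hypothetical illustration: create the bridge only if
// it is missing, but always bring it up (and, in weave itself, also re-install
// the required iptables chains) even when the device already exists.
func ensureWeaveBridge(name string) error {
	link, err := netlink.LinkByName(name)
	if err != nil {
		// Bridge not found (simplified error handling): create it from scratch.
		br := &netlink.Bridge{LinkAttrs: netlink.LinkAttrs{Name: name}}
		if err := netlink.LinkAdd(br); err != nil {
			return err
		}
		link = br
	}
	// Previously skipped when the bridge existed; setting a link up is
	// idempotent, so it is safe to do unconditionally.
	return netlink.LinkSetUp(link)
}

func main() {
	if err := ensureWeaveBridge("weave"); err != nil {
		log.Fatalf("ensure weave bridge: %v", err)
	}
}
```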
These are handled independently by the
Just noting that you are running quite an ancient version of weave (v2.0.1), while v2.1.3 exists. Mind upgrading? I suggest the following actions:
Already planned to upgrade in the next few days. Been backed up with other requirements.
Aha. That explains, well, everything. :-)
And maybe have
Essentially, when the
From a practical operational standpoint, what steps can I take on host startup so that a reboot is non-disabling? What can I do so that after a reboot weave behaves correctly? On a reboot, I already have
Yep! That's what I meant by "do missing steps".
Ideally, none, as the bridge interface should not survive system reboots. In your case, I suspect that
Both interfaces are created during the initialization, and there is a ~40sec gap in between. You should be able to access the previous
We did last night. Ironically, it caused a minor problem, as the (now-properly-working) NPC blocked some services from working. We knew it wasn't working 100% with the defaults, but put it in place anyways so it would work when it did. But we misconfigured it. Oops. :-)
But we aren't in an ideal world. Can I put an
That seems strange to me. In our case, we had the host up and running for quite some time,
Unfortunately, getting the previous pod isn't going to help anymore. We upgraded to 2.1.3 last night, so you would need to go back several pods, which won't help. Well, we have fluentd sending to loggly, but when the bridge was down, the fluentd pod could not communicate out, so those are gone. So:
I meant for the first time after a reboot.
Good. Have you experienced the problem with 2.1.3?
I'm hoping to have a PR ready next week.
I don't see any easy way. Before starting the weave pod, you'd need to call destroy_bridge, which is not exported, and afterwards force k8s to CNI-ADD all existing containers on the host.
Ah, right. Got it.
Haven't had a reboot since we upgraded a few days ago. We can try it, but since we know the cause, and it isn't consistent, I am not sure the test will be scientific. :-)
Well, if it is that soon, we will live with the risk. Anything we do on our own infra is going to take enough time to validate that we will be better off with 2.1.4 (or whichever). I probably should do a PR myself - I have added one or two to weave net - but I'm out on holiday Monday-Tuesday and totally overloaded with end-of-year stuff. May I ask that you flag the PR on this issue so interested parties can track it? Thanks!
Sure, #3204 |
Thanks! |
@deitch our issue does not get fixed by a reboot every time, but only when we delete all the masters and nodes. What should I do to fix this issue on my own? Not sure when the release with the fix is happening. Our production can get impacted because of this; currently the problem is in staging, where nodes scale up and down heavily.
@alok87 I don't understand. Do you mean a new worker node comes up, yet it fails to join the existing weave network? |
@deitch Yes, a worker node comes up and has the
@deitch This is our weave DS - https://gist.github.com/alok87/07a2ea274a8962726cb7e875c5ad5887 |
@brb we are at
This was happening exactly after upgrading the 1.7.6 cluster with weave 2.1.3.
Restarting kube-dns fixes it. Why is a kube-dns restart required after a weave upgrade?
@alok87 I suggest you open a separate issue, and please don't forget to include as much info as possible (e.g. the logs of the restarting weave pod).
Do not skip bridge creation if bridge exists
is found to be in DOWN state. Fixes #3133 (Do something more useful when the weave bridge is DOWN). On Weave restart there is already logic in place to create the bridge interface and bring it up; this fix only monitors the weave bridge interface and logs an error if it is in the DOWN state.
Monitor and log an error message if the Weave bridge interface is down. Fixes #3133
Encountered during an instance of #2998 - nothing was working on one node because its bridge was DOWN.
Not sure if the best thing to do is to make this more clear to the administrator, or try to mend the bridge so it can be UP.
Is it always a bug? Are there reasons for the machine owner to deliberately set it DOWN?