weave-npc blocking connections with valid network policy after a period of time (2.6.0) #3764
@naemono thanks for reporting the issue
Please gather the weave-npc logs and ipset dumps. Additions and deletions of ipset entries are logged, so we should be able to track under what scenario the ipset is going out of sync with the desired state.
We've also been noticing this for weeks now with the latest weave. In a deployment of n pods, we sometimes have x broken endpoints because, while the pods show as healthy & are there behind the load balancer, their network is completely blocked (as shown in the logs). Kicking such pods to other nodes, or restarting weave on the node, fixes the issue immediately. To be clear, only said pods would have trouble; the other pods on the same nodes would have no trouble. It occurs several times a week now. We haven't been able to track down a reproduction just yet. The weave uptime is ~4-6 months. We will attempt to get the ipsets & ipset logs, but it has generally occurred on production pods, barring us from the luxury of experimenting around.
@Quentin-M @naemono were your clusters upgraded from a previous version of weave-net, or are they new clusters?
@murali-reddy the cluster @naemono mentions has been upgraded from previous versions of weave. Furthermore, regarding the logs you asked for about updating ipsets: I haven't seen those in the weave-npc logs. Do we need to change the debugging level to collect that information?
Dear @murali-reddy,

We have been investigating today as we have a non-prod case right now where traffic is blocked between pods. As it appears, a number of our nodes have the ipsets for the From/To selectors as well as the associated iptables rule. However, the node that hosts the destination pod has neither the From nor the To ipset.

Looking at the AddPod/UpdatePod logs across the fleet, every single node (40+) got the event for the pod creation. However, looking at the nodes that received AddNetworkPolicy/UpdateNetworkPolicy, we only count 10 nodes, and thus also only 10 nodes that created the From/To ipsets (looking for "creating ipset" & the ipset hash) and the associated iptables rule. Therefore, as the ipsets were not provisioned, the From/To pod IPs could not be added to those non-existing ipsets. Looking at other recently created network policies in our logs, we noticed that only 10 nodes, the same 10 nodes, caught the event.

Therefore, I am guessing that the NetworkPolicy informer/watch is broken, whereas the Pod informer/watch still functions (but is unable to act given the required ipsets are unavailable). When we restart weave, all network policies are queried, state is reconciled, and traffic is fixed. It is fairly clear that this is the issue, because as soon as the events are received, the next step executed by weave is to log the reception of the event, which we do not see in our logs.

Because the resync period is set to 0 (and it seems that the updateNetworkPolicy function would deprovision before provisioning, screwing up the network), restarting all API servers to force a clean re-establishment of the watch (as the TCP sockets are clearly broken) would only solve the problem moving forward (i.e. for newly created/updated network policies), but not for network policies created/updated while the watches were broken. Restarting NPC would fix it, but we assume this will also cause deprovisioning before re-provisioning..? We have well over 30 broken nodes now, and tons and tons of pods.

It looks like updateNetworkPolicy is idempotent, so: a/ why don't we add a resync period? b/ why do we reset all iptables rules & all ipsets upon restart rather than letting the reconciliation functions do their job? This makes updates, and any restarts (e.g. to fix this bug), harmful as we'll drop packets. We should 100% be able to safely restart NPC without impacting traffic. It seems like a major flaw for any production use case IMHO.
TL;DR: Lots of weave-npc containers are missing the NetworkPolicy events from the informer, thus never creating the ipsets/iptables rules. There is no resync period defined, so if the watch gets broken the controller receives nothing whatsoever. With a resync, we would at least get the events every x minutes even with a broken watch. Also, what about RetryWatcher?
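For context, a minimal sketch of what a non-zero resync period on the NetworkPolicy informer looks like with client-go; the 5-minute interval and the logging handlers are illustrative assumptions, not weave-npc's actual configuration:

```go
package main

import (
	"log"
	"time"

	networkingv1 "k8s.io/api/networking/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A non-zero resync period re-delivers every cached object to the
	// handlers at this interval, so state that was missed or mangled by a
	// wedged handler eventually gets replayed.
	factory := informers.NewSharedInformerFactory(client, 5*time.Minute)
	npInformer := factory.Networking().V1().NetworkPolicies().Informer()

	npInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { log.Printf("add %s", obj.(*networkingv1.NetworkPolicy).Name) },
		UpdateFunc: func(oldObj, newObj interface{}) { log.Printf("update %s", newObj.(*networkingv1.NetworkPolicy).Name) },
		DeleteFunc: func(obj interface{}) { log.Printf("delete %v", obj) },
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, npInformer.HasSynced)
	<-stop // run until stopped
}
```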
We also have this issue going on in another cluster right now
We would expect to see some rules in the ipset on that node. The weave-npc logs are from 10.0.169.28; the weave pod has been running for 62 days.
We went all the way down on this one. So it appears that all 40+ of our weave-npc containers panicked all at once.
The issue lies in the npc/selector#deprovision function. My guess would be that the [...]. The reason we've been seeing 10 nodes still working is that those 10 weave-npc containers have been restarted by human intervention (while trying to unblock those nodes), or by other natural causes (e.g. a node restart). Beyond protecting the [...]
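Since the thread centers on a panic in an event handler silently killing the controller goroutine, here is a minimal sketch of one mitigation being discussed, wrapping handlers in a recover; the names are hypothetical and this is not weave-npc's actual code:

```go
package main

import (
	"log"
	"runtime/debug"
)

// safeHandler wraps an event handler so that a panic is logged instead of
// killing the goroutine that processes informer events. Whether to swallow
// the panic or exit the process (letting Kubernetes restart the pod) is the
// design question debated later in this thread.
func safeHandler(name string, h func(obj interface{})) func(obj interface{}) {
	return func(obj interface{}) {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("panic in %s handler: %v\n%s", name, r, debug.Stack())
				// os.Exit(1) here would instead force a clean pod restart.
			}
		}()
		h(obj)
	}
}

func main() {
	onAdd := safeHandler("add-pod", func(obj interface{}) {
		panic("simulated nil-map dereference") // hypothetical failure
	})
	onAdd(struct{}{})
	log.Println("controller goroutine is still alive")
}
```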
@Quentin-M thanks for digging deep and finding the root cause of the issue you are facing. So, as per your last comment:
I assume this was the reason the weave-npc running on the node does not process any events from the API server, potentially resulting in traffic from new pods being dropped. We had this issue reported earlier; I'm not sure whether the cause was the same in your case: #3407 (comment). But irrespective, panics need to be prevented and handled gracefully by weave-npc. Just so you know, the same logic was there in earlier versions as well, so I am not sure why you are hitting this issue only with 2.6.

Regarding the question on reconciliation: I agree weave-npc needs improvements. We have a tracking issue which is not addressed yet. On a weave-net pod restart, it should not flush all iptables chains and ipsets, which would mean disruption of services; there should just be reconciliation. There should also be a periodic sync task that does reconciliation, which is typical of controllers.
Dear @murali-reddy, Thanks for your answer.
No, I did not claim that we've only seen the issue in 2.6. Indeed, the [...]
Thanks for confirming the intent. It has been (personally) our biggest worry using Weave, as some of our financial services are sensitive to network blips. The scope of this work (and its required testing) seems fairly large. Is there any plan from Weaveworks to iterate on this? From our side, if you're fine with it, we'll be happy to submit a short PR to make the controllers recover from panics, as we'd like to fix (in other words: restart) our 30+ broken nodes without having to handle further restarts if it is related to a bad network policy resource.

Edit: I did notice that the NPC controller holds a mutex before calling any of the functions. Thank god. It seems like the only thing needed then would be to add a basic existence check. We can talk about the panic recovery, but I am concerned about whether we'd enter a panic loop at some point, given the structures would still be initialized. Propagating the panic to restart weave is a no-go given it'd reset the whole network.

Edit 2: We note that on our clusters, it currently takes around 10 seconds for NPC to delete and then re-create all the ipsets/rules. This means 10 seconds of network downtime for each of our nodes, and that is strictly the time between the main function starting and the wrap-up of initialization, not counting any time related to killing the container properly or pulling the image (if necessary).
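For illustration of the "basic existence check" mentioned above, here is a minimal sketch with the mutex held around the check; all types and names are hypothetical stand-ins, not weave-npc's real selector code:

```go
package main

import "sync"

// controller is an illustrative stand-in for the NPC controller state.
type controller struct {
	sync.Mutex
	selectors map[string]*selector // provisioned selectors, keyed by ipset name
}

type selector struct{ ipsetName string }

func (s *selector) destroy() error {
	// ... delete the ipset and the iptables rule referencing it ...
	return nil
}

// deprovision tears down a selector only if it was actually provisioned,
// avoiding the nil dereference / panic path when the ipset does not exist.
func (c *controller) deprovision(name string) error {
	c.Lock()
	defer c.Unlock()
	s, ok := c.selectors[name]
	if !ok {
		return nil // nothing to tear down
	}
	delete(c.selectors, name)
	return s.destroy()
}

func main() {
	c := &controller{selectors: map[string]*selector{}}
	_ = c.deprovision("weave-abc123") // safe even though it was never provisioned
}
```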
FYI, we are also testing what would happen with a reconciliation period other than "0" for pods/namespaces over here, and will loop back...
@Quentin-M Weave Net is an open-source project; please feel free to submit a PR.
As of now it's not planned for the next major release; it may be considered. But if you require a guaranteed fix you can reach out to Weaveworks commercial support. Again, any PR would be welcome.
@murali-reddy We've added a reconciliation loop for pods/namespaces/netpols over here, have it running in a couple of non-production clusters, and are not noticing any immediate issues with doing this. Was there any reason in particular that this wasn't initially implemented? I'm just wondering if others who are more intimately familiar with the code could point out any downsides of this approach. Thanks for any insights you can provide.
After some internal discussions, we ended up thinking that instead of making the controllers recover from panics (as some of the structures will still be loaded in memory), we should consider removing the reset upon startup and propagating the panic so the whole weave-npc container crashes and restarts cleanly. From there, reconciliation should be enough to clean up, catch up, and move forward.

However, just like @naemono, we are not sure why the reset was in place in the first place (given it obviously creates downtime upon NPC restart, which doesn't make much sense - 10 seconds in our case), so we are a bit wary. We assume that if we were to remove the initial reset, we should either do an initial reconciliation (what is on the host vs. the desired state), or make provision/deprovision based on some sort of 3-way decision (what is on the host, what is "current", and what is "desired").

We then looked at the test suite, to see if y'all had some sort of long-running hammering tests (e.g. create/update/delete hundreds of pods/services/netpols with various timings/configurations for a while, making sure that everything stays consistent, to which we could add some chaos), but did not find such a test suite (yet) that would let us make a drastic change like this comfortably. Thank you for your insight.
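As a rough illustration of the "what is on the host vs. what is the desired state" reconciliation idea, here is a minimal sketch under the assumption that ipsets can be represented as plain name sets; none of these names come from the weave codebase:

```go
package main

import "fmt"

// reconcile computes which ipsets to create and which to destroy so the host
// converges on the desired state, without an intermediate "flush everything"
// step that drops traffic.
func reconcile(onHost, desired map[string]bool) (toCreate, toDestroy []string) {
	for name := range desired {
		if !onHost[name] {
			toCreate = append(toCreate, name)
		}
	}
	for name := range onHost {
		if !desired[name] {
			toDestroy = append(toDestroy, name)
		}
	}
	return toCreate, toDestroy
}

func main() {
	onHost := map[string]bool{"weave-aaa": true, "weave-stale": true}
	desired := map[string]bool{"weave-aaa": true, "weave-bbb": true}
	create, destroy := reconcile(onHost, desired)
	fmt.Println("create:", create, "destroy:", destroy) // create: [weave-bbb] destroy: [weave-stale]
}
```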
@Quentin-M Agree. We should just perform reconciliation instead of a reset at the start of the weave-npc pod.
Did you make a change to the informer's resyncPeriod? That will cause load on the API server, as the informer on each node will relist the API objects at the configured interval. What is ideal is to design the controller so that, at a periodic interval, it lists the cached objects and performs a sync. Either way, I think the problem was not missed events from the API server, but rather that the goroutine that crashed stopped processing events. @Quentin-M please feel free to submit a PR to perform reconciliation instead of a reset.
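A minimal sketch of the periodic-sync design being suggested, listing objects from the informer's local cache (so there is no extra load on the API server) and re-running the sync on a timer; the one-minute interval and the syncNetworkPolicies placeholder are illustrative assumptions:

```go
package main

import (
	"log"
	"time"

	networkingv1 "k8s.io/api/networking/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// syncNetworkPolicies would re-apply ipsets/iptables from the cached policies;
// here it only logs, as a placeholder.
func syncNetworkPolicies(policies []*networkingv1.NetworkPolicy) {
	log.Printf("periodic sync over %d cached network policies", len(policies))
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 0) // resync stays at 0
	npLister := factory.Networking().V1().NetworkPolicies().Lister()
	npInformer := factory.Networking().V1().NetworkPolicies().Informer()

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, npInformer.HasSynced)

	// Every minute, list from the local cache and reconcile host state.
	wait.Until(func() {
		policies, err := npLister.List(labels.Everything())
		if err != nil {
			log.Printf("list from cache failed: %v", err)
			return
		}
		syncNetworkPolicies(policies)
	}, time.Minute, stop)
}
```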
@murali-reddy We're still testing the changes in our clusters, and will report back, likely with a PR that keeps the default at "0", but allows tuning of the reconciliation interval. |
@naemono then we are possibly looking at a different issue than what @Quentin-M observed. While reconciliation should help fix the problem, there must be a latent issue that is preventing the ipset from being updated appropriately. It would be nice to get to the root cause of the problem. I cannot access the logs you shared. Please see if there are any unexpected deletions from the ipset.
@murali-reddy Sorry about the logs: https://bb83126f306f400e87721c0482fcd983-logs.s3-us-west-2.amazonaws.com/weave-logs/weave-npc.logs I've verified that these are, indeed, public and accessible.
@Quentin-M it would be better if you opened a separate issue, since the two threads of conversation are hard to maintain. However, it appears to me that the intention is to re-raise the panic and hence exit the whole program. I am mystified why it keeps on running.
@naemono thank you very much for the logs, but I think the NPC log does not go back far enough - we only see it blocking things and not what led up to that state. Do you have earlier logs? (Or, if it is reproducible on restart, restart the pod and send us the whole log.)
@bboreham @murali-reddy I see that the logs seem to default to info level
Are the npc changes not shown at info level? As mentioned by @ephur earlier, we're not seeing them at all.
We actually set npc to debug level in the manifest. The problem I was commenting on is that your file starts at a time after the interesting part.
I'm going to go ahead and close this; we have recently discovered that the panic referenced by @Quentin-M actually IS the root cause of our issue as well, and #3771 is the continued discussion. Opening a PR now to attempt to crash the whole npc app when an internal panic occurs.
I'm going to re-open this issue to cover the underlying symptom described at #3764 (comment) - a panic causes a hang, not a restart. #3771's description covers what should happen after a restart, although the subsequent discussion is mostly about what causes the hang.
What you expected to happen?
Similar to #3761, we are seeing traffic being blocked by weave-npc, but we are using network policies. I would expect traffic not to be blocked by NPC with a valid network policy in place.
What happened?
We have now seen, consistently (about once every 1-2 weeks), traffic get blocked between pods inside a namespace where traffic was working fine earlier. After we debugged the issue and saw that the ipsets on the host did not have valid entries for the pods, we restarted weave on the host, the ipsets became populated, and traffic continued to flow.
How to reproduce it?
I wish we had an easy way to consistently reproduce this issue, but we are beginning to see it nearly every week within one specific cluster.
Anything else we need to know?
Cloud provider: AWS
Custom-built cluster using in-house automation.
Versions:
Logs:
Unfortunately, these logs do not show the weave logs before the restart, but when we run into this issue again (in a week or so), we will get those logs and update this issue.
https://gist.github.com/naemono/31df744c7ee6b48dba7b554e06553f4b
When this issue is happening, we begin to see a spike in weavenpc_blocked_connections_total from Prometheus.