
weave-npc blocking connections with valid network policy after a period of time (2.6.0) #3764

Closed
naemono opened this issue Feb 4, 2020 · 23 comments · Fixed by #3841

@naemono
Contributor

naemono commented Feb 4, 2020

What you expected to happen?

Similar to #3761, we are seeing traffic being blocked by weave-npc, but in our case we are using network policies. I would expect traffic not to be blocked by NPC when a valid network policy is in place.

What happened?

We now see, consistently (about once every 1-2 weeks), that traffic gets blocked between pods inside a namespace where traffic was working fine earlier. After we debugged the issue and saw that the ipsets on the host did not have valid entries for the pods, we restarted weave on the host; the ipsets then became populated and traffic continued to flow.

How to reproduce it?

I wish we had an easy way to reproduce this issue consistently, but we are beginning to see it nearly every week within one specific cluster.

Anything else we need to know?

Cloud provider: AWS
Custom-built cluster using in-house automation.

Versions:

# ./weave version
weave script 2.6.0
# docker version
Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea838
 Built:             Wed Nov 13 07:29:52 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea838
  Built:            Wed Nov 13 07:28:22 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
# uname -a
Linux ip-10-0-173-150 4.15.0-1056-aws #58-Ubuntu SMP Tue Nov 26 15:14:34 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.2", GitCommit:"f6278300bebbb750328ac16ee6dd3aa7d3549568", GitTreeState:"clean", BuildDate:"2019-08-05T16:54:35Z", GoVersion:"go1.12.7", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.6", GitCommit:"7015f71e75f670eb9e7ebd4b5749639d42e20079", GitTreeState:"clean", BuildDate:"2019-11-13T11:11:50Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Unfortunately, these logs do not include the weave logs from before the restart, but when we run into this issue again (in a week or so), we will get those logs and update this issue.

https://gist.github.com/naemono/31df744c7ee6b48dba7b554e06553f4b

When this issue is happening, we begin to see a spike in weavenpc_blocked_connections_total from prometheus:
[screenshot: Prometheus graph showing the spike in weavenpc_blocked_connections_total]

@murali-reddy
Contributor

@naemono thanks for reporting the issue

After we debugged the issue and saw that the ipsets on the host did not have valid entries for the pods,

when we run into this issue again (in a week or so), we will get those logs and update this issue

Please gather the weave-npc logs and ipset dumps. Addition/deletion of ipset entries is logged, so we should be able to track under what scenario the ipsets go out of sync with the desired state.

@Quentin-M

Quentin-M commented Feb 4, 2020

We've also been noticing this for weeks now with the latest weave. In a deployment of n pods, we sometimes have x broken endpoints because, while the pods show as healthy & sit behind the load balancer, their network is completely blocked (as shown in the logs). Kicking such pods to other nodes, or restarting weave on the node, fixes the issue immediately. To be clear, only said pods have trouble; the other pods on the same nodes are fine. It occurs several times a week now. We haven't been able to track down a reproduction just yet. The weave uptime is ~4-6 months. We will attempt to get the ipsets & ipset logs, but it generally occurs on production pods, which bars us from the luxury of experimenting around.

@murali-reddy
Contributor

@Quentin-M @naemono were your clusters upgraded from a previous version of weave-net, or are they new clusters?

@ephur

ephur commented Feb 4, 2020

@murali-reddy the cluster @naemono mentions has been upgraded from previous versions of weave.

Furthermore, regarding the logs you asked for about ipset updates: I haven't seen those in the weave-npc logs. Do we need to change the logging level to collect that information?

@Quentin-M

Quentin-M commented Feb 4, 2020

Dear @murali-reddy,

We have been investigating today, as we have a non-prod case right now where traffic is blocked between pods.

As it appears, a number of our nodes have the IPSets for the From/To selectors as well as the associated iptables rule. However, the node that hosts the destination pod has neither of the From/To IPSets. Looking at the AddPod/UpdatePod logs across the fleet, every single node (40+) got the event for the pod creation. However, looking at the nodes that received AddNetworkPolicy/UpdateNetworkPolicy, we only count 10 nodes, and consequently only 10 nodes created the From/To IPSets (searching for "creating ipset" & the IPSet hash) and the associated iptables rule. Therefore, since the IPSets were not provisioned, the From/To pod IPs could not be added to these non-existent IPSets either. Looking at other recently created network policies in our logs, we noticed that only 10 nodes, the same 10 nodes, caught the event.

Therefore, I am guessing that the NetworkPolicy informer/watch is broken, whereas the Pod informer/watch still functions (but is unable to act given the required IPSets are unavailable). When we restart weave, all network policies are queried, the state is reconciled, and traffic is fixed. It is actually pretty clear that this is the issue: as soon as an event is received, the next step weave executes is to log the reception of that event, which we do not see in our logs.

Because the resync period is set to 0 (it seems the updateNetworkPolicy function would deprovision before provisioning, which would disrupt the network), restarting all API servers to force a clean re-establishment of the watch (the TCP sockets are clearly broken) would only solve the problem going forward (i.e. for newly created/updated network policies), but not for network policies created/updated while the watches were broken. Restarting NPC would fix it, but I assume this would also cause deprovisioning before re-provisioning? We have well over 30 broken nodes now, and tons of pods.

It looks like updateNetworkPolicy is idempotent, so a/ why don't we add a resync period? b/ why would we reset all iptables & all ipsets upon restart rather than letting the reconciliation functions do their job? This makes updates, and any restart (e.g. to fix this bug), harmful, as we'll drop packets. We should 100% be able to safely restart NPC without impacting traffic. It seems like a major flaw for any production use case, IMHO.

	handleError(resetIPTables(ipt))
	handleError(resetIPSets(ips))
	handleError(createBaseRules(ipt, ips))

TL;DR: Loads of weave-npc containers are missing the NetworkPolicy events from the informer, and thus are not creating the IPSets/iptables rules. No resync period is defined, so if the watch gets broken nothing is received at all; with a resync we would at least get the events every x minutes even with a broken watch. Also, what about RetryWatcher?
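
For illustration, here is roughly what a non-zero resync period on the NetworkPolicy informer would look like with client-go's cache.NewInformer. This is only a sketch: the 5-minute interval and the handler bodies are placeholders rather than weave-npc's actual code, and the real handlers would need to be safe to replay.

// Sketch only: a NetworkPolicy informer with a non-zero resync period, so that
// existing objects are periodically re-delivered to UpdateFunc even if a watch
// silently breaks. Handler bodies and the interval are illustrative.
package sketch

import (
	"log"
	"time"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func runNetworkPolicyInformer(client kubernetes.Interface, stopCh <-chan struct{}) {
	lw := cache.NewListWatchFromClient(
		client.NetworkingV1().RESTClient(), "networkpolicies", metav1.NamespaceAll, fields.Everything())

	_, controller := cache.NewInformer(
		lw,
		&networkingv1.NetworkPolicy{},
		5*time.Minute, // resync period: re-deliver cached objects every 5 minutes
		cache.ResourceEventHandlerFuncs{
			AddFunc:    func(obj interface{}) { log.Printf("EVENT AddNetworkPolicy %v", obj) },
			UpdateFunc: func(oldObj, newObj interface{}) { log.Printf("EVENT UpdateNetworkPolicy %v", newObj) },
			DeleteFunc: func(obj interface{}) { log.Printf("EVENT DeleteNetworkPolicy %v", obj) },
		},
	)
	controller.Run(stopCh)
}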

@naemono
Contributor Author

naemono commented Feb 4, 2020

We also have this issue going on in another cluster right now:

# Log from application
[cdb-7a593229-07f8-493f-91a0-4adea89eaf45-0] W200204 20:43:17.899119 5774009 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {cdb-7a593229-07f8-493f-91a0-4adea89eaf45-2.cdb-7a593229-07f8-493f-91a0-4adea89eaf45-internal.7a593229-07f8-493f-91a0-4adea89eaf45:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.16.152.4:26257: i/o timeout". Reconnecting...

# so -0 of the statefulset, cannot speak to -2 member on node 10.0.169.28
$ kubectl get pods -n 7a593229-07f8-493f-91a0-4adea89eaf45 -o wide
NAME                                         READY   STATUS    RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
cdb-7a593229-07f8-493f-91a0-4adea89eaf45-0   2/2     Running   0          4d3h   172.16.72.4    ip-10-0-141-103.us-west-2.compute.internal   <none>           <none>
cdb-7a593229-07f8-493f-91a0-4adea89eaf45-1   2/2     Running   0          4d3h   172.16.112.2   ip-10-0-152-110.us-west-2.compute.internal   <none>           <none>
cdb-7a593229-07f8-493f-91a0-4adea89eaf45-2   2/2     Running   0          4d3h   172.16.152.4   ip-10-0-169-28.us-west-2.compute.internal    <none>           <none>

root@ip-10-0-169-28:~# iptables -L -nv | grep 7a593229-07f8-493f-91a0-4adea89eaf45
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set weave-cN1skCFLoaYj*t2Ym.tmuLY%4 dst /* DefaultAllow ingress isolation for namespace: 7a593229-07f8-493f-91a0-4adea89eaf45 */
    0     0 WEAVE-NPC-EGRESS-ACCEPT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set weave-~9IS!W{C7.msGij/f[%w27j?u src /* DefaultAllow egress isolation for namespace: 7a593229-07f8-493f-91a0-4adea89eaf45 */
    0     0 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set weave-~9IS!W{C7.msGij/f[%w27j?u src /* DefaultAllow egress isolation for namespace: 7a593229-07f8-493f-91a0-4adea89eaf45 */
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set weave-+d#CMWwwpfatM/b:D@7LB%VD8 src match-set weave-{V*Z4^{Xw5:O?*rK*MP_#yTZx dst /* namespaces: selector: allow-namespace-in-default-deny.organization.cloud -> pods: namespace: 7a593229-07f8-493f-91a0-4adea89eaf45, selector:  (ingress) */
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set weave-0d{4DK:*zt}w#16(f8?tCqqVR src match-set weave-{V*Z4^{Xw5:O?*rK*MP_#yTZx dst /* namespaces: selector: name.organization.cloud=7a593229-07f8-493f-91a0-4adea89eaf45 -> pods: namespace: 7a593229-07f8-493f-91a0-4adea89eaf45, selector:  (ingress) */
    0     0 ACCEPT     all  --  *      *       172.16.144.0         0.0.0.0/0            match-set weave-{V*Z4^{Xw5:O?*rK*MP_#yTZx dst /* cidr: 172.16.144.0/32 -> pods: namespace: 7a593229-07f8-493f-91a0-4adea89eaf45, selector:  (ingress) */
    0     0 ACCEPT     all  --  *      *       172.16.0.1           0.0.0.0/0            match-set weave-{V*Z4^{Xw5:O?*rK*MP_#yTZx dst /* cidr: 172.16.0.1/32 -> pods: namespace: 7a593229-07f8-493f-91a0-4adea89eaf45, selector:  (ingress) */
    0     0 ACCEPT     all  --  *      *       172.16.192.0         0.0.0.0/0            match-set weave-{V*Z4^{Xw5:O?*rK*MP_#yTZx dst /* cidr: 172.16.192.0/32 -> pods: namespace: 7a593229-07f8-493f-91a0-4adea89eaf45, selector:  (ingress) */
root@ip-10-0-169-28:~# ipset list 'weave-0d{4DK:*zt}w#16(f8?tCqqVR'
Name: weave-0d{4DK:*zt}w#16(f8?tCqqVR
Type: list:set
Revision: 3
Header: size 1024 comment
Size in memory: 200
References: 1
Number of entries: 1
Members:
weave-{V*Z4^{Xw5:O?*rK*MP_#yTZx comment "namespace: 7a593229-07f8-493f-91a0-4adea89eaf45"
root@ip-10-0-169-28:~# ipset list 'weave-{V*Z4^{Xw5:O?*rK*MP_#yTZx'
Name: weave-{V*Z4^{Xw5:O?*rK*MP_#yTZx
Type: hash:ip
Revision: 4
Header: family inet hashsize 1024 maxelem 65536 comment
Size in memory: 88
References: 6
Number of entries: 0
Members:
root@ip-10-0-169-28:~#

I would expect to see some entries in the ipset on that node.

weave-npc logs from 10.0.169.28
https://scpycae8-cluster-storage.s3-us-west-2.amazonaws.com/weave-logs/weave-npc.logs
weave logs
https://scpycae8-cluster-storage.s3-us-west-2.amazonaws.com/weave-logs/weave.logs

weave pod has been running for 62 days

$ kubectl get pod -n kube-system -o wide | grep  ip-10-0-169-28
weave-net-mwrmn                                                      2/2     Running   0          62d    10.0.169.28     ip-10-0-169-28.us-west-2.compute.internal    <none>           <none>

@Quentin-M

We went all the way down on this one.

So it appears that all 40+ of our weave-npc containers panicked all at once.

INFO: 2019/12/18 23:29:45.395235 EVENT UpdateNetworkPolicy <redacted>
INFO: 2019/12/18 23:29:45.395556 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.400406 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.404054 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.407970 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.411377 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.415293 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.419160 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.422601 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.425904 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.429113 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.432321 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.435514 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.438740 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.441963 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.445437 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.448684 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.452192 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.455496 deleting rule <redacted> from "WEAVE-NPC-INGRESS" chain
INFO: 2019/12/18 23:29:45.458663 destroying ipset: &npc.selectorSpec{<redacted>}
ERROR: logging before flag.Parse: E1218 23:29:45.469138   31331 runtime.go:66] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
        /usr/local/go/src/runtime/asm_amd64.s:573
        /usr/local/go/src/runtime/panic.go:502
        /usr/local/go/src/runtime/hashmap.go:507
        /go/src/github.com/weaveworks/weave/npc/selector.go:157
        /go/src/github.com/weaveworks/weave/npc/namespace.go:347
        /go/src/github.com/weaveworks/weave/npc/controller.go:146
        /go/src/github.com/weaveworks/weave/npc/controller.go:78
        /go/src/github.com/weaveworks/weave/npc/controller.go:145
        /go/src/github.com/weaveworks/weave/prog/weave-npc/main.go:316
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/client-go/tools/cache/controller.go:202
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/client-go/tools/cache/controller.go:309
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/client-go/tools/cache/delta_fifo.go:444
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/client-go/tools/cache/delta_fifo.go:444
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/client-go/tools/cache/controller.go:150
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/client-go/tools/cache/controller.go:124
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/client-go/tools/cache/controller.go:124
        /usr/local/go/src/runtime/asm_amd64.s:2361

The issue lies in the npc/selector#deprovision function. My guess is that the *selectorSpec was changed by an informer in another goroutine, since the structure is neither deep-copied nor locked, and that made the loop crash. Then, as you can see in the trace, the panic bubbles up all the way back to the npController.Run internal loop and crashes the whole goroutine dealing with Network Policies, without making weave-npc itself crash/restart, leaving you with a weave-npc working with 2 controllers instead of 3 (or only 1, if you consider that the Pod controller can't function without the IPSets created by the Network Policy controller).

The reason we've been seeing 10 nodes still working is that those 10 weave-npc containers had been restarted by human intervention (while trying to unblock those nodes) or by other causes (e.g. a node restart).

Beyond protecting the deprovision function from panicking with the appropriate locks / deep copies, I suggest either a/ we make the controllers recover from panics & restart, or b/ we make the whole weave-npc binary crash on any panic from the controllers. Solution a/ is better, as it avoids resetting all the IPSets/rules/etc.; however, b/ may allow a clean start (no cache, no broken state, etc.). Note, however, that if we had done b/, since all of our weave-npc nodes crashed at once, we would have suffered cluster-wide network downtime while weave-npc cleared and reset everything, which would be really, really bad. (Why do you reset upon start again instead of reconciling?)
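
As a rough illustration of option a/, a recover wrapper around each handler invocation could look like the sketch below; safeHandle and the handler named in the usage comment are made-up names, not weave-npc's actual code.

// Sketch only: log and swallow a panic raised by a single event handler so the
// controller's event-processing goroutine keeps running (option a/ above).
package sketch

import (
	"log"
	"runtime/debug"
)

func safeHandle(name string, fn func() error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("recovered from panic in %s: %v\n%s", name, r, debug.Stack())
		}
	}()
	if err := fn(); err != nil {
		log.Printf("error in %s: %v", name, err)
	}
}

// Usage inside a cache.ResourceEventHandlerFuncs, for example:
//   UpdateFunc: func(oldObj, newObj interface{}) {
//       safeHandle("UpdateNetworkPolicy", func() error { return npc.UpdateNetworkPolicy(oldObj, newObj) })
//   },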

@murali-reddy
Contributor

@Quentin-M thanks for digging deep and finding the root cause of the issue you are facing

So, as per your last comment:

crashes the whole goroutine dealing with Network Policies, without making weave-npc itself crash/restart, leaving you with a weave-npc working with 2 controllers instead of 3 (or only 1, if you consider that the Pod controller can't function without the IPSets created by the Network Policy controller)

I assume this is the reason the weave-npc running on the node does not process any events from the API server, potentially resulting in traffic from new pods being dropped. We had this issue reported earlier; I am not sure whether the cause was the same in your case: #3407 (comment)

But irrespective of that, the panic needs to be prevented and handled gracefully by weave-npc. Just so you know, the same logic was there in earlier versions as well, so I am not sure why you are hitting this issue only with 2.6.

Regarding the question on reconciliation: I agree weave-npc needs improvements. We have a tracking issue which has not been addressed yet:

#3586

On weave-net pod restart, it should not flush all the iptables chains and ipsets, which means disruption of services; there should just be reconciliation. There should also be a periodic sync task that does reconciliation, as is typical of controllers.
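
For illustration, a reconciliation pass of that shape could look roughly like the following sketch; the maps and helper callbacks are made up for the example and are not weave-npc's actual ipset API.

// Sketch only: reconcile ipsets against the desired state instead of flushing
// everything on startup. "existing" is what was read from the host, "desired"
// is computed from the current Kubernetes objects; only the difference is
// touched, so traffic allowed by unchanged policies keeps flowing.
package sketch

func reconcileIPSets(
	existing map[string]map[string]bool, // set name -> entries found on the host
	desired map[string]map[string]bool, // set name -> entries that should exist
	createSet func(name string) error,
	destroySet func(name string) error,
	addEntry func(set, entry string) error,
	delEntry func(set, entry string) error,
) error {
	for name, want := range desired {
		have, ok := existing[name]
		if !ok {
			if err := createSet(name); err != nil {
				return err
			}
			have = map[string]bool{}
		}
		for entry := range want {
			if !have[entry] {
				if err := addEntry(name, entry); err != nil {
					return err
				}
			}
		}
		for entry := range have {
			if !want[entry] {
				if err := delEntry(name, entry); err != nil {
					return err
				}
			}
		}
	}
	for name := range existing {
		if _, ok := desired[name]; !ok {
			if err := destroySet(name); err != nil {
				return err
			}
		}
	}
	return nil
}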

@Quentin-M

Quentin-M commented Feb 5, 2020

Dear @murali-reddy,

Thanks for your answer.

But irrespective of that, the panic needs to be prevented and handled gracefully by weave-npc. Just so you know, the same logic was there in earlier versions as well, so I am not sure why you are hitting this issue only with 2.6.

No, I did not claim that we've only seen the issue in 2.6. Indeed, the deprovision code has been unchanged for a long time. I imagine that the conditions creating the panic in that function occur only seldom & require good timing, which would be why the issue hasn't been that widespread.

On weave-net pod restart, it should not flush all the iptables chains and ipsets, which means disruption of services; there should just be reconciliation. There should also be a periodic sync task that does reconciliation, as is typical of controllers.

Thanks for confirming the intent. It has been (personally) our biggest worry using Weave, as some of our financial services are sensitive to network blips. The scope of this work (and its required testing) seems fairly large. Is there any plan from WeaveWorks to iterate on this?

From our side, if you're fine with it, we'll be happy to submit a short PR to make the controllers recover from panics, as we'd like to fix (in other words: restart) our 30+ broken nodes, without having to handle other restarts if it is related to a bad network policy resource.

Edit: I did notice that the NPC controller holds a mutex before calling any of the functions. Thank god. It seems the only thing needed then would be a basic existence check. We can talk about panic recovery, but I am concerned about whether we'd enter a panic loop at some point, given the structures would still be initialized. Propagating the panic to restart weave is a no-go given it would reset all networking.

Edit 2: We note that on our clusters it currently takes around 10 seconds for NPC to delete & then re-create all the IPSets / rules. That means 10 seconds of network downtime for each of our nodes, and that is strictly the time between the main function starting & the wrap-up of initialization, not counting any time related to killing the container properly or pulling the image (if necessary).
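
Coming back to the "basic existence check" from the first edit: a minimal sketch of the idea is below, with illustrative types and field names rather than the actual ones from npc/selector.go. The point is simply that a missing or already-removed map can no longer produce an "assignment to entry in nil map" panic.

// Sketch only: defensive existence checks around selector bookkeeping maps.
package sketch

type selectorBook struct {
	policiesBySelector map[string]map[string]bool // selector key -> policies using it
}

func (b *selectorBook) addUser(selectorKey, policy string) {
	if b.policiesBySelector == nil {
		b.policiesBySelector = make(map[string]map[string]bool)
	}
	if b.policiesBySelector[selectorKey] == nil {
		b.policiesBySelector[selectorKey] = make(map[string]bool)
	}
	b.policiesBySelector[selectorKey][policy] = true
}

func (b *selectorBook) removeUser(selectorKey, policy string) (lastUser bool) {
	users, ok := b.policiesBySelector[selectorKey]
	if !ok {
		return false // already deprovisioned; nothing to do, and no panic
	}
	delete(users, policy)
	if len(users) == 0 {
		delete(b.policiesBySelector, selectorKey)
		return true // the caller can now destroy the ipset for this selector
	}
	return false
}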

@naemono
Contributor Author

naemono commented Feb 5, 2020

FYI, we are also testing what happens with a reconciliation period other than "0" for pods/namespaces over here, and will loop back...

@murali-reddy
Contributor

From our side, if you're fine with it, we'll be happy to submit a short PR to make the controllers recover from panics

@Quentin-M Weave Net is an open-source project; please feel free to submit a PR.

Thanks for confirming the intent. It has been (personally) our biggest worry using Weave, as some of our financial services are sensitive to network blips. The scope of this work (and its required testing) seems fairly large. Is there any plan from WeaveWorks to iterate on this?

As of now it is not planned for the next major release; it may be considered. But if you require a guaranteed fix you can reach out to Weaveworks commercial support. Again, any PR would be welcome.

@naemono
Contributor Author

naemono commented Feb 7, 2020

@murali-reddy We've added a reconciliation loop for pods/namespaces/netpols over here, have it running in a couple of non-production clusters, and are not noticing any immediate issues with doing this. Was there any particular reason this wasn't implemented initially? I'm just wondering if others who are more intimately familiar with the code could point out some downsides. Thanks for any insights you can provide.

@Quentin-M

Quentin-M commented Feb 7, 2020

From our side, if you're fine with it, we'll be happy to submit a short PR to make the controllers recover from panics

After some internal discussions, we ended up thinking that, instead of making the controllers recover from panics (as some of the structures would still be loaded in memory), we should consider removing the reset upon startup and propagating the panic to make the whole weave-npc container crash & restart cleanly. From there, reconciliation should be enough to clean up / catch up / move forward.

However, just like @naemono, we are not sure why the reset was in place in the first place (it obviously creates downtime upon npc restart, which doesn't make much sense; 10 seconds in our case), so we are a bit wary. We assume that if we were to remove the initial reset, we should either do an initial reconciliation (what is on the host vs. the desired state), or make provision/deprovision based on some sort of 3-way decision (what is on the host, what is "current", and what is "desired").

We were then looking at the test suite to see if y'all had some sort of long-running hammering tests (e.g. create/update/delete hundreds of pods / services / netpols with various timings/configurations for a while, making sure everything stays consistent, to which we could add some chaos), but did not find such a test suite (yet) that would let us make a drastic change like this comfortably.

Thank you for your insight.

@murali-reddy
Contributor

However, just like @naemono, we are not sure why the reset was in place in the first place (it obviously creates downtime upon npc restart, which doesn't make much sense; 10 seconds in our case), so we are a bit wary.

@Quentin-M Agree. We should just perform reconciliation instead of a reset at the start of the weave-npc pod.

We've added a reconciliation loop for pods/namespaces/netpols over here, have it running in a couple of non-production clusters, and are not noticing any immediate issues with doing this. Was there any particular reason this wasn't implemented initially? I'm just wondering if others who are more intimately familiar with the code could point out some downsides. Thanks for any insights you can provide.

Did you make a change to the informer's resyncPeriod? That will cause load on the API server, as the informer on each node will relist the API objects at the configured interval. What would be ideal is to design the controller so that it lists the cached objects at a periodic interval and performs a sync.

Either way, I think the problem was not missed events from the API server, but rather that the goroutine that crashed stopped processing events. @Quentin-M please feel free to submit a PR to perform reconciliation instead of a reset.
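
As a sketch of that pattern (assuming the cache.Store returned by cache.NewInformer is kept around), a periodic loop can walk the informer's local cache and replay each object through the existing update handler; the handler parameter here is illustrative and would have to be idempotent.

// Sketch only: periodic reconciliation from the informer's local cache.
package sketch

import (
	"time"

	networkingv1 "k8s.io/api/networking/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/cache"
)

func periodicSync(store cache.Store, period time.Duration, handleUpdate func(oldObj, newObj interface{}), stopCh <-chan struct{}) {
	wait.Until(func() {
		// store.List() reads the informer's in-memory cache; it does not hit the API server.
		for _, obj := range store.List() {
			if np, ok := obj.(*networkingv1.NetworkPolicy); ok {
				handleUpdate(np, np) // only safe if the handler is idempotent
			}
		}
	}, period, stopCh)
}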

@naemono
Contributor Author

naemono commented Feb 10, 2020

@murali-reddy
Either way, I think the problem was not missed events from the API server, but rather that the goroutine that crashed stopped processing events
The strange thing is, we're not seeing the crash; we are only seeing the network anomalies from the missing rules, which is why we're testing a reconciliation period. If the only downside is increased API server load, I think that's much preferable to having missing rules inside the cluster that cause failing network communication.

We're still testing the changes in our clusters, and will report back, likely with a PR that keeps the default at "0", but allows tuning of the reconciliation interval.

@murali-reddy
Contributor

The strange thing is, we're not seeing the crash; we are only seeing the network anomalies from the missing rules,

@naemono then we are possibly looking at a different issue than what @Quentin-M observed. While reconciliation should help fix the problem, there must be a latent issue that is not updating the ipset appropriately. It would be nice to get to the root cause of the problem. I cannot access the logs you shared. Please see if there are any unexpected deletions from the ipset.

@naemono
Contributor Author

naemono commented Feb 11, 2020

@bboreham
Contributor

@Quentin-M it would be better if you opened a separate issue since the two threads of conversation are hard to maintain.
The code reporting your panic is here:
https://github.com/kubernetes/apimachinery/blob/210bcde8e5a7715fe776460ca1df1a3f2738725e/pkg/util/runtime/runtime.go#L66

However, it appears to me that the intention is to re-raise the panic and hence exit the whole program. I am mystified as to why it keeps running.

@bboreham
Contributor

@naemono thank you very much for the logs, but I think the NPC log does not go back far enough: we only see it blocking things, not what led up to that state. Do you have earlier logs? (Or, if it is reproducible on restart, restart the pod and send us the whole log.)

@naemono
Contributor Author

naemono commented Feb 13, 2020

@bboreham @murali-reddy I see that the log level seems to default to info:

  --log-level=info                                           logging level (debug, info, warning, error)

Are the npc changes not shown at info level? As @ephur mentioned earlier, we're not seeing them at all.

@bboreham
Contributor

We actually set npc to debug level in the manifest. The problem I was commenting on is that your file starts at a time after the interesting part.

@naemono
Contributor Author

naemono commented Apr 3, 2020

I'm going to go ahead and close this. We have recently discovered that the panic referenced by @Quentin-M actually IS the root cause of our issue as well, and #3771 is the continued discussion. I'm opening a PR now to attempt to crash the whole npc app when an internal panic occurs.
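
For reference, the general shape of that change could be something like the sketch below. It assumes the vendored k8s.io/apimachinery of this era exposes utilruntime.PanicHandlers with the older single-argument signature; it is not the actual PR.

// Sketch only: make any panic that reaches apimachinery's crash handling fatal,
// so the container restarts instead of running on with a dead controller goroutine.
// Assumes PanicHandlers is []func(interface{}), as in the apimachinery vendored here.
package sketch

import (
	"log"
	"os"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

func init() {
	utilruntime.PanicHandlers = append(utilruntime.PanicHandlers, func(r interface{}) {
		log.Printf("fatal: panic in weave-npc controller: %v", r)
		os.Exit(1)
	})
}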

@bboreham
Contributor

bboreham commented Aug 4, 2020

I'm going to re-open this issue to cover the underlying symptom described at #3764 (comment): a panic causes a hang, not a restart.

#3771's description covers what should happen after a restart, although the subsequent discussion is mostly about what causes the hang.

@bboreham bboreham reopened this Aug 4, 2020
@bboreham bboreham added this to the 2.7 milestone Aug 4, 2020