Flux can get stuck, producing no output/work, with no liveness check on AKS #1648
Per @stefanprodan's suggestion at https://weave-community.slack.com/archives/C4U5ATZ9S/p1547233551535300, I tried the following, though I gave up after 14:13 - 14:08 = 5 minutes with nothing being produced.
Of note, […]

I deleted the pod ([…]).
I am running in AKS. There have been problems with processes (#1447, Azure/AKS#676) talking to the AKS api-server, especially with long-running watches. My pods do have the […].

I have found another process talking to the same AKS api-server (though from a Jenkins worker, running outside the cluster, on a different vnet, in the same Azure subscription) which also had a problem. This other process is written in Python. It dumps the server version (successfully) and then starts a watch (all namespaces, for pods). The Python process's watch produced no output during the same time window as the comment 0 shenanigans, despite changes happening to running pods.

This is all quite circumstantial. It's possible both flux and that other process were victims of AKS api-server intermittency, and neither tried to simply drop the network connection and create a new watch. I don't know enough about flux's internal design to know whether that would cause […]
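For illustration only, since this thread doesn't show how either flux or the Python client handles watches internally, here is a minimal client-go sketch of the remedy alluded to above: bound each watch with a timeout and re-establish it when it ends, so a silently dead connection to the api-server cannot stall the consumer forever.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Ask the api-server to end the watch after at most 5 minutes, and
		// give up client-side shortly after that even if it stays silent.
		serverTimeout := int64(300)
		ctx, cancel := context.WithTimeout(context.Background(), 6*time.Minute)

		w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{
			TimeoutSeconds: &serverTimeout,
		})
		if err != nil {
			log.Printf("watch failed: %v; retrying", err)
			cancel()
			time.Sleep(10 * time.Second)
			continue
		}
		for ev := range w.ResultChan() {
			log.Printf("pod event: %s", ev.Type)
		}
		// The event channel closed (timeout, error, or connection loss):
		// fall through and establish a fresh watch.
		cancel()
	}
}
```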
fluxd doesn't keep watches (unless they are internal to the Kubernetes client), but it is a heavy user of the Kubernetes API otherwise. Much of the action is driven from a single goroutine: syncing, update jobs, and image polling are all done from one loop. (The flux API and metrics have their own goroutines.) It's plausible that it got stuck waiting for an API response that was never going to arrive -- though we are pretty careful to use contexts with timeouts.
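For concreteness, the "contexts with timeouts" pattern looks roughly like the sketch below with a recent client-go; this is an illustration, not fluxd's actual code.

```go
package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPodsWithDeadline wraps a single API call in a context with a
// deadline, so a response that never arrives surfaces as an error
// (wrapping context.DeadlineExceeded) instead of blocking the single
// work loop forever.
func listPodsWithDeadline(client kubernetes.Interface) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	_, err := client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{})
	return err
}
```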
I have not observed this behaviour again in the last 4 weeks.
I have not observed this behavior again in the last 13 weeks. Closing. Please reopen if a reproducible situation is identified.
@alanjcastonguay Thanks for patiently keeping an eye on this issue, and tidying it away |
I upgraded to `flux:1.8.2` (and `helm-operator:0.5.2` via chart `flux-0.5.2`) about 24h ago. The previous version was `flux:1.7.1` (from chart `flux-0.3.4`). Client is an unupgraded `fluxctl-1.5.0`, though I doubt that's relevant here.

Some 18 hours ago, Flux stopped discovering new images or doing much useful work. During this 18h there's been zero output from the container.

Last line on stdout: […]
Current time: […]

The flux process is not completely dead, however. `~/Downloads/fluxctl_darwin_amd64.dms list-controllers` completes normally, and a `GET http://localhost:3030/metrics` yields prometheus metrics (https://gist.github.com/./459d9c9f5207989ba326a9f405dc73ea). `~/Downloads/fluxctl_darwin_amd64.dms sync` fails with a timeout, though. Other change-centric activities, like locking a controller, also time out. While running the above there has been no additional output logged from the container.

Flux is using a normal/small amount of cpu/memory/handles, and `fluxd` is being started with somewhat vanilla arguments. Sending `SIGUSR1`, `SIGABRT`, and `SIGTERM` to the go process didn't produce anything additional in the container log, and pid 7 remained running.

I am cognizant that simply saying "it's hung" is often incorrect and an indicator of a lack of understanding of what's going on, though I don't see a way to get any additional information out. I expect that deleting this Pod will work around whatever's stuck.
I would like to see a Kubernetes liveness check capable of noticing this broken state become part of the standard Helm chart.
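One possible shape for that, sketched below under stated assumptions rather than taken from the chart: a small exec-probe helper that calls the daemon's local API with a hard deadline and exits non-zero on timeout. Note that in the incident above /metrics and list-controllers still answered while sync hung, so a useful probe would have to exercise the sync/update path; the URL in the sketch is only a placeholder.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"time"
)

func main() {
	// Hard deadline for the whole check; a wedged daemon shows up as a
	// timeout here and the helper exits non-zero, failing the probe.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()

	// Placeholder endpoint: a real check would need to hit something on
	// the daemon's sync/update path, which is what actually wedged here;
	// /metrics alone would not have caught it.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://localhost:3030/...", nil)
	if err != nil {
		os.Exit(1)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		os.Exit(1) // timeout or connection failure => probe fails
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		os.Exit(1)
	}
}
```

The chart would then run this as an exec livenessProbe with a generous period and timeout, so the check itself doesn't add load or cause false restarts.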