This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Flux can get stuck, producing no output/work, with no liveness check on AKS #1648

Closed
ellieayla opened this issue Jan 11, 2019 · 7 comments

@ellieayla
Contributor

ellieayla commented Jan 11, 2019

I upgraded to flux:1.8.2 (and helm-operator:0.5.2, via chart flux-0.5.2) about 24 hours ago. The previous version was flux:1.7.1 (from chart flux-0.3.4). The client is an unupgraded fluxctl-1.5.0, though I doubt that's relevant here. Some 18 hours ago, Flux stopped discovering new images and doing much useful work, and during those 18 hours there has been zero output from the container.

Last line on stdout:

ts=2019-01-11T00:26:22.280645985Z caller=warming.go:237 component=warmer canonical_name=quay.io/prometheus/node-exporter auth="REDACTED" trace="found cached manifest" ref=quay.io/prometheus/node-exporter:master last_fetched=2019-01-10T22:10:43Z deadline=2019-01-11T02:10:38Z

Current time:

2019-01-11T18:16:22+0000

The flux process is not completely dead, however. ~/Downloads/fluxctl_darwin_amd64.dms list-controllers completes normally, and a GET of http://localhost:3030/metrics yields Prometheus metrics: https://gist.github.com/./459d9c9f5207989ba326a9f405dc73ea

~/Downloads/fluxctl_darwin_amd64.dms sync fails with a timeout, though.

Synchronizing with [email protected]
Failed to complete sync job (ID "942a4e29-3bf1-165a-880d-05a2307d7e27")
Error: timeout
Run 'fluxctl sync --help' for usage.

Other change-centric activities, like locking a controller, also time out.


We timed out waiting for the result of the operation. This does not
necessarily mean it has failed. You can check the state of the
cluster, or commit logs, to see if there was a result. In general, it
is safe to retry operations.
Error: timeout
Run 'fluxctl lock --help' for usage.

While running the above, no additional output has been logged from the container.

Flux is using a normal/small amount of CPU, memory, and file handles:

NAME                                  CPU(cores)   MEMORY(bytes)   
flux-645959c797-f598r                 3m           31Mi        
$ kubectl -n flux exec -it flux-645959c797-f598r -- lsof | wc -l
      49

fluxd is being started with fairly vanilla arguments:

      --k8s-secret-name=flux-git-deploy
      --memcached-hostname=flux-memcached
      --git-url=PRIVATE.git
      --git-branch=master
      --git-path=prod-k8s
      --git-user=Flux.prod-k8s
      [email protected]
      --git-set-author=false
      --git-poll-interval=5m
      --git-timeout=20s
      --sync-interval=5m
      --git-ci-skip=false
      --git-label=flux-sync-prod
      --registry-poll-interval=5m
      --registry-rps=200
      --registry-burst=125
      --registry-trace=true

Sending SIGUSR1, SIGABRT, and SIGTERM to the Go process didn't produce any additional output in the container log, and PID 7 remained running.

I am cognizant that simply saying "it's hung" is often incorrect and an indicator of a lack of understanding of what's going on, but I don't see a way to get any additional information out. I expect that deleting this Pod will work around whatever is stuck.

I would like to see a Kubernetes liveness probe capable of detecting this broken state become part of the standard Helm chart.
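
For illustration, here is a rough sketch of the kind of probe I have in mind, applied ad hoc with kubectl patch rather than via the chart. The container name, the /api/flux/v6/identity.pub path, and the thresholds are all guesses on my part; any cheap endpoint served by fluxd's API listener on port 3030 would do.

# Sketch only: add an HTTP liveness probe against fluxd's API listener on port 3030.
# The path and thresholds below are placeholders, not a tested recommendation.
kubectl -n flux patch deployment flux --patch '
spec:
  template:
    spec:
      containers:
      - name: flux
        livenessProbe:
          httpGet:
            port: 3030
            path: /api/flux/v6/identity.pub
          initialDelaySeconds: 5
          timeoutSeconds: 5
          failureThreshold: 3
'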

@ellieayla
Contributor Author

ellieayla commented Jan 11, 2019

Per @stefanprodan at https://weave-community.slack.com/archives/C4U5ATZ9S/p1547233551535300, I tried the following, but gave up after 5 minutes (14:08 to 14:13) with nothing being produced.

/home/flux # wget http://localhost:3030/debug/pprof/trace?seconds=5
Connecting to localhost:3030 (127.0.0.1:3030)

Of note, GET http://localhost:3030/metrics was producing results before I created this ticket, but now it is also not returning results. That could mean there is a causal relationship between the /metrics and /debug/pprof/trace behaviour, though I suspect they are both victims of a common cause. (Naively, I would suspect thread pool exhaustion.)
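
If this recurs, a goroutine dump is probably a better first step than a 5-second trace, since it returns immediately and shows exactly what each goroutine is blocked on. This assumes fluxd registers the stock net/http/pprof handlers on the same listener, which the /debug/pprof/trace path above suggests it does.

# Full goroutine stacks with their blocking states (debug=2), run from inside the pod.
# If even this hangs, the HTTP listener itself is wedged, which is a data point in itself.
wget -q -O - 'http://localhost:3030/debug/pprof/goroutine?debug=2'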

@ellieayla
Contributor Author

ellieayla commented Jan 11, 2019

I deleted the pod (kubectl -n flux delete pod flux-645959c797-f598r). The replacement flux pod is producing plenty of output, has picked up some images that were created (and missed) in the interim, has updated some resources, and appears to be healthy.

@ellieayla
Contributor Author

ellieayla commented Jan 11, 2019

I am running in AKS. There have been problems with processes (#1447, Azure/AKS#676) talking to the AKS api-server, especially with long-running watches. My pods do have the KUBERNETES_PORT, KUBERNETES_SERVICE_HOST, etc. environment variables pointing to the public FQDN, as Azure/AKS#676 recommends.
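
For anyone wanting to check their own install, the quickest way to see which API server endpoint a pod was actually given is to read its environment; the pod name below is just a placeholder for whatever the current flux pod is called.

# Show the API server host/port variables as seen inside the flux pod (substitute the pod name).
kubectl -n flux exec <flux-pod-name> -- env | grep '^KUBERNETES_'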

I have found another process talking to the same AKS api-server (though from a Jenkins worker running outside the cluster, on a different VNet, in the same Azure subscription) which also had a problem.

This other process is written in Python. It dumps the server version (successfully) and then starts a watch (all namespaces, for pods). The Python process's watch produced no output during the same time window as the shenanigans in my original report, despite changes happening to running pods.

This is all quite circumstantial. It's possible both flux and that other process were victims of AKS api-server intermittency, and that neither tried to simply drop the network connection and create a new watch. I don't know enough about flux's internal design to know whether that would cause sync to time out.

@stefanprodan stefanprodan changed the title Flux can get stuck, producing no output/work, with no liveness check Flux can get stuck, producing no output/work, with no liveness check on AKS Jan 12, 2019
@squaremo
Member

fluxd doesn't keep watches (unless they're internal to the Kubernetes client), but it is a heavy user of the Kubernetes API otherwise. Much of the action is driven from a single goroutine: syncing, update jobs, and image polling are all done from one loop. (The flux API and metrics have their own goroutines.)

It's plausible that it got stuck waiting for an API response that was never to arrive -- though we are pretty careful to use contexts with timeouts.

@2opremio 2opremio added the bug label Jan 14, 2019
@ellieayla
Contributor Author

I have not observed this behaviour again in the last 4 weeks.

@ellieayla
Contributor Author

I have not observed this behavior again in the last 13 weeks. Closing. Please reopen if a reproducible situation is identified.

@squaremo
Member

squaremo commented Apr 8, 2019

@alanjcastonguay Thanks for patiently keeping an eye on this issue, and tidying it away.
