
Helm upgrade failed: another operation (install/upgrade/rollback) is in progress #149

Closed
runningman84 opened this issue Nov 19, 2020 · 78 comments

Comments

@runningman84

Sometimes helm releases are not installed because of this error:

{"level":"info","ts":"2020-11-19T15:41:11.273Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 50.12655ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:41:11.274Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
{"level":"info","ts":"2020-11-19T15:43:19.310Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 69.439664ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:43:19.310Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
{"level":"info","ts":"2020-11-19T15:52:42.524Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 69.944579ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:52:42.525Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}

In this case the helm release is stuck in pending status.

We have not found any corresponding log entries for the actual installation. Is this a concurrency bug?

@hiddeco
Member

hiddeco commented Nov 19, 2020

Is this happening so often that it would be possible to enable the --log-level=debug flag for a while, so we can get better insight into what exactly Helm does?
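
For reference, a minimal sketch of how that flag could be enabled, assuming the flux-system namespace and that --log-level is the third entry (index 2) in the container args, as in the args list shown later in this thread:

# Hypothetical example: switch helm-controller to debug logging in place.
kubectl -n flux-system patch deployment helm-controller --type=json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/args/2", "value": "--log-level=debug"}]'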

@seaneagan
Contributor

seaneagan commented Nov 30, 2020

Not sure if it's related, but one potential source of releases getting stuck in pending-* would be non-graceful termination of the controller pods while a release action (install/upgrade/rollback) is in progress. I see that controller-runtime has some support for graceful shutdown; I'm not sure if we need to do anything to integrate with or test that, but it seems like at least lengthening the default termination grace period (currently 10 seconds) may make sense.
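
A minimal sketch of the grace-period suggestion above, assuming the controller runs as the helm-controller Deployment in flux-system; the 600-second value is only an example:

# Hypothetical example: give the controller 10 minutes to finish in-flight Helm actions on shutdown.
kubectl -n flux-system patch deployment helm-controller --type=merge \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":600}}}}'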

@seaneagan
Contributor

I also think it would be useful if Helm separated the deployment status from the wait status and allowed running the wait as standalone functionality, thus enabling recovery from waits that failed or were interrupted. I'll try to get an issue created for that.

@hiddeco
Member

hiddeco commented Nov 30, 2020

Not sure if it's related, but one potential source of releases getting stuck in pending-* would be non-graceful termination of the controller pods while a release action (install/upgrade/rollback) is in progress.

Based on feedback from another user, it does not seem to be related to pod restarts in all cases, but I am still waiting on logs to confirm this. I tried to build in some behavior to detect a "stuck version" in #166.

Technically, without Helm offering full support for a context that can be cancelled, the graceful shutdown period would always require a configuration value equal to the highest timeout a HelmRelease has. I tried to advocate for this (the context support) in helm/helm#7958, but due to the implementation difficulties this never got off the ground and ended up as a request to create a HIP.

@Athosone

We ran into the same issue at the company I work for.
We created a small tool that performs Helm operations. The issue occurs when the tool updates itself: there is a race condition where, if the old pod dies before Helm is able to update the status of the release, we end up in the exact same state.

We discovered Flux and wanted to use it.
I wonder how Flux handles this?

@brianpham

Running into this same issue when updating datadog in one of our clusters. Any suggestions on how to handle this?

{"level":"error","ts":"2021-02-01T18:54:59.609Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"datadog","namespace":"datadog","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}

@hiddeco
Member

hiddeco commented Feb 1, 2021

Can you provide additional information on the state of the release (as provided by helm), and what happened during the upgrade attempt (did for example the controller pod restart)?

@brianpham

The controller pod did not restart. I just see a bunch of the errors above in the helm-controller log messages.

I did notice, though, when digging into the history a bit, that it tried upgrading the chart on the 26th and failed, which is probably when I saw the error message that another operation was in progress.

➜  git:(main) helm history datadog --kube-context -n datadog                
REVISION        UPDATED                         STATUS          CHART           APP VERSION     DESCRIPTION      
1               Fri Jan 22 23:19:33 2021        superseded      datadog-2.6.12  7               Install complete 
2               Fri Jan 22 23:29:34 2021        deployed        datadog-2.6.12  7               Upgrade complete 
3               Tue Jan 26 04:13:46 2021        pending-upgrade datadog-2.6.13  7               Preparing upgrade

I was able to roll back to revision 2 and then ran the Helm reconcile, and it seems to have gone through just now.

@davidkarlsen

davidkarlsen commented Feb 22, 2021

Try kubectl describe helmreleases <therelease> and look at the events. In my case I believe it was caused by:

Events:
  Type    Reason  Age                  From             Message
  ----    ------  ----                 ----             -------
  Normal  info    47m (x3 over 47m)    helm-controller  HelmChart 'flux-system/postgres-operator-postgres-operator' is not ready
  Normal  error   26m (x4 over 42m)    helm-controller  reconciliation failed: Helm upgrade failed: timed out waiting for the condition

I did a helm upgrade by hand, and then it reconciled in flux too.

@monotek

monotek commented Feb 22, 2021

I see this constantly on GKE at the moment.

Especially if I try to recreate a cluster from scratch.
All the Flux pods are dying constantly because the k8s API can't be reached (kubectl refuses the connection too).

{"level":"error","ts":"2021-02-22T14:16:27.377Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

The helm-controller therefore also can't reach the source-controller:

Events:
  Type    Reason  Age                   From             Message
  ----    ------  ----                  ----             -------
  Normal  info    37m (x2 over 37m)     helm-controller  HelmChart 'infra/monitoring-kube-prometheus-stack' is not ready
  Normal  error   32m (x12 over 33m)    helm-controller  reconciliation failed: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
  Normal  error   32m (x13 over 33m)    helm-controller  Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
  Normal  error   28m (x3 over 28m)     helm-controller  Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
  Normal  error   28m (x3 over 28m)     helm-controller  reconciliation failed: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
  Normal  error   25m                   helm-controller  Get "http://source-controller.flux-system.svc.cluster.local./helmchart/infra/monitoring-kube-prometheus-stack/kube-prometheus-stack-13.5.0.tgz": dial tcp 10.83.240.158:80: connect: connection refused
  Normal  error   16m (x12 over 16m)    helm-controller  reconciliation failed: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
  Normal  error   5m50s (x18 over 17m)  helm-controller  Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
  Normal  info    33s                   helm-controller  HelmChart 'infra/monitoring-kube-prometheus-stack' is not ready

Not sure if Flux is the cause, by flooding the k8s API until some limits are reached?
I'm trying another master version (from 1.18.14-gke.1600 to 1.18.15-gke.1500) now. Let's see if it helps.
Edit: the update did not help.

@hiddeco
Member

hiddeco commented Feb 23, 2021

@monotek can you try setting the --concurrent flag on the helm-controller to a lower value (e.g. 2)?
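
A minimal sketch of one way to set that flag, assuming the default flux-system install; the JSON patch simply appends to the existing container args:

# Hypothetical example: limit helm-controller to 2 concurrent reconciles.
kubectl -n flux-system patch deployment helm-controller --type=json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--concurrent=2"}]'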

@monotek

monotek commented Feb 23, 2021

I'm using the fluxcd terraform provider.
Does it support altering this value?

My pod args look like:

  - args:
    - --events-addr=http://notification-controller/
    - --watch-all-namespaces=true
    - --log-level=info
    - --log-encoding=json
    - --enable-leader-election

So I guess the default value of 4 is used?

I've changed it via "kubectl edit deploy" for now.
Should I do this for the other controllers too?

The cluster has kind of settled, as there have been no new Flux pod restarts today.
The last installation of a Helm chart worked flawlessly, even without setting the value.

I'll give feedback on whether adjusting the value helps if we get an unstable master API again.

Thanks for your help :)

@sreedharbukya

I got the same issue. Please check the following first. I was not even able to list the release with the usual command

helm list -n <name-space>

This was responding empty, which is odd behavior from helm.

kubectl config get-contexts

Make sure your context is set to the correct Kubernetes cluster.

Then the next step is

helm history <release> -n <name-space> --kube-context <kube-context-name>

Try applying the rollback using the revision from the above command:

helm rollback <release> <revision> -n <name-space> --kube-context <kube-context-name>

@monotek

monotek commented Feb 25, 2021

Use the following command to see charts in all namespaces, including the ones where an installation is in progress.

helm list -Aa

@zkl94

zkl94 commented Feb 27, 2021

This is also happening in Flux 2. It seems to be the same problem and it is happening very frequently. I have to delete the failed HelmReleases to recreate them, and sometimes the recreation doesn't even work.

Before the HelmReleases failed, I wasn't modifying them in VCS. But somehow they failed all of a sudden.

{"level":"debug","ts":"2021-02-27T09:40:37.480Z","logger":"events","msg":"Normal","object":{"kind":"HelmRelease","namespace":"chaos-mesh","name":"chaos-mesh","uid":"f29fe041-67c6-4e87-9d31-ae4b74a056a0","apiVersion":"helm.toolkit.fluxcd.io/v2beta1","resourceVersion":"278238"},"reason":"info","message":"Helm upgrade has started"} {"level":"debug","ts":"2021-02-27T09:40:37.498Z","logger":"controller.helmrelease","msg":"preparing upgrade for chaos-mesh","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"chaos-mesh","namespace":"chaos-mesh"} {"level":"debug","ts":"2021-02-27T09:40:37.584Z","logger":"events","msg":"Normal","object":{"kind":"HelmRelease","namespace":"chaos-mesh","name":"chaos-mesh","uid":"f29fe041-67c6-4e87-9d31-ae4b74a056a0","apiVersion":"helm.toolkit.fluxcd.io/v2beta1","resourceVersion":"278238"},"reason":"error","message":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"} {"level":"debug","ts":"2021-02-27T09:40:37.585Z","logger":"events","msg":"Normal","object":{"kind":"HelmRelease","namespace":"chaos-mesh","name":"chaos-mesh","uid":"f29fe041-67c6-4e87-9d31-ae4b74a056a0","apiVersion":"helm.toolkit.fluxcd.io/v2beta1","resourceVersion":"277102"},"reason":"error","message":"reconciliation failed: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"} {"level":"info","ts":"2021-02-27T09:40:37.712Z","logger":"controller.helmrelease","msg":"reconcilation finished in 571.408644ms, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"chaos-mesh","namespace":"chaos-mesh"} {"level":"error","ts":"2021-02-27T09:40:37.712Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"chaos-mesh","namespace":"chaos-mesh","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99"}

@iacou

iacou commented Mar 3, 2021

Also saw this today... there was no previous revision to roll back to, so I had to delete the HelmRelease and start the reconciliation again.

@avacaru

avacaru commented Mar 4, 2021

I have encountered the same problem multiple times on different clusters. To fix the HelmRelease state I applied the workaround from this issue comment: helm/helm#8987 (comment) as deleting the HelmRelease could have unexpected consequences.

Some background that might be helpful in identifying the problem:

  • As part of a Jenkins pipeline I am upgrading the cluster (control plane and nodes) from 1.17 to 1.18, and immediately after that is finished I apply updated HelmRelease manifests -> reconciliation starts. Some manifests bring updates to existing releases, some bring in new releases (no previous Helm secret exists).
  • The helm-controller pod did not restart.

@rustrial

rustrial commented Mar 9, 2021

Same here, and we constantly have to run helm rollback ... && flux reconcile ... manually to fix it. What about adding a flag to HelmRelease to opt in to a self-healing approach, where the helm-controller would recognise HelmReleases in this state and automatically apply a rollback to them?
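
For reference, the manual remediation described above looks roughly like this (release name, namespace and revision are placeholders):

helm history <release> -n <namespace>               # find the last good revision (status "deployed" or "superseded")
helm rollback <release> <revision> -n <namespace>
flux reconcile helmrelease <release> -n <namespace>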

@mfamador

mfamador commented Mar 10, 2021

Same here. From what I could see, the Flux controllers are crashing while reconciling the HelmReleases and the charts stay in pending status.

❯ helm list -Aa
NAME              	NAMESPACE   	REVISION	UPDATED                                	STATUS         	CHART                       	APP VERSION
flagger           	istio-system	1       	2021-03-10 20:53:41.632527436 +0000 UTC	deployed       	flagger-1.6.4               	1.6.4
flagger-loadtester	istio-system	1       	2021-03-10 20:53:41.523101293 +0000 UTC	deployed       	loadtester-0.18.0           	0.18.0
istio-operator    	istio-system	1       	2021-03-10 20:54:52.180338043 +0000 UTC	deployed       	istio-operator-1.7.0
loki              	monitoring  	1       	2021-03-10 20:53:42.29377712 +0000 UTC 	pending-install	loki-distributed-0.26.0     	2.1.0
prometheus-adapter	monitoring  	1       	2021-03-10 20:53:50.218395164 +0000 UTC	pending-install	prometheus-adapter-2.12.1   	v0.8.3
prometheus-stack  	monitoring  	1       	2021-03-10 21:08:35.889548922 +0000 UTC	pending-install	kube-prometheus-stack-14.0.1	0.46.0
tempo             	monitoring  	1       	2021-03-10 20:53:42.279556436 +0000 UTC	pending-install	tempo-distributed-0.8.5     	0.6.0

And the helm releases:

Every 5.0s: kubectl get helmrelease -n monitoring                                                                                                                                    tardis.Home: Wed Mar 10 21:14:39 2021

NAME                 READY   STATUS                                                                             AGE
loki                 False   Helm upgrade failed: another operation (install/upgrade/rollback) is in progress   20m
prometheus-adapter   False   Helm upgrade failed: another operation (install/upgrade/rollback) is in progress   20m
prometheus-stack     False   Helm upgrade failed: another operation (install/upgrade/rollback) is in progress   16m
tempo                False   Helm upgrade failed: another operation (install/upgrade/rollback) is in progress   20m

After deleting a helmrelease so that it can be recreated again, the kustomize-controller is crashing:

kustomize-controller-689774778b-rqhsq manager E0310 21:17:29.520573       6 leaderelection.go:361] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/7593cc5d.fluxcd.io": context deadline exceeded
kustomize-controller-689774778b-rqhsq manager I0310 21:17:29.520663       6 leaderelection.go:278] failed to renew lease flux-system/7593cc5d.fluxcd.io: timed out waiting for the condition
kustomize-controller-689774778b-rqhsq manager {"level":"error","ts":"2021-03-10T21:17:29.520Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

helm uninstall for the pending-install releases seems to solve the problem sometimes, but most of the time the controllers are still crashing:

helm-controller-75bcfd86db-4mj8s manager E0310 22:20:31.375402       6 leaderelection.go:361] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/5b6ca942.fluxcd.io": context deadline exceeded
helm-controller-75bcfd86db-4mj8s manager I0310 22:20:31.375495       6 leaderelection.go:278] failed to renew lease flux-system/5b6ca942.fluxcd.io: timed out waiting for the condition
helm-controller-75bcfd86db-4mj8s manager {"level":"error","ts":"2021-03-10T22:20:31.375Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}
- helm-controller-75bcfd86db-4mj8s › manager
+ helm-controller-75bcfd86db-4mj8s › manager
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.976Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"setup","msg":"starting manager"}
helm-controller-75bcfd86db-4mj8s manager I0310 22:20:41.977697       7 leaderelection.go:243] attempting to acquire leader lease flux-system/5b6ca942.fluxcd.io...
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","msg":"starting metrics server","path":"/metrics"}
helm-controller-75bcfd86db-4mj8s manager I0310 22:21:12.049163       7 leaderelection.go:253] successfully acquired lease flux-system/5b6ca942.

@pjastrzabek

For all folks who do not experience helm-controller crashes: could you try adding a bigger timeout to the HelmRelease?
timeout: 30m

@mfamador

mfamador commented Mar 11, 2021

For all folks who do not experience helm-controller crashes: could you try adding a bigger timeout to the HelmRelease?
timeout: 30m

(Now I realize that I missed the NOT; the suggestion was only for the folks who are NOT experiencing crashes, so clearly not meant for me :D ...)

Adding timeout: 30m to the HelmRelease stuck in pending-install didn't prevent the controllers from crashing.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  chart:
    spec:
      chart: kube-prometheus-stack
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
      version: 14.0.1
  install:
    remediation:
      retries: 3
  interval: 1h0m0s
  releaseName: kube-prometheus-stack
  timeout: 30m

After ❯ helm uninstall kube-prometheus-stack -n monitoring, all controllers start crashing (this is Azure AKS):

Every 5.0s: kubectl get pods -n flux-system                    tardis.Home: Thu Mar 11 22:05:35 2021

NAME                                           READY   STATUS             RESTARTS   AGE
helm-controller-5cf7d96887-nz9rm               0/1     CrashLoopBackOff   15         23h
image-automation-controller-686ffd758c-b9vwd   0/1     CrashLoopBackOff   29         27h
image-reflector-controller-85796d5c4d-dtvjq    1/1     Running            28         27h
kustomize-controller-689774778b-rqhsq          0/1     CrashLoopBackOff   30         26h
notification-controller-769876bb9f-cb25k       0/1     CrashLoopBackOff   25         27h
source-controller-c55db769d-fwc7h              0/1     Error              30         27h

and the release gets stuck in pending:

❯ helm list -a -n monitoring
NAME              	NAMESPACE 	REVISION	UPDATED                                	STATUS         	CHART                       	APP VERSION
kube-prometheus-stack  	monitoring	1       	2021-03-11 22:08:23.473944235 +0000 UTC	pending-install	kube-prometheus-stack-14.0.1	0.46.0

@stefanprodan
Member

@mfamador your issue is different. Based on “Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/5b6ca942.fluxcd.io": context deadline exceeded” I would say that your AKS network is broken or the Azure proxy for the Kubernetes API is crashing. Please reach out to Azure support, as this is not something we can fix for you.

@mfamador

You're probably right, @stefanprodan, I'll reach out to them. But to clarify: this is a brand new AKS cluster, which has been destroyed and recreated from scratch multiple times, and it always ends up with Flux v2 crashing, most of the time when installing the kube-prometheus-stack Helm chart, other times with Loki or Tempo. We've been creating several AKS clusters and we're only seeing this when using Flux 2, so I find it hard to believe that it's an AKS problem.

@stefanprodan
Member

stefanprodan commented Mar 12, 2021

@mfamador if Flux leader election times out then I don’t see how any other controller would work, we don’t do anything special here, leader election is implemented with upstream Kubernetes libraries. Check out the AKS FAQ, seems that Azure has serious architectural issues as they use some proxy called tunnelfront or aks-link that you need to restart from time to time 😱 https://docs.microsoft.com/en-us/azure/aks/troubleshooting

Check whether the tunnelfront or aks-link pod is running in the kube-system namespace using the kubectl get pods --namespace kube-system command. If it isn't, force deletion of the pod and it will restart.

@hiddeco
Member

hiddeco commented Mar 12, 2021

If on AKS the cluster API for some reason becomes overwhelmed by the requests (that should be cached, sane, and not cause much pressure on an average cluster), another thing you may want to try is to trim down on the concurrent processing for at least the Helm releases / helm-controller by tweaking the --concurrent flag as described in #149 (comment).

@mfamador

Thanks, @stefanprodan and @hiddeco, I'll give it a try

@mfamador

@hiddeco, the controllers are still crashing after setting --concurrent=1 on helm-controller, I'll try with another AKS version

@hiddeco
Member

hiddeco commented Mar 12, 2021

Then I think it is as @stefanprodan describes, and probably related to some CNI/tunnel front issue in AKS.

@mfamador

mfamador commented Mar 12, 2021

Thanks, @hiddeco. Yes, I think you might be right. That raises many concerns about using AKS in production; I'll try another CNI to see if it gets better.

@marcocaberletti

Same issue for me after the upgrade to Flux v0.21.0, on 11/02 in the plot below.
I noticed a huge increase in memory usage for the helm-controller, so the pod is often OOM-killed and Helm releases stay in pending-upgrade status.

[screenshot: helm-controller memory usage plot]

@stefanprodan
Member

@marcocaberletti we fixed the OOM issues; we'll do a Flux release today. Please see: https://github.com/fluxcd/helm-controller/blob/main/CHANGELOG.md#0122

@marcocaberletti

Great! Thanks!

@Cobesz

Cobesz commented Dec 24, 2021

I got the same issue. Please check the following first. I was not even able to list the release with the usual command

helm list -n <name-space>

This was responding empty, which is odd behavior from helm.

kubectl config get-contexts

Make sure your context is set to the correct Kubernetes cluster.

Then the next step is

helm history <release> -n <name-space> --kube-context <kube-context-name>

Try applying the rollback using the revision from the above command:

helm rollback <release> <revision> -n <name-space> --kube-context <kube-context-name>

This saved my Christmas Eve dinner, thank you so much!

@kishoregv

We have experienced this behavior when a Helm upgrade fails and the helm-controller fails to update the status on the HelmRelease. The status update failure is due to the HelmRelease object having been modified.

At that point the HelmRelease is stuck in "another operation is in progress" even though no other operation is pending.

Maybe there should be a retry on the status updates in the helm-controller.

@stevejr

stevejr commented Apr 28, 2022

@hiddeco / @stefanprodan - is it possible to get an update on this issue?

@hiddeco
Member

hiddeco commented Apr 28, 2022

Yes.

The helm-controller is scheduled to see the same refactoring round as the source-controller recently did, in which reconciliation logic in the broadest sense will be improved and long-standing issues will be taken care of. I expect to start on this at the beginning of next week.

@artem-nefedov

That's great news. I sometimes feel that Helm is a second-class citizen in the Flux world; it's about time it got some love.

Don't get me wrong, I love Flux. But sometimes I do miss ArgoCD's approach, where everything is managed through the same object kind, "Application", meaning Helm gets all the same benefits as any other deployment.

@mjnagel

mjnagel commented Jun 14, 2022

Just going to throw some context here on where I'm seeing this issue...

It seems consistently tied to a helm-controller crash during reconciliation, leaving the sh.helm.release.x secret in a bad state (i.e. the secret still shows an operation in progress despite nothing happening). Originally some of the crashes we saw were due to OOM (since we were setting lower limits); bumping the limits partially resolved the issue. Currently the crashes/issues seem related to the k8s API being overwhelmed by new resources/changes. We have roughly 30 HelmReleases, some of which contain large amounts of manifests (and certainly slow down kube API connections while being applied/reconciled). The logs in the helm-controller seem to indicate an inability to reach the kube API, resulting in the crash:

Failed to update lock: Put "https://<API IP>:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/helm-controller-leader-election": context deadline exceeded

Not sure if this context is helpful or repetitive - I'd assume this behavior is already being tracked in relation to the refactoring mentioned above? I can provide additional logs and observations if that would help.
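
For what it's worth, a sketch of the limit bump mentioned above, assuming the default flux-system install and the container name "manager" seen in the logs earlier in this thread; 1Gi is only an example value:

# Hypothetical example: raise the helm-controller memory limit so long reconciliations are not OOM-killed.
kubectl -n flux-system patch deployment helm-controller \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"limits":{"memory":"1Gi"}}}]}}}}'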

@sharkztex

We've experienced issues where some of our releases get stuck on:

"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"

The only way we know how to get past this is by deleting the release. Would the refactoring of the helm-controller address this, or is there an alternative way to get the release rolled out without having to delete it?

helm-controller: v0.22.1
image-automation-controller: v0.23.2
image-reflector-controller: v0.19.1
kustomize-controller: v0.26.1
notification-controller: v0.24.0
source-controller: v0.24.4

@mjnagel

mjnagel commented Jun 16, 2022

We've experienced issues where some of our release get stuck on:

"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"

@sharkztex this is the same problem I commonly see. The workarounds I know of are:

# example w/ kiali
HR_NAME=kiali
HR_NAMESPACE=kiali
kubectl get secrets -n ${HR_NAMESPACE} | grep ${HR_NAME}
# example output:
sh.helm.release.v1.kiali.v1                                       helm.sh/release.v1                    1      18h
sh.helm.release.v1.kiali.v2                                       helm.sh/release.v1                    1      17h
sh.helm.release.v1.kiali.v3                                       helm.sh/release.v1                    1      17m
# Delete the most recent one:
kubectl delete secret -n ${HR_NAMESPACE} sh.helm.release.v1.${HR_NAME}.v3

# suspend/resume the hr
flux suspend hr -n ${HR_NAMESPACE} ${HR_NAME}
flux resume hr -n ${HR_NAMESPACE} ${HR_NAME}

Alternatively you can use helm rollback:

HR_NAME=kiali
HR_NAMESPACE=kiali

# Run a helm history command to get the latest release before the issue (should show deployed)
helm history ${HR_NAME} -n ${HR_NAMESPACE} 
# Use that revision in this command
helm rollback ${HR_NAME} <revision> -n ${HR_NAMESPACE} 
flux reconcile hr ${HR_NAME} -n ${HR_NAMESPACE}

@cgeisel

cgeisel commented Sep 19, 2022

I'm currently experiencing this error, but I'm also seeing an endless loop of Helm release upgrades every few seconds.

➜  ~ flux version
flux: v0.30.2
helm-controller: v0.21.0
kustomize-controller: v0.25.0
notification-controller: v0.23.5
source-controller: v0.24.4

I'm new to flux and helm, so I may be missing something obvious, but the output of some of the suggested commands does not have the expected results.

As you can see, one of my releases has 8702 revisions and counting.

➜  ~ helm list -A
NAME            	NAMESPACE     	REVISION	UPDATED                                	STATUS  	CHART                               	APP VERSION
089606735968    	ci-python-8aa8	1       	2022-08-26 17:09:01.725954063 +0000 UTC	deployed	irsa-service-account-0.1.0          	1.0.0
284950274094    	ci-python-8aa8	8702    	2022-09-19 23:30:38.61301278 +0000 UTC 	deployed	irsa-service-account-0.1.0          	1.0.0
karpenter       	karpenter     	1       	2022-06-02 20:13:18.670847828 +0000 UTC	deployed	karpenter-0.9.1                     	0.9.1
prd-default-east	gitlab-runner 	10      	2022-09-14 00:15:24.978728955 +0000 UTC	deployed	gitlab-runner-0.43.1                	15.2.1
splunk-connect  	splunk-connect	1       	2022-06-30 22:10:40.269231296 +0000 UTC	deployed	splunk-connect-for-kubernetes-1.4.15	1.4.15

As you can also see, the revision has gone up in between running these two commands:

➜  ~ helm history 284950274094 -n flux-system
REVISION	UPDATED                 	STATUS    	CHART                     	APP VERSION	DESCRIPTION
8697    	Mon Sep 19 23:28:36 2022	superseded	irsa-service-account-0.1.0	1.0.0      	Upgrade complete
8698    	Mon Sep 19 23:28:38 2022	superseded	irsa-service-account-0.1.0	1.0.0      	Upgrade complete
8699    	Mon Sep 19 23:29:37 2022	superseded	irsa-service-account-0.1.0	1.0.0      	Upgrade complete
8700    	Mon Sep 19 23:29:38 2022	superseded	irsa-service-account-0.1.0	1.0.0      	Upgrade complete
8701    	Mon Sep 19 23:30:37 2022	superseded	irsa-service-account-0.1.0	1.0.0      	Upgrade complete
8702    	Mon Sep 19 23:30:38 2022	superseded	irsa-service-account-0.1.0	1.0.0      	Upgrade complete
8703    	Mon Sep 19 23:31:37 2022	superseded	irsa-service-account-0.1.0	1.0.0      	Upgrade complete
8704    	Mon Sep 19 23:31:39 2022	superseded	irsa-service-account-0.1.0	1.0.0      	Upgrade complete
8705    	Mon Sep 19 23:32:37 2022	superseded	irsa-service-account-0.1.0	1.0.0      	Upgrade complete
8706    	Mon Sep 19 23:32:39 2022	deployed  	irsa-service-account-0.1.0	1.0.0      	Upgrade complete

There's no "good" revision to rollback to in the output. The latest one has status deployed but will be replaced shortly by a new revision.

In addition, when trying to suspend the helm release, I get this error message.

➜  ~ flux suspend hr -n flux-system 284950274094
✗ no HelmRelease objects found in flux-system namespace

Which confuses me, since the helm history command seems to think those revisions are part of the flux-system namespace.

@alex-berger
Contributor

This problem still persists. It is very annoying that we cannot rely on GitOps to eventually converge a cluster to the expected state, as it gets stuck on affected HelmRelease objects with Helm upgrade failed: another operation (install/upgrade/rollback) is in progress. To put it another way, a FluxCD-based GitOps setup with HelmReleases is not self-healing and needs a lot of (unpredictable) manual interventions to run helm rollback ... && flux reconcile hr ... commands in order to fix things.

Is there anything that prevents us from adding a new feature to the helm-controller to detect stuck (locked) HelmReleases and automatically fix them by rolling them back, immediately followed by a reconciliation?

@danports

I see this problem frequently, mainly with Helm upgrades that take a while to complete - e.g. kube-prometheus-stack, which takes 5-10 minutes to upgrade in one of my clusters. I almost never have this problem with upgrades that take only 1-2 minutes.

@stefanprodan
Member

@alex-berger @danports the locking issue seems to happen only if the helm-controller is OOMKilled (due to running out of memory) or SIGKILLed (if the node where it's running dies without evicting pods first). Is this the case for you?

@danports

danports commented Mar 3, 2023

I'll have to take a deeper look at logs/metrics to confirm, but that wouldn't surprise me, since I've been having intermittent OOM issues on the nodes where helm-controller runs. If that's the case, this seems less like a Flux bug and more like a node stability issue on my end, though a self-healing feature along the lines of what @alex-berger suggested would be nice.

@stefanprodan
Member

Helm itself places a lock when it starts an upgrade; if you kill Helm while it is doing so, it leaves the lock in place, preventing any further upgrade operations. Doing a rollback is very expensive and can have grave consequences for charts with StatefulSets or charts that contain hooks which perform DB migrations and other state-altering operations. We'll need to find a way to remove the lock without affecting the deployed workloads.
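
To illustrate where that lock lives (names are placeholders): the pending-* state is recorded in the newest Helm release record, stored as a Secret in the release namespace, and can be inspected without touching the workloads:

# Inspect the release history and the sh.helm.release.v1.* records with their status labels.
helm history <release> -n <namespace>
kubectl -n <namespace> get secrets -l owner=helm,name=<release> --show-labels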

@alex-berger
Contributor

Actually, I doubt this only happens on OOMKILL. We have very dynamic clusters, with Karpenter constantly replacing nodes (for good reasons), and thus it happens very often that the helm-controller pods are evicted (and terminated) in the middle of long-running HelmReleases. Note that long-running HelmReleases are not uncommon; with high-availability setups, rolling upgrades of Deployments, DaemonSets and especially StatefulSets can take dozens of minutes or even several hours.

My educated guess is that this problem happens whenever there are HelmReleases in progress and the helm-controller pod is forcefully terminated (SIGTERM, SIGKILL, OOMKILL, or any other unhandled signal). I am convinced that we have to anticipate this, though we can still try to improve the helm-controller and Helm to reduce the probability that it happens. Anyway, as this can still happen, we should have an automatic recovery capability built into the helm-controller (or Helm) to make sure such HelmReleases are automatically recovered.

After all, GitOps should not have to rely on humans waiting for pager calls just to manually run helm rollback ... && flux reconcile hr ....

@hiddeco
Member

hiddeco commented Mar 3, 2023

After #620, shutdown signals should now be handled properly; for OOMKILL we cannot gracefully shut down in time, for which #628 will be an option until unlocking can be handled safely.

@stefanprodan
Member

Actually, I doubt this only happens on OOMKILL. We have very dynamic clusters, with Karpenter constantly replacing nodes (for good reasons), and thus it happens very often that the helm-controller pods are evicted (and terminated) in the middle of long-running HelmReleases. Note that long-running HelmReleases are not uncommon; with high-availability setups, rolling upgrades of Deployments, DaemonSets and especially StatefulSets can take dozens of minutes or even several hours.

Until we figure out how to recover the Helm storage at restart, I suggest you move the helm-controller to EKS Fargate or some dedicated node outside of Karpenter; this would allow the helm-controller to perform hours-long upgrades without interruption.
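
A sketch of the dedicated-node idea, assuming the default flux-system install; the node label used below is hypothetical and must exist on the node reserved for the controller:

# Hypothetical example: pin helm-controller to a node that Karpenter does not recycle.
kubectl -n flux-system patch deployment helm-controller \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"dedicated":"flux"}}}}}'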

@alex-berger
Contributor

alex-berger commented Mar 8, 2023

Moving the helm-controller to EKS Fargate might mitigate the problem (a bit), and this is actually what we are currently working on.

However, as we are using Cilium (CNI), this also needs changes to the NetworkPolicy objects deployed by FluxCD. In particular, the allow-egress policy must be patched to allow traffic from (non-Cilium-managed) pods running on EKS Fargate to the (Cilium-managed) pods still running on EC2. We achieved this by changing it to something like this:

spec:
  podSelector: {}
  ingress:
    - {} # All ingress
  egress:
    - {} # All egress
  policyTypes:
    - Ingress
    - Egress

As you can see, the workaround using EKS Fargate weakens security, as we had to open up the NetworkPolicy quite a bit. In our case we can take the risk, as we only run trusted workloads on those clusters. But for other users this might be a no-go.

@hiddeco
Member

hiddeco commented Mar 10, 2023

Closing this in favor of #644. Thank you all!

@hiddeco hiddeco closed this as completed Mar 10, 2023
qlonik added a commit to qlonik/musical-parakeet that referenced this issue Jul 9, 2023
This was recommended as one of the suggestions in a bug report with the
similar issue fluxcd/helm-controller#149.