Helm upgrade failed: another operation (install/upgrade/rollback) is in progress #149
Is this happening so often that it would be possible to enable the |
Not sure if it's related, but one potential source of releases getting stuck in pending-* would be non-graceful termination of the controller pods while a release action (install/upgrade/rollback) is in progress. I see that controller-runtime has some support for graceful shutdown; I'm not sure whether we need to do anything to integrate with or test that, but it seems like at least lengthening the default termination grace period (currently 10 seconds) may make sense.
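For anyone who wants to experiment with this, a minimal sketch of a kustomize patch that lengthens the grace period on the helm-controller Deployment (the flux-system layout is the standard bootstrap one; the 300-second value is an arbitrary assumption, not a recommendation from the maintainers):

```yaml
# flux-system/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: helm-controller
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: helm-controller
        namespace: flux-system
      spec:
        template:
          spec:
            # Assumption: give in-flight Helm actions up to 5 minutes to finish on shutdown
            terminationGracePeriodSeconds: 300
```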
I also think it would be useful if Helm separated the deployment status from the wait status, and allowed running the wait as standalone functionality, thus enabling recovery from waits that failed or were interrupted. I'll try to get an issue created for that.
Based on feedback from another user, it does not seem to be related to pod restarts all the time, but still waiting on logs to confirm this. I tried to build in some behavior to detect a "stuck version" in #166. Technically, without Helm offering full support for a context that can be cancelled, the graceful shutdown period would always require a configuration value equal to the highest timeout a |
We got the same issue in the company I work for. We discovered fluxcd and we wanted to use it, |
Running into this same issue when updating datadog in one of our clusters. Any suggestions on how to handle this?
Can you provide additional information on the state of the release (as provided by |
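(For anyone landing here later: the exact command asked for above was lost in the quote, but the release state can typically be inspected with standard Helm commands; `<release>` and `<namespace>` are placeholders.)

```sh
helm status <release> -n <namespace>    # current status, e.g. deployed, failed, pending-upgrade
helm history <release> -n <namespace>   # per-revision status, useful for spotting a stuck pending-* revision
```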
The controller pod did not restart. I just see a bunch of the error above in the helm-controller log messages. I did notice something, though, when digging into the history a bit: it tried upgrading the chart on the 26th and failed, which is probably when I saw the error message that there was another operation in progress.
I was able to do a rollback to revision 2 and then ran the HelmRelease reconcile, and it seems to have gone through just now.
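For reference, the manual recovery described above roughly corresponds to the following; the names are placeholders and revision 2 is just the value from this comment:

```sh
# Roll the Helm release back to the last known-good revision…
helm rollback <release> 2 -n <namespace>
# …then ask Flux to reconcile the HelmRelease again
flux reconcile helmrelease <name> -n <namespace>
```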
Try
I did a helm upgrade by hand, and then it reconciled in flux too.
I see this constantly on GKE at the moment, especially if I try to recreate a cluster from scratch.
The helm-controller therefore also cannot reach the source-controller:
Not sure if Flux is the cause, flooding the Kubernetes API until some limits are reached?
@monotek can you try setting the |
I'm using the fluxcd terraform provider. My pod args look like:
So I guess the default value of 4 is used? I've changed it via "kubectl edit deploy" for now. The cluster has kind of settled, as there have been no new fluxcd pod restarts today. I'll give feedback on whether adjusting the value helps if the master API becomes unstable again. Thanks for your help :)
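The flag under discussion got lost in the quoting above; assuming it is helm-controller's `--concurrent` option (whose default of 4 matches this comment), a sketch of lowering it declaratively with a kustomize patch instead of `kubectl edit`:

```yaml
# Sketch: appended to the patches list of the flux-system kustomization.yaml
patches:
  - target:
      kind: Deployment
      name: helm-controller
    patch: |
      # Append the flag to the controller's args; 2 is an arbitrary example value
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: "--concurrent=2"
```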
I got the same issue. Please check the following first: I was not even able to list the release with the usual command —
it was responding empty. Funny behavior from Helm.
Make sure your context is set to the correct Kubernetes cluster. The next step is to
try applying the rollback based on the output of the above command.
Use the following command to also see charts in all namespaces, including the ones where installation is in progress.
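(The command itself was lost above; a likely equivalent using standard Helm CLI flags:)

```sh
# List releases across all namespaces, including failed/superseded ones
helm list --all --all-namespaces
# Only releases stuck in a pending-install/upgrade/rollback state
helm list --pending --all-namespaces
```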
This is also happening in Flux 2. It seems to be the same problem and it happens very frequently. I have to delete the failed HelmReleases to recreate them, and sometimes the recreation doesn't even work. Before the HelmReleases failed, I wasn't modifying them in VCS; somehow they failed all of a sudden.
Also saw this today... no previous revision to roll back to, so I had to delete the HelmRelease and start the reconciliation again.
I have encountered the same problem multiple times on different clusters. To fix the HelmRelease state I applied the workaround from this issue comment: helm/helm#8987 (comment), as deleting the HelmRelease could have unexpected consequences. Some background that might be helpful in identifying the problem:
Same here, and we constantly manually apply |
For all folks who do not experience helm-controller crashes: could you try adding a bigger timeout to the HelmRelease?
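For reference, the timeout is set on the HelmRelease spec; a minimal sketch (the API version reflects the v2beta1 API current at the time, and the chart, namespace, and 30m value are placeholders):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 5m
  # Allow slow upgrades to finish instead of timing out mid-operation
  timeout: 30m
  chart:
    spec:
      chart: kube-prometheus-stack
      sourceRef:
        kind: HelmRepository
        name: prometheus-community   # placeholder source name
```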
(Now I realized that I'd missed the NOT; the suggestion was only for the folks who are NOT experiencing crashes, so clearly not meant for me :D ...) Adding
After:
and the release gets pending:
@mfamador your issue is different. Based on "Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/5b6ca942.fluxcd.io": context deadline exceeded", I would say that your AKS network is broken or the Azure proxy for the Kubernetes API is crashing. Please reach out to Azure support, as this is not something we can fix for you.
You're probably right @stefanprodan, I'll reach out to them. But to clarify, this is a brand new AKS cluster, which has been destroyed and recreated from scratch multiple times, and it always ends up with Flux v2 crashing, most of the time when installing the kube-prometheus-stack helm chart, other times with Loki or Tempo. We've been creating several AKS clusters and we're only seeing this when using Flux 2, so I find it hard to believe that it's an AKS problem.
@mfamador if Flux leader election times out, then I don't see how any other controller would work; we don't do anything special here, leader election is implemented with upstream Kubernetes libraries. Check out the AKS FAQ: it seems that Azure has serious architectural issues, as they use a proxy called tunnelfront or aks-link that you need to restart from time to time 😱 https://docs.microsoft.com/en-us/azure/aks/troubleshooting
If on AKS the cluster API for some reason becomes overwhelmed by the requests (that should be cached, sane, and not cause much pressure on an average cluster), another thing you may want to try is to trim down on the concurrent processing for at least the Helm releases / helm-controller by tweaking the |
Thanks, @stefanprodan and @hiddeco, I'll give it a try
@hiddeco, the controllers are still crashing after setting |
Then I think it is as @stefanprodan describes, and probably related to some CNI/tunnelfront issue in AKS.
Thanks, @hiddeco, yes, I think you might be right. That brings me many concerns about using AKS in production; I'll try another CNI to see if it gets better.
@marcocaberletti we fixed the OOM issues, we'll do a flux release today. Please see: https://github.com/fluxcd/helm-controller/blob/main/CHANGELOG.md#0122
Great! Thanks! |
This saved my Christmas Eve dinner, thank you so much!
We have experienced this behavior when a helm upgrade fails and helm-controller then fails to update the status on the HelmRelease. The status update fails because the HelmRelease object has been modified. At that point the HelmRelease is stuck in "another operation is in progress" even though no other operation is pending. Maybe there should be a retry on status updates in helm-controller.
@hiddeco / @stefanprodan - is it possible to get an update on this issue?
Yes. The helm-controller is scheduled to see the same refactoring round as the source-controller recently did, in which reconciliation logic in the broadest sense will be improved and long-standing issues will be taken care of. I expect to start on this at the beginning of next week.
That's great news. I sometimes feel that Helm is a second-class citizen in the Flux world; it's about time it got some love. Don't get me wrong, I love Flux. But sometimes I do miss ArgoCD's approach, where everything is managed through the same object kind, "Application", meaning Helm gets all the same benefits as any other deployment.
Just going to throw some context here on where I'm seeing this issue... It seems consistently tied to a helm-controller crash during reconciliation and leaving the
Not sure if this context is helpful or repetitive - I'd assume this behavior is already being tracked in relation to the refactoring mentioned above? I can provide additional logs and observations if that would help.
We've experienced issues where some of our releases get stuck on:
The only way we know to get past this is by deleting the release. Would the refactoring of the helm-controller address this, or is there an alternative way to get the release rolled out without having to delete it?
@sharkztex this is the same problem I commonly see. The workarounds I know of are:

```sh
# example w/ kiali
HR_NAME=kiali
HR_NAMESPACE=kiali

kubectl get secrets -n ${HR_NAMESPACE} | grep ${HR_NAME}
# example output:
#   sh.helm.release.v1.kiali.v1   helm.sh/release.v1   1   18h
#   sh.helm.release.v1.kiali.v2   helm.sh/release.v1   1   17h
#   sh.helm.release.v1.kiali.v3   helm.sh/release.v1   1   17m

# Delete the most recent one:
kubectl delete secret -n ${HR_NAMESPACE} sh.helm.release.v1.${HR_NAME}.v3

# Suspend/resume the HelmRelease
flux suspend hr -n ${HR_NAMESPACE} ${HR_NAME}
flux resume hr -n ${HR_NAMESPACE} ${HR_NAME}
```

Alternatively you can use helm rollback:

```sh
HR_NAME=kiali
HR_NAMESPACE=kiali

# Run a helm history command to get the latest revision before the issue (should show deployed)
helm history ${HR_NAME} -n ${HR_NAMESPACE}

# Use that revision in this command
helm rollback ${HR_NAME} <revision> -n ${HR_NAMESPACE}
flux reconcile hr ${HR_NAME} -n ${HR_NAMESPACE}
```
I'm currently experiencing this error, but I'm seeing an endless loop of Helm releases every few seconds.
I'm new to Flux and Helm, so I may be missing something obvious, but the output of some of the suggested commands does not have the expected results. As you can see, one of my releases has 8702 revisions and counting.
As you can also see, the revision has gone up in between running these two commands:
There's no "good" revision to roll back to in the output; the latest one has status …. In addition, when trying to suspend the HelmRelease, I get this error message.
Which confuses me, since the |
This problem still persists. It is very annoying that we cannot rely on GitOps to eventually converge a cluster to the expected state, as it gets stuck on affected HelmRelease objects (stuck in …). Is there anything that prevents us from adding a new feature to the helm-controller to detect stuck (locked) HelmReleases and automatically fix them with a rollback, immediately followed by a reconciliation?
I see this problem frequently, mainly with Helm upgrades that take a while to complete - e.g. |
@alex-berger @danports the locking issue seems to happen only if helm-controller is OOMKilled (due to out of memory) or SIGKILLed (if the node where it's running dies without evicting pods first). Is this the case for you?
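A quick way to check for the OOMKill case (assuming a standard flux-system install, where the pods carry an `app: helm-controller` label):

```sh
# Show the previous container state of the helm-controller pod(s); "reason:OOMKilled" means
# the last container instance was killed for exceeding its memory limit
kubectl -n flux-system get pods -l app=helm-controller \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].lastState}{"\n"}{end}'
```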
I'll have to take a deeper look at logs/metrics to confirm, but that wouldn't surprise me, since I've been having intermittent OOM issues on the nodes where helm-controller runs. If that's the case, this seems less like a Flux bug and more like a node stability issue on my end, though a self-healing feature along the lines of what @alex-berger suggested would be nice.
Helm itself places a lock when it starts an upgrade; if you kill Helm while it's doing so, it leaves the lock in place, preventing any further upgrade operations. Doing a rollback is very expensive and can have grave consequences for charts with StatefulSets or charts that contain hooks which perform database migrations and other state-altering operations. We'll need to find a way to remove the lock without affecting the deployed workloads.
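For context, the "lock" is effectively the pending-* status recorded on the latest Helm storage secret. With the default Secrets storage backend, releases stuck in that state can be listed like this (the label names are the ones Helm puts on its release secrets; this only lists them, it does not remove the lock):

```sh
# Releases whose latest recorded action never completed
kubectl get secrets --all-namespaces -l owner=helm,status=pending-install
kubectl get secrets --all-namespaces -l owner=helm,status=pending-upgrade
kubectl get secrets --all-namespaces -l owner=helm,status=pending-rollback
```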
Actually, I doubt this only happens on …. My educated guess is that this problem happens whenever there are …. After all, GitOps should not have to rely on humans waiting for pager calls just to manually run ….
Until we figure out how to recover the Helm storage at restart, I suggest you move helm-controller to EKS Fargate or some dedicated node outside of Karpenter; this would allow helm-controller to perform hours-long upgrades without interruption.
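A sketch of what pinning helm-controller to a dedicated (non-Karpenter) node group could look like, assuming the nodes carry a `dedicated: flux` label and a matching taint (both placeholder names):

```yaml
# Kustomize strategic-merge patch for the helm-controller Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helm-controller
  namespace: flux-system
spec:
  template:
    spec:
      nodeSelector:
        dedicated: flux          # placeholder node label
      tolerations:
        - key: dedicated         # placeholder taint key
          operator: Equal
          value: flux
          effect: NoSchedule
```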
Moving helm-controller to EKS Fargate might mitigate the problem (a bit), and this is actually what we are currently working on. However, as we are using Cilium (CNI), this also requires changes to the NetworkPolicy objects deployed by FluxCD, especially this spec:

```yaml
spec:
  podSelector: {}
  ingress:
    - {} # All ingress
  egress:
    - {} # All egress
  policyTypes:
    - Ingress
    - Egress
```

As you can see, the work-around using EKS Fargate weakens security, as we had to open up the NetworkPolicy quite a bit. In our case we can take the risk because we only run trusted workloads on those clusters, but for other users this might be a no-go.
Closing this in favor of #644. Thank you all!
Sometimes Helm releases are not installed because of this error:
In this case the Helm release is stuck in a pending status.
We have not found any corresponding log entry for the actual installation. Is this some concurrency bug?