Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster-state manifest drift correction #186

Closed
ruchikaguptaa opened this issue Jan 7, 2021 · 36 comments
Closed

Cluster-state manifest drift correction #186

ruchikaguptaa opened this issue Jan 7, 2021 · 36 comments
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@ruchikaguptaa
Copy link

ruchikaguptaa commented Jan 7, 2021

Suppose HelmRelease object deploys a helm chart that contains a deployment. The pod resource can now be updated by the user. Until there is a chart version update in HelmRelease object, the pod will continue to have the manually updated setting. I'm looking for a way to force reconcile the resources deployed by the HelmRelease after every x minutes. Does helm-controller support this?

There are a few other intervals like one used by HelmRepository and HelmRelease but they don't seem to reconcile a previous succeeded release unless there is a new version in the HelmRelease. requeueDependency interval is only responsible for reconcile if the dependencies for that helmrelease failed. These seem to be unrelated to my scenario.

@ruchikaguptaa
Copy link
Author

Tagging a few folks as per the PRs to the reconcile helmchart part of code @hiddeco @seaneagan .

@hiddeco
Copy link
Member

hiddeco commented Jan 22, 2021

The helm-controller does not support this at present, nor did the Helm Operator.

I have been playing around with proof of concepts to determine changes based on the manifest blob that is stored in the Helm storage based on 2 and 3-way merges, but ran into some issues around mutating cluster webhooks that may modify the object when it is applied on the cluster by Helm.

That being said (and done), I do want to eventually support this, but do not have a time frame on when.

@brondum
Copy link

brondum commented Mar 17, 2021

@hiddeco any updates, just realised that a deployment was deleted but helm was yay happy.
Could be super cool if flux could detect the state was undesired :)

@ppodevlabs
Copy link

we are looking forward to have this feature!! Any updates?

@cmd-werner-diers
Copy link

Facing the same issue. Reconciliation does not work with resources deployed by Helm.

@rxbn
Copy link

rxbn commented Apr 24, 2021

i'm also looking forward to have this feature! what's the current status?

@tmk1991
Copy link

tmk1991 commented Jun 8, 2021

You could try to do a cronjob with manual reconciliation

flux reconcile hr {{your hr}} --with-source -n {{your ns}}

@Jeinhaus
Copy link

Jeinhaus commented Jul 12, 2021

The only way currently to restore the helmrelease to the defined state is to reinstall or upgrade it, or am I missing something?

@Legion2
Copy link

Legion2 commented Sep 8, 2021

I deleted a crd (and all its crs) and recreated it, however helm controller does not recreate the objects from the helm release. Now my cluster is in an inconsistent state and it does not recover. Even if I call flux reconcile helmrelease it does not recreate the deleted resources from the helm template.

@hiddeco
Copy link
Member

hiddeco commented Oct 1, 2021

After consultation with the Helm maintainers, the blue print for this is in: https://github.com/bacongobbler/helm-patchdiff/blob/8cbbc9b4716330fd4e69b5fea93320fa5da1c352/main.go#L289-L294. This could be implemented after the planned work in https://github.com/fluxcd/helm-controller/milestone/1 has landed (and our testing suite has been strengthened).

@jasonyic
Copy link

flux suspend ..., then flux resume ... will do the trick to recreate the resource which being deleted manually

@mcammisa78
Copy link

we are looking forward to have this feature, Any updates about when it could be available?

@pkit
Copy link

pkit commented Dec 23, 2021

Should be there in the docs: the moment you start using HelmRelease = you lose all the GitOps flow

@mcammisa78
Copy link

Any news about this issue? @hiddeco is there any plan for supporting reconciliation in case of changes manually applied by user ?

@kingdonb
Copy link
Member

kingdonb commented Mar 31, 2022

This topic comes up a lot, I wanted to reiterate these thoughts I placed on #267 originally, but while they don't exactly belong on that discussion, they can perhaps go here:


Helm uses a 3-way merge to incorporate: (1) the old manifest, (2) its live state on the cluster, and (3) the new manifest.

Helm also provides various opportunities for the installing agent to manipulate the state before, around and after install, both within the chart (lifecycle hooks) that are permitted to be non-deterministic (and so may have outputs that are not declaratively formed), and also outside of the chart, with Post Render behavior, that outside of Flux, would also be capable of non-deterministic behavior.

Simplified: The installer can create secrets on the first run (helm install), that won't be generated again during a second run (helm upgrade).

Though Flux makes the postrender behavior declarative, it's still running the Helm code under the hood, which does not assume declarativeness, and that means we can't assume anything that Helm itself would not have assumed or guaranteed.

Every upgrade produces a new release artifact, in the form of a secret of the type: helm.sh/release.v1

So when you have just installed for the first time and you have a handful of Helm charts, your cluster-wide secrets may look like this:

chartmuseum           sh.helm.release.v1.chartmuseum-chartmuseum.v1                  helm.sh/release.v1                    1      4d13h
deis                  sh.helm.release.v1.deis-hephy.v1                               helm.sh/release.v1                    1      4d13h
harbor                sh.helm.release.v1.harbor.v1                                   helm.sh/release.v1                    1      4d13h
ingress-nginx         sh.helm.release.v1.ingress-internal.v1                         helm.sh/release.v1                    1      4d13h
ingress-nginx         sh.helm.release.v1.ingress-public.v1                           helm.sh/release.v1                    1      4d13h
kube-oidc-proxy       sh.helm.release.v1.kube-oidc-proxy.v1                          helm.sh/release.v1                    1      4d11h
minio-lusers          sh.helm.release.v1.minio-lusers-minio.v1                       helm.sh/release.v1                    1      4d13h
minio-stage           sh.helm.release.v1.minio-stage-minio.v1                        helm.sh/release.v1                    1      4d13h
monitoring            sh.helm.release.v1.kube-prometheus-stack.v1                    helm.sh/release.v1                    1      2d22h
monitoring            sh.helm.release.v1.kube-prometheus-stack.v2                    helm.sh/release.v1                    1      2d22h
openvpn               sh.helm.release.v1.openvpn.v1                                  helm.sh/release.v1                    1      4d13h

Three-way diffs are complicated, even more complicated than three-way merges. Now imagine instead of comparing the inputs to the outputs, Helm controller just runs an upgrade each time, producing a new revision and a new versioned secret.

The two secrets shown here for v1 and v2 revisions in the monitoring namespace will quickly fall off the edge, as new release secrets are generated upon each upgrading reconcile, and we've effectively made Helm's rollback feature useless. Helm controller can only infer from the changing state in any inputs, that it should perform an upgrade.


I'm not sure if philosophically this can mesh with the "no imperative magic" philosophy of GitOps, that expects all changes to be declarative, but if there was a safe place to put such an annotation in the spec that was guaranteed not to have any other side-effect on anything down-stream other than to force a single reconciliation that would definitely result in an Upgrade as long as it succeeds, it would certainly make sense for the CLI to include a way to trigger that behavior. Part of the problem is that you can't guarantee nobody is watching.

We usually recommend people who are struggling with this issue to find the remediationStrategy in the spec for install and upgrade, and try fine-tuning it. The default behavior is to rollback when upgrade fails, but it is not very tolerant of failures on install, as it can not roll back to a previous release in the event of install failure; users may want to adjust the timeout upwards or downwards so that failures are less touchy (or don't wait so long, the default 5m), or to retry a fixed number of times even when install fails, but those behaviors are not the defaults to avoid an unexpected interaction from trying again/an infinite number of times until success causing havoc.

https://fluxcd.io/docs/components/helm/helmreleases/#configuring-failure-remediation

The least wrong way to approach this now, is to include some values from a configMap that will not be read by the chart:

https://fluxcd.io/docs/guides/helmreleases/#refer-to-values-in-configmaps-generated-with-kustomize

and update the configmap with a new nonce when you want to do the force-upgrade.

Some users have reported they can trigger an upgrade from the failed state when running flux suspend helmrelease <hr> and flux resume helmrelease <hr> but this is not reliable in my experience, depending too much on the type of failed state that your helmrelease landed in.


Understanding that none of this directly addresses the original topic "why doesn't Helm Controller resolve drift in my cluster" it seems that many people who are asking for that behavior, really are willing to use a "Closed Loop" and don't expect drift to come into their cluster from random places, but they are just looking for infinite retries and failure remediation, and those are all options which can be configured... though it is not the default behavior to retry infinitely, and for some very good reasons.

@pkit
Copy link

pkit commented Apr 1, 2022

philosophically

My philosophical answer so far: rewrite all the charts (using helm template + manual changes) into kustomizations.

@vl-shopback
Copy link

if fluxcd works like this, then how does flagger handle the blue/green deployment?

@kingdonb
Copy link
Member

kingdonb commented Apr 4, 2022

Flagger uses Git to facilitate rollback, rather than Helm.

https://github.com/fluxcd/flagger/blob/main/docs/gitbook/tutorials/canary-helm-gitops.md

Although you can use Flagger and Helm together, the canary in this example is inside of the Helm chart, not around or outside of the Helm Release. So when the canary fails, the traffic continues routing to the old primary which is not affected or upgraded (it's the prior state of the Helm release's deployments, associated refs from the canary spec, HPAs, etc.).

You move on to the next state the same as with any Flagger canary, by either git-reverting and allowing a new canary to run again with the old release, or perhaps more likely and more beneficial, by letting the Canary keep the stable release going, and pushing another release after this with the fix to whatever issue you observed in the canary.

I don't think you really would likely want to configure automatic rollback or other failure remediations together with Flagger. Actually, I'm pretty sure you'd want to disable remediation in that case.

@hiddeco hiddeco changed the title Reconcile resources in HelmChart periodically if updated by user? Cluster-state manifest drift correction Jun 7, 2022
@hiddeco hiddeco added the enhancement New feature or request label Jun 7, 2022
@hiddeco hiddeco self-assigned this Jun 7, 2022
@stefanprodan stefanprodan pinned this issue Jun 7, 2022
@hiddeco
Copy link
Member

hiddeco commented Jun 7, 2022

This is on my to-do list as part of an overhaul of the general logic of the reconciler. Given the changes that at times occur to my day-to-day priorities, I can not provide an ETA at this time other than "next (next) minor release with more bigger changes".

@antaloala
Copy link

antaloala commented Jun 21, 2022

If k8s kustomize supported a new option in configMapGenerator to instruct "kustomize build" to set a new .metadata.name for the generated ConfigMap object each time it is invoked, even when the ConfigMap's .data remains the same (e.g. adding some timestamp suffix to the ConfigMap object name)) that would really help here: getting a HelmRelease reconciliation each time the Flux2 Kustomization manifest wrapping the HelmRelease objct is reconciled (as set by its .spec.interval)

  • I am assuming here the Flux2 HelmRelease gets the deployment specific values from that k8s kustomize generated ConfigMap, as declared in the k8s kustomization.yaml pointed from the flux2 kustomization manifest (using this new option that would be provided by k8s kustomize).

This would require low effort from Flux2, i.e. compiling it against the new kustomize pkgs (provided from k8s guys) supporting that new configMapGenerator option.

Could this be a possible solution to try here?

@kingdonb
Copy link
Member

kingdonb commented Jun 21, 2022

If it's a new feature from Kustomize upstream, I'd imagine it will be supported by Flux if possible, but if it acts as you describe I think it's not going to have the intended effect, instead something more like a new deployment rev or new instance of the job every single spec.interval – that would be less than ideal.

I guess that you wanted this in order to achieve drift correction with Helm only with the cadence of Kustomize controller's reconciliation, since that is the topic here. Hypothetically it would be possible, with some of the memory and performance advancements I think Helm controller today is a lot better equipped for this than Helm controller at any time from 3-12 months ago thanks to caching and other advances. Not sure this proposed approach is the way to go about it though.

I'm not sure if the result will likely be good for Helm secret storage as the retention will run any meaningful changes off the history in short order with an endless stream of near-noop releases, having one being promoted every interval. The design of the controller is intentional about not running up the release counter when nothing really has changed, so in the event it's needed, Helm rollbacks from Helm secret will still be meaningful and useful more than 10 minutes after the fact of a release.

Since there are changes, and revisions are immutable, the new configmap in the definition of each container will likely also trigger a rolling of the deployment or a retrigger of the job if any, etc. for various reasons I think this solution as you propose is not ideal, a better solution would have ideally found a way to achieve drift detection without this side-effect of spamming the release history and potentially also with a better awareness of the Helm lifecycle hooks (maybe even user-facing control of whether re-executions should retrigger the hooks or not, I don't know how granularly this is possible or even if at all).

I'm not sure how much of that's planned now, but it is in the milestone for Flux 2.0 GA so I think broadly we can count on this hopefully being resolved in a way that is a bit more ergonomic and no-surprises than adding a purely-random-suffix feature from Kustomize upstream.

@antaloala
Copy link

antaloala commented Jun 22, 2022

Yes, I am looking for an automated way (i.e. without the need for any manual "git commit" to be applied) to make the .spec of a Flux2 HelmRelease object to "change" so the helm-controller has to reconcile that HelmRelease object and so handle the possible drift correction (e.g. someone applied kubectl delete command on an API object "owned" by the related Helm release).

If I run a helm upgrade command on an existing Helm release (in a given namespace) passing (1) the same Helm chart currently applied in that Helm release with (2) the exact deployment specific Helm param settings (i.e. values) I used in my previous helm upgrade --install command => Yes I get a new revision of the Helm release (reflected in the new Helm related k8s Secret object) BUT no any patch is applied by Helm's k8s API client on the Helm release owned k8s API objects (as the rendered yamls are the same) ... except if someone manually modified the Helm released owned API object (with those kubectl commands)

Note the ConfgMap I proposes to get a new name for each time flux2 Kustomization is reconciled (as instructed by its .spec.interval) is the one referred from the k8s HelmRelease manifest wrapped by that flux2 Kustomization manifest => so no any Deployment/StatefulSet/DaemonSet's pods in the scope of the Helm release will be re-created if the content of that ConfigMap is the same as it was in the previous flux2 Kustomization reconciliation (as the Helm param settings to apply for Helm rendering are the same).
Only the pre/post upgrade Jobs will be triggered ... and I would expect they to be "prepared" for not doing/changing anything when not really needed (if doing it means a poor pre/post upgrade Helm hooks design as that Helm charts is not prepared to run a helm upgrade on the same chart and values nowadays deployed). But I understand we should not avoid/preclude potential new features because of poor design of some Helm charts ...
But, anyway, Flux2 helm-controller could still cope with this: when realizing the helm upgrade is to be run on the same Helm chart currently deployed with the exact same Helm param settings previously applied: it could automatically run the upgrade without invoking the pre/post hooks (the --no-hooks option supported in helm install/upgrade/rollback/uninstall commands that I presume is also available in the Helm SDK, I did not check that ...).

P.S: Just to clarify: I do not know about any intention from k8s Kustomize guys to add something like this in the configMapGenerator (it is nothing more than a "possible request" to them).

@antaloala
Copy link

antaloala commented Jun 22, 2022

Actually I can simulate all this adding, to the k8s kustomization.yaml's configMapGenerator two files:

  1. the (being there from the beginning) "values.yaml" file including the deployment specific Helm param settings (values) to apply at helm rendering time
  2. another (just added) "discard.yaml" I use to force the configMapGenerator to generate a new ConfigMap object when its content is change (without touching the values.yaml used above)

As the HelmRelease .spec.valuesFrom points to a different ConfigMap each time I change (and git commit) the content of "discard,yaml" (once the flux2 kustomization reconciliation runs) => a new reconciliation is run by the helm-controller on that updated HelmRelease object and so kubectl changes applied on the API objects "owned" by that Helm release are "repaired" even when keeping untouched the referenced Helm chart version (HelmRelease object's .spec.chart.spec stanza) and keeping untouched the content of the values.yaml file to apply.
If I instead change whatever other file in the git repo that is not referred from the k8s kustomization.yaml placed in the git repo path pointed from the flux2 Kustomization manifest -> the Helmrelease reconciliation does not run (as expected, as the HelmRelease .spec did not change) so the potential drift is not "repaired".

The k8s Kustomize configMapGenerator feature commented above would be only a "trick" to be able to get the "desired" change in the HelmRelease object .spec stanza without having to apply any manual step (i.e. git commit ...).

@antaloala
Copy link

@kingdonb you wrote above:

The design of the controller is intentional about not running up the release counter when nothing really has changed, so in the event it's needed, Helm rollbacks from Helm secret will still be meaningful and useful more than 10 minutes after the fact of a release.

I do not see how this proposed behavior could spoil the current installRemediation / upgradeRemediation behavior .... (???)

@antaloala
Copy link

antaloala commented Jun 22, 2022

Additionally, as part of the tests I am running, I just modified the .spec.interval of a HelmRelease object and it triggered a reconciliation from the helm-controller (I understand triggered from the mismatch between the HelmRelease object's .metadata.genetion and the .status.observedGeneration) what makes sense to me => a helm upgrade was applied using the exact same Helm chart and values as currently deployed (as I did not change anything else).

But then all the arguments against running a HelmRelease reconciliation when chart and values are kept the same could apply here ... but I think we all agree the reconciliation applied by the helm-controller because of whatever change in the .spec stanza is right in this case. Why then it would be wrong to do the same periodically (if so set/requested in an enhanced HelmRelease object) to correct possible drifts and so getting a perfect also at Helm level GitOps handling from Flux2?

@kingdonb
Copy link
Member

kingdonb commented Jun 24, 2022

the current installRemediation / upgradeRemediation behavior

This is not the only thing we have to worry about nor the only functional use of the helm history. Strict compatibility with the Helm client is a highly important goal for Helm Controller and we have been able to preserve that strong compatibility until now. We have this guarantee for users that they can come and go from Helm Controller as they please, and no other GitOps tool offers this today; the Helm history is maintained exactly as if you created a release using the helm CLI and upgrades from Helm Controller look exactly like upgrades from a Helm CLI user, they are functionally indistinguishable.

Users can execute a manual rollback after the release is marked Succeeded (differs from the built-in remediation which only executes when a release is Failed) and they will get back to the state they were in before the upgrade. That's what I'm suggesting would break if we simply run the helm upgrade logic in a loop every spec.interval – the helm history keeps the result of each release in a secret of type helm.sh/release.v1 and you get a new revision each time an upgrade is executed, whether it changes anything or not.

NAME                               TYPE                 DATA   AGE
sh.helm.release.v1.my-release.v1   helm.sh/release.v1   1      15d
sh.helm.release.v1.my-release.v2   helm.sh/release.v1   1      15d
sh.helm.release.v1.my-release.v3   helm.sh/release.v1   1      15d
sh.helm.release.v1.podinfo.v1      helm.sh/release.v1   1      15d
sh.helm.release.v1.podinfo.v2      helm.sh/release.v1   1      15d
sh.helm.release.v1.podinfo.v3      helm.sh/release.v1   1      15d

Drift detection should not break the function of helm history or a manual helm rollback which allows users to manage releases manually after suspending the HelmRelease, so their changes are not overwritten. Sure users can do rollbacks in the GitOps way, but we have historically always protected the capability of reversion to the old way of managing HelmReleases during an incident or when there is some other need.

If Helm Controller is going to perform drift detection and preserve the state of each release according to GitOps, it must not break this invariant in the process. That's all I'm imputing here, I am not the expert on this and I'm simply commenting as a passenger, not as a decision maker. I think the decision has already been made to implement this.

I do not know the details of those decisions. However, I know what will go wrong if we simply run upgrade on every interval; please take a look at --history-max in the Helm manual (or MaxHistory in the Flux API docs) Helm will only maintain the most recent 10 releases in history by default. History that is older than 10 releases will not be saved.

We had a bit of discussion about this at the most recent Flux Dev Meeting in public, it was recorded, and should appear on https://www.youtube.com/channel/UCoZxt-YMhGHb20ZkvcCc5KA soon (but I see we are a bit behind our backlog for uploading these recordings.) The short version is, the feature is planned to be implemented, and there are these hurdles.

@antaloala
Copy link

antaloala commented Jun 24, 2022

Thanks @kingdonb for the time you are putting on this (really appreciated ;-)

Yes, I know about the per-Helm_release_revision k8s Secret objects and the --history-max limits, things you mention above being the drawbacks/cons from my proposal ... they being like the "price to pay" (at least in my mind) in order to assure a potential drift (due to some kubectl delete/patch like command applied on API objects owned by the involved Helm release) is solved in the next HelmRelease reconciliation.

But I am more than happy if a coming flux2 (helm-controller) release is solving it in a different way that does not add those additional drawbacks ;-)

Thanks again for sharing your view and future flux2/helm-controller plans !

@sojinss4u
Copy link

sojinss4u commented Sep 8, 2022

I am also looking for this feature implemented, as automatic reconciliation was the main feature due to which we started using Flux for our K8s deployments. I came to know about this recently when some one manually scaled down the replica count of a deployment to 0 & flux didn't reconcile it back to the previous state, until we suspend/resume the HelmRelease (would be good if we can clearly call this out in Flux Docs).
@kingdonb @hiddeco Thanks for sharing the plan of implementing this feature in Flux 2.0. I can see that suspending/resuming a HelmRelease does a reconciliation for the corresponding deployment & revert any manual changes done. However it also has all the drawbacks which you mentioned previously for the configMap like loosing HelmRelease versions / triggering LifeCycle hooks etc.

However if I am ok,

  1. Loosing the ability to revert a HelmRelease from cli
  2. Make sure that life-cycle hooks present in the HelmChart will not cause any issues during periodic reconciliation

Do you see any other challenges/unforeseen issues in suspending the HelmReleases & resuming it periodically using a script. Since this is already getting implemented in Flux 2.0, once the feature is out, I can just discard this script & start using Flux native feature. Since we are having lot of deployments which are managed via Flux HelmRelease, manually adding/reverting the configMap option shared above might be difficult (or making sure that everyone remember to add it).

@kingdonb
Copy link
Member

kingdonb commented Sep 9, 2022

I can see that suspending/resuming a HelmRelease does a reconciliation for the corresponding deployment & revert any manual changes done. However it also has all the drawbacks which you mentioned previously for the configMap like loosing HelmRelease versions / triggering LifeCycle hooks etc.

That sounds like a solid understanding of the issues at play 👍

Do you see any other challenges/unforeseen issues in suspending the HelmReleases & resuming it periodically using a script.

Generally the main issue is that someone is making changes outside of the gitops delivery, which is more like what I'd consider a root cause here. Does this happen all the time? I tell people when they're using GitOps, they should treat kubectl apply -f or kubectl edit as a "break glass/emergency" tool.

You should have monitoring in place, such that when this person has scaled down the important deployment, an SRE/op gets notified and someone has to acknowledge the issue. If you have monitoring but it doesn't let you know when this important service has no backends/endpoints then it's an issue that should be resolved on the monitoring side. When you follow the Flux prometheus guide, you get a lot of these building blocks – I'm not sure how one configures Prometheus to let you know when a service has no backends, but I'm sure there must be a way. (Wish I could be more helpful there...)

The script approach doesn't have any drawbacks I can think of other than the ones you've mentioned. There is some performance impact, because each HelmRelease upgrade has to load the chart and will consume a lot of resources while the upgrade is in progress.

I definitely would expect the script to run every (N) hours rather than every (N) minutes for this reason. But if you are on the latest version of Flux, many of the performance issues will have been mitigated. It really depends a lot on how your sources are arranged and whether you've adopted the new type: oci HelmRepository, which is way more efficient than others that came before.

The monitoring that I mentioned will give you an idea of how great is the performance impact, if you've configured it according to the Flux monitoring guide.

This change supporting drift correction in Helm Controller is expected Real Soon Now, so please be prepared to throw the script away very soon. 🥸👩‍💻

@antaloala
Copy link

@kingdonb do you know/have any estimation on when this feature could be in Flux2's helm-controller?

@benjimin
Copy link

benjimin commented Feb 1, 2023

In flux v1, if a deployment spec was known to have been manually edited in the cluster, we would kubectl delete hr/<application> --cascade=false. This would prompt flux to recreate the HelmRelease object in the cluster, triggering Helm to revert any drift with minimal service disruption (and without needing dummy git commits to the HelmRelease manifest).

Is this still the way flux2 works?

@artazar
Copy link

artazar commented Feb 1, 2023

In flux v1, if a deployment spec was known to have been manually edited in the cluster, we would kubectl delete hr/<application> --cascade=false. This would prompt flux to recreate the HelmRelease object in the cluster, triggering Helm to revert any drift with minimal service disruption (and without needing dummy git commits to the HelmRelease manifest).

Is this still the way flux2 works?

@benjimin yes it works too + there's a sequence of flux suspend hr and flux resume hr CLI commands that will also fix the drift.

@kingdonb
Copy link
Member

kingdonb commented Feb 1, 2023

And there is a third way, for when you really don't want to incur downtime, if you have a failed release with upgrade retries exhausted then you can delete the latest (failed) version of the helm secret (which Helm itself uses for release accounting under the hood) and then reconcile the HelmRelease to try again. This works when it's "super stuck" but in general it also works fine to just suspend and resume the helmrelease, that gets you "another try" from the installing reconciler.

@hiddeco
Copy link
Member

hiddeco commented Feb 20, 2023

This feature is planned to be experimentally supported in the next minor release, after which it will be enabled by default once we have landed things like the long due #532 (and follow ups on this). The reason it will be first experimentally supported is that:

  1. I want to ensure Helm itself doesn't cause any weirdness with the already proven approach from kustomize-controller in terms of unexpected behavior when combined with Helm.
  2. We really need to untangle things and improve the ways we work with the Helm storage before we can deem it a reliable source of truth for drift correction, which is where reconcile: allow atomic reconciliation of release #532 comes in.

On a second note, I want to let you know that for the foreseeable future (or unless things of utter priority come up) the helm-controller now has my full attention. Which took some time after we reshaped our roadmap to focus on the Flux Source APIs first (and stop running around like crazy), but means that I am at present "ahead of schedule".

@hiddeco
Copy link
Member

hiddeco commented Mar 10, 2023

Closing this in favor of #643. Please provide (short) feedback there, or create specific issues for problems you're experiencing. Thank you all!

@hiddeco hiddeco closed this as completed Mar 10, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Since Last Dev Meeting in Maintainers' Focus Mar 10, 2023
@hiddeco hiddeco unpinned this issue Mar 10, 2023
@lorenzo-w
Copy link

lorenzo-w commented Dec 29, 2023

flux suspend ..., then flux resume ... will do the trick to recreate the resource which being deleted manually

It seems that above solution doesn't work anymore, but using HelmRelease.spec.driftDetection.mode from #643 worked for me, so it's all fine! :)

Thank you!

@souleb souleb moved this from Since Last Dev Meeting to Done in Maintainers' Focus Feb 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: Done
Development

No branches or pull requests