Add 'none' update reconciliation strategy #268

Open
dippynark opened this issue May 27, 2021 · 2 comments
dippynark commented May 27, 2021

Originally raised on Slack: https://cloud-native.slack.com/archives/CLAJ40HV3/p1622138440000400

What happened?

We are using the helm-controller's HelmRelease resource (v2beta1) together with the Nginx Ingress Controller. All of the Nginx Ingress Controller Pods were down when we upgraded a chart that included an Ingress resource; the upgrade failed because the call to the Nginx Ingress Controller's webhook failed, and the rollback remediation then failed for the same reason. This left the HelmRelease stuck, even though we have retries: -1 set for upgrade remediation. Looking at the code, it seems that if remediation fails once then the helm-controller stops forever:

// Fail if the previous release attempt remediation failed.

This same situation could happen for any webhook that becomes temporarily unavailable.
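For reference, a HelmRelease configured this way looks roughly like the following (a minimal sketch; chart and source names are placeholders, not our exact manifest):

```yaml
# Minimal sketch of the HelmRelease in question; names are placeholders.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  interval: 10m
  chart:
    spec:
      chart: ingress-nginx
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
  upgrade:
    remediation:
      retries: -1          # keep retrying upgrades indefinitely
      strategy: rollback   # roll back a failed upgrade (this rollback itself failed here)
```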

Suggested solution

I understand that in some cases retrying can be unsafe, but in most cases, in my opinion, it's better to keep trying instead of requiring manual intervention (after all, the chart in question had no issues; it was just an unfortunately timed upgrade).
We thought about changing the remediation strategy to uninstall, but this is not ideal as it would cause downtime for our application (and would also delete CRDs, potentially causing more problems).

Ideally, we could set no remediation strategy and just attempt the upgrade indefinitely (this feels more closely aligned with the GitOps methodology), so I think a good solution would be to implement a none upgrade remediation strategy that always succeeds, allowing upgrade retries to continue.
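To sketch what that could look like in the HelmRelease spec (the none value is hypothetical and does not exist in the current API; rollback and uninstall are the existing strategies):

```yaml
# Hypothetical spec if a 'none' remediation strategy were added.
spec:
  upgrade:
    remediation:
      retries: -1
      strategy: none   # proposed: skip remediation entirely and just retry the upgrade
```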

It seems this was possible using the helm-operator, so maybe there was a reason for removing this feature: https://docs.fluxcd.io/projects/helm-operator/en/stable/helmrelease-guide/rollbacks/#enabling-retries-of-rolled-back-releases
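For comparison, the helm-operator's HelmRelease (helm.fluxcd.io/v1) exposed this along the following lines; the field names are recalled from the linked docs rather than verified, so treat them as approximate:

```yaml
# Approximate helm-operator (helm.fluxcd.io/v1) rollback configuration;
# field names recalled from the linked docs and may not be exact.
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
spec:
  rollback:
    enable: true
    retry: true      # keep retrying upgrades after a rollback
    maxRetries: 5    # cap on the number of retries
```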

In general the controller shouldn't be determining what to do based on its own status; rather, it should use the observed state of the world and the desired state in the spec to make a decision (see the Kubernetes API Conventions). Currently this controller is not able to recreate its status, since it depends on observing the result of a particular action:

func (r *HelmReleaseReconciler) handleHelmActionResult(ctx context.Context,

Alternatives

Create a CronJob that runs helm rollback against HelmReleases whose rollback has failed (a rough sketch follows below). Of course this is not ideal, and remediation logic should stay within the controller.

Run a CronJob to remove conditions for stuck HelmReleases.
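As a rough sketch of the first alternative, assuming an image that ships the helm CLI and hard-coding the affected release (the logic to discover HelmReleases whose rollback has failed, plus the required RBAC, is left out):

```yaml
# Rough sketch of the CronJob workaround: periodically retry the rollback of a
# known-stuck release. Image, schedule and release/namespace names are placeholders.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: retry-failed-rollback
  namespace: flux-system
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: helm-rollback   # needs access to the Helm release storage secrets
          containers:
            - name: helm-rollback
              image: alpine/helm:3.6.0        # placeholder image containing helm
              command:
                - /bin/sh
                - -c
                - helm rollback ingress-nginx --namespace ingress-nginx || true
```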

hiddeco (Member) commented Jul 1, 2021

If I am following you correctly, you want to retry indefinitely without remediating? What would you think of Retry as a strategy name?

If the upgrade fails, do you expect it to requeue on the defined interval, with an exponential retry backoff, or something else?

> It seems this was possible using the helm-operator, so maybe there was a reason for removing this feature: docs.fluxcd.io/projects/helm-operator/en/stable/helmrelease-guide/rollbacks/#enabling-retries-of-rolled-back-releases

If no retries for rolled back releases are enabled, they require human intervention as well to continue the release process.

dippynark (Author) commented Jul 5, 2021

Ideally it could retry with remediation, but currently a failed remediation causes the HelmRelease to get stuck and never retry again, so omitting remediation would be a quick fix for this. In general, the helm-controller should be able to drive towards the desired state despite transient failures; however, the current implementation is very brittle to these kinds of issues.

Retry doesn't make sense to me as a name, since the subsequent upgrade is the retry, not the lack of remediation.

Requeue with exponential backoff, yes. The interval field should just be used to detect config drift, in my opinion.
