Add 'none' update reconciliation strategy #268

Open
dippynark opened this issue May 27, 2021 · 2 comments
dippynark commented May 27, 2021

Originally raised on Slack: https://cloud-native.slack.com/archives/CLAJ40HV3/p1622138440000400

What happened?

We are using the helm-controller's HelmRelease resource (v2beta1) together with the Nginx Ingress Controller. All of the Nginx Ingress Controller Pods were down when we upgraded a chart that included an Ingress resource; the upgrade failed because the call to the Nginx Ingress Controller's webhook failed, and the rollback remediation then failed for the same reason. This left the HelmRelease stuck, even though we have retries: -1 set for upgrade remediation. Looking at the code, it seems that if remediation fails once then the helm-controller stops forever:

// Fail if the previous release attempt remediation failed.

This same situation could happen for any webhook that becomes temporarily unavailable.
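For reference, a HelmRelease configured this way looks roughly like the following (a minimal sketch; chart and source names are placeholders, not our exact manifest):

```yaml
# Minimal sketch of the HelmRelease in question; names are placeholders.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  interval: 10m
  chart:
    spec:
      chart: ingress-nginx
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
  upgrade:
    remediation:
      retries: -1          # keep retrying upgrades indefinitely
      strategy: rollback   # roll back a failed upgrade (this rollback itself failed here)
```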

Suggested solution

I understand that in some cases retrying can be unsafe, but in most cases, in my opinion, it's better to keep trying instead of requiring manual intervention (after all, the chart in question had no issues; it was just an unfortunately timed upgrade).
We thought about changing the remediation strategy to uninstall, but this is not ideal as it would cause downtime for our application (and would also delete CRDs, potentially causing more problems).

Ideally, we could set no remediation strategy and just attempt the upgrade indefinitely (this feels more closely aligned with the GitOps methodology), so I think a good solution would be to implement a none upgrade remediation strategy that always succeeds, allowing upgrade retries to continue.
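To sketch what that could look like in the HelmRelease spec (the none value is hypothetical and does not exist in the current API; rollback and uninstall are the existing strategies):

```yaml
# Hypothetical spec if a 'none' remediation strategy were added.
spec:
  upgrade:
    remediation:
      retries: -1
      strategy: none   # proposed: skip remediation entirely and just retry the upgrade
```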

It seems this was possible using the helm-operator, so maybe there was a reason for removing this feature: https://docs.fluxcd.io/projects/helm-operator/en/stable/helmrelease-guide/rollbacks/#enabling-retries-of-rolled-back-releases
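For comparison, the helm-operator's HelmRelease (helm.fluxcd.io/v1) exposed this along the following lines; the field names are recalled from the linked docs rather than verified, so treat them as approximate:

```yaml
# Approximate helm-operator (helm.fluxcd.io/v1) rollback configuration;
# field names recalled from the linked docs and may not be exact.
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
spec:
  rollback:
    enable: true
    retry: true      # keep retrying upgrades after a rollback
    maxRetries: 5    # cap on the number of retries
```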

In general the controller shouldn't be determining what to do based on its own status; rather, it should use the observed state of the world and the desired state in the spec to make a decision (see the Kubernetes API Conventions). Currently this controller is not able to recreate its status, since it depends on observing the result of a particular action:

func (r *HelmReleaseReconciler) handleHelmActionResult(ctx context.Context,

Alternatives

Create a CronJob that runs helm rollback against HelmReleases whose rollback has failed (a rough sketch follows below). Of course this is not ideal, and remediation logic should stay within the controller.

Run a CronJob to remove conditions for stuck HelmReleases.
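As a rough sketch of the first alternative, assuming an image that ships the helm CLI and hard-coding the affected release (the logic to discover HelmReleases whose rollback has failed, plus the required RBAC, is left out):

```yaml
# Rough sketch of the CronJob workaround: periodically retry the rollback of a
# known-stuck release. Image, schedule and release/namespace names are placeholders.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: retry-failed-rollback
  namespace: flux-system
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: helm-rollback   # needs access to the Helm release storage secrets
          containers:
            - name: helm-rollback
              image: alpine/helm:3.6.0        # placeholder image containing helm
              command:
                - /bin/sh
                - -c
                - helm rollback ingress-nginx --namespace ingress-nginx || true
```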

hiddeco (Member) commented Jul 1, 2021

If I am following you correctly, you want to retry indefinitely without remediating? What would you think of Retry as a strategy name?

If the upgrade fails, do you expect it to requeue on the defined interval, with an exponential retry backoff, or something else?

> It seems this was possible using the helm-operator, so maybe there was a reason for removing this feature: docs.fluxcd.io/projects/helm-operator/en/stable/helmrelease-guide/rollbacks/#enabling-retries-of-rolled-back-releases

If no retries for rolled back releases are enabled, they require human intervention as well to continue the release process.

dippynark (Author) commented Jul 5, 2021

Ideally it could retry with remediation, but currently a failed remediation causes the HelmRelease to get stuck and never retry again, so omitting remediation would be a quick fix for this. In general, the helm-controller should be able to drive towards the desired state despite transient failures; however, the current implementation is very brittle to these kinds of issues.

Retry doesn't make sense to me as a name, since the subsequent upgrade is the retry, not the lack of remediation.

Requeue with exponential backoff, yes. The interval field should just be used to detect config drift, in my opinion.
