
Refactor release reconciliation #166

Closed · wants to merge 4 commits
Conversation

@hiddeco (Member) commented Nov 26, 2020

No description provided.

@seaneagan (Contributor) left a comment:

Added some initial feedback and questions after a first pass.

controllers/helmrelease_controller_release.go (review threads)
makeRelease, remediation := run.Install, hr.Spec.GetInstall().GetRemediation()
successReason, failureReason := v2.InstallSucceededReason, v2.InstallFailedReason
if rls != nil {
	makeRelease, remediation = run.Upgrade, hr.Spec.GetUpgrade().GetRemediation()
Contributor:

In the case of a failed install which was not remediated (uninstalled), I believe this will allow an upgrade even though the install failed, which is not what we want.

@hiddeco (Member Author) Dec 8, 2020:

If we change it to rls != nil && hr.Status.LastSuccessfulReleaseRevision > 0, the install strategy is picked for HelmRelease resources that are "taking over" a release, and may result in an accidental wipe of all release data if that operation fails.
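
As an aside for readers following the thread, the trade-off being discussed can be summarised in a small, hedged Go sketch (hypothetical helpers, not the PR's actual code), comparing the check as written with the suggested extra LastSuccessfulReleaseRevision guard:

```go
package main

import "fmt"

// currentVariant mirrors the check in the diff above: any existing release
// selects the upgrade path, otherwise we install.
func currentVariant(releaseExists bool) string {
	if releaseExists {
		return "upgrade"
	}
	return "install"
}

// suggestedVariant adds the LastSuccessfulReleaseRevision guard from the
// review comment; note it sends a "taken over" release (one that exists in
// storage but was never successfully released by this controller) down the
// install path instead.
func suggestedVariant(releaseExists bool, lastSuccessfulRevision int) string {
	if releaseExists && lastSuccessfulRevision > 0 {
		return "upgrade"
	}
	return "install"
}

func main() {
	// A release previously managed by e.g. the Helm Operator, now taken over:
	fmt.Println(currentVariant(true))      // upgrade
	fmt.Println(suggestedVariant(true, 0)) // install; a failed install remediation could then wipe its history
}
```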

controllers/helmrelease_controller_release.go (review threads)
api/v2beta1/helmrelease_types.go (review thread)
v2.HelmReleaseNotReady(hr, meta.ReconciliationFailedReason, "exhausted release retries")
return ctrl.Result{RequeueAfter: hr.Spec.Interval.Duration}, nil
// Our previous remediation attempt failed, skip release to retry.
case hr.Status.LastSuccessfulReleaseRevision > 0 && hr.Status.LastReleaseRevision != hr.Status.LastSuccessfulReleaseRevision:
Contributor:

Suggested change:
-case hr.Status.LastSuccessfulReleaseRevision > 0 && hr.Status.LastReleaseRevision != hr.Status.LastSuccessfulReleaseRevision:
+case hr.Status.LastReleaseRevision != hr.Status.LastSuccessfulReleaseRevision:

This handles the case of a failed install and a subsequent failed uninstall remediation.

controllers/helmrelease_controller_release.go (review thread)
@@ -97,12 +110,13 @@ func (r *Runner) Rollback(hr v2.HelmRelease) error {
rollback.Force = hr.Spec.GetRollback().Force
rollback.Recreate = hr.Spec.GetRollback().Recreate
rollback.CleanupOnFail = hr.Spec.GetRollback().CleanupOnFail
rollback.Version = hr.Status.LastSuccessfulReleaseRevision
Contributor:

I think there are some cases where LastSuccessfulReleaseRevision will not be set, since the helm-controller did not create the release:

  1. User manually fixes a helm release, perhaps via a suspend/resume flow.
  2. There is an existing release, such as one previously managed by the Helm Operator, and Helm controller takes it over.

In these cases, I assume we would want to just rollback to the immediately previous revision, as is the default and what we were doing before.

Also, if a previous reconciliation had a failed upgrade which was not rolled back, then do we really want to rollback to a previous release which may have been made a long time ago and may be out of date? I would think it would be safer to just rollback to the immediately previous state in this case as well, and force the user to rollback in e.g. git if they want to rollback to some state earlier than the one which existed immediately before the failed upgrade.

Member Author:

> There is an existing release, such as one previously managed by the Helm Operator, and Helm controller takes it over.

If rollback.Version is set to 0, Helm will roll it back to the previous revision.

> User manually fixes a helm release, perhaps via a suspend/resume flow.

Given this creates "revision drift", it would always trigger a new release, and I would expect this to succeed because of the fixes performed by the user (and later reflected in the chart and/or HelmRelease before resuming).

> Also, if a previous reconciliation had a failed upgrade which was not rolled back, then do we really want to rollback to a previous release which may have been made a long time ago and may be out of date?

I am hesitant about this because of the following:

  1. We have little knowledge of what the contents of the previous revision are, or whether it has been tampered with.
  2. We have little knowledge about the state of the previous revision before the current revision was created (was it e.g. StatusFailed?), as any "previous" revision gets a StatusSuperseded from Helm.
  3. Our "revision bookkeeping" model would make less sense, as we would detect mutations but allow rollbacks to the previous state if an upgrade on top of this detection fails. Given we then mark the revision of the rolled back release as the new revision, the controller would basically be verifying state it did not create itself.


// Observe the last release. If this fails, we likely encountered a
// transient error and should return it to requeue a reconciliation.
rls, err := run.ObserveLastRelease(hr)
Contributor:

Does this still account for out-of-band helm storage updates e.g. via manual helm commands? It seems like it is now only aware of the releases made by the controller itself, in which case we can't be sure it's really the latest release.

Member Author:

ObserveLastRelease retrieves the last release from the Helm storage, and this may return a manually made release; the "cached" version is returned by GetLastPersistedRelease.

Contributor:

Ah, I got confused between ObserveLastRelease and GetLastObservedRelease, whose names make it seem like they are more closely related than they are.
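
For readers less familiar with the observer pattern in play, a hedged sketch of the idea (illustrative names, not the actual internal/storage/observer.go API in this PR): queries for the latest release still go to the Helm storage, while the wrapper remembers the last release it persisted itself.

```go
package releaseobserver

import (
	"helm.sh/helm/v3/pkg/release"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// observingDriver wraps a Helm storage driver: reads still hit the real
// storage (and therefore see out-of-band releases), while writes made through
// it are additionally remembered as the last persisted release.
type observingDriver struct {
	driver.Driver
	lastPersisted *release.Release
}

// Create persists the release through the wrapped driver and records it as
// the last release this process made.
func (o *observingDriver) Create(key string, rls *release.Release) error {
	if err := o.Driver.Create(key, rls); err != nil {
		return err
	}
	o.lastPersisted = rls
	return nil
}

// LastPersisted returns the cached copy recorded by Create; fetching the
// latest release from the Helm storage itself remains a separate query
// against the wrapped driver.
func (o *observingDriver) LastPersisted() *release.Release {
	return o.lastPersisted
}
```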

controllers/helmrelease_controller.go (review thread)

// Delete deletes a release or returns driver.ErrReleaseNotFound.
func (o *Observer) Delete(key string) (*release.Release, error) {
	return o.Driver.Delete(key)
Contributor:

Do we want to set o.release = nil here?

@hiddeco (Member Author) Dec 2, 2020:

Likely not, because that would result in garbage collection of the cached release when the max number of revisions is reached, if we do not somehow take the key into account. Sadly, the knowledge about the key format is private, which is why I was hesitant to utilize it.

What we may be able to do is store the o.release as a key/value pair; that may also be an enabler for #166 (comment) if we add an additional deleted flag, keeping the release in cache while maintaining knowledge about the current state in the Helm storage.
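
Purely as an illustration of the key/value idea floated above (hypothetical types, not the observer implementation in this PR), the cache could record the last observed release per storage key together with a deleted marker:

```go
package releasecache

import (
	"sync"

	"helm.sh/helm/v3/pkg/release"
)

// observedRelease is a hypothetical cache entry: the last release seen for a
// storage key, plus whether it has since been deleted from the Helm storage.
type observedRelease struct {
	release *release.Release
	deleted bool
}

// releaseCache sketches the key/value bookkeeping suggested in the comment above.
type releaseCache struct {
	mu      sync.Mutex
	entries map[string]observedRelease
}

func newReleaseCache() *releaseCache {
	return &releaseCache{entries: make(map[string]observedRelease)}
}

// Observe records the latest state seen for a key.
func (c *releaseCache) Observe(key string, rls *release.Release) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = observedRelease{release: rls}
}

// MarkDeleted keeps the cached release around, but flags that the Helm
// storage no longer holds it, instead of dropping the entry outright.
func (c *releaseCache) MarkDeleted(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok {
		e.deleted = true
		c.entries[key] = e
	}
}
```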

internal/storage/observer.go (review threads)
// If this release was already marked as successful,
// we have nothing to do.
if hr.Status.LastReleaseRevision == hr.Status.LastSuccessfulReleaseRevision {
	return ctrl.Result{}, nil
Contributor:

Shouldn't we always be setting requeueAfter: hr.Spec.Interval.Duration here and elsewhere when there is nothing to do, to ensure we don't end up using whatever the default value is? Or alternatively should we override to that duration after running each step if there were no errors and the requeueAfter is zero?

Member Author:

The latter has my preference, I think.
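
A minimal sketch of what that latter option could look like (assumed helper name, not what the PR implements): normalise every returned result so a successful step without an explicit requeue falls back to the HelmRelease interval.

```go
package main

import (
	"fmt"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// ensureRequeue is a hypothetical helper for the second option discussed
// above: if a step ran without error and did not ask for an explicit requeue,
// fall back to the HelmRelease interval instead of controller-runtime's default.
func ensureRequeue(res ctrl.Result, err error, interval time.Duration) (ctrl.Result, error) {
	if err == nil && !res.Requeue && res.RequeueAfter == 0 {
		res.RequeueAfter = interval
	}
	return res, err
}

func main() {
	res, _ := ensureRequeue(ctrl.Result{}, nil, 5*time.Minute)
	fmt.Println(res.RequeueAfter) // 5m0s
}
```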

Signed-off-by: Hidde Beydals <[email protected]>
@seaneagan (Contributor) left a comment:

Still going through this, but if I understand correctly (I didn't actually test), the install/upgrade retry exhaustion no longer applies to test and remediation actions. So we can now have infinite retries of remediation actions (uninstall/rollback), which could cause the release revision to rev to infinity in the case of repeatedly failed rollbacks. Also, when a release is not remediated, either intentionally (remediateLastFailure=false) or because of a failed remediation, this can lead to infinite test retries as well, since LastSuccessfulReleaseRevision != LastReleaseRevision. That could silently ignore test failures when a subsequent run succeeds, which may be OK, but one probably wants a way to prevent it, and it may also cause problems if one is not using the appropriate hook deletion policies on their Helm tests. It also seems like we could be testing an old release we rolled back to instead of the current desired state.

}

func (r *HelmReleaseReconciler) reconcileTest(ctx context.Context, log logr.Logger, run *runner.Runner, hr *v2.HelmRelease) (ctrl.Result, error) {
// If this release was already marked as successful,
Contributor:

Suggested change:
-// If this release was already marked as successful,
+// If the last release made by the controller was already marked as successful,

I think this makes it clearer that the release may have been from a previous reconciliation, not necessarily this one, since there may have been no install/upgrade needed (due to no changes or exhausted retries), or the install/upgrade may have failed without creating a release due to e.g. a chart rendering issue.

@hiddeco (Member Author) commented Dec 2, 2020

@seaneagan noted, and thank you for taking the time.

I am open to suggestions for improvements and/or alternative approaches, as the issues you have brought up do seem valid (but preventable) at this moment (although it is late here, and I need to reweigh them again tomorrow with fresh energy).

Given the goal of this PR is to improve testability, my next item on the list is to create Go tests for the reconciler actions so that we can properly test and cover the uncertainties you describe.

The first step to achieve this is to untangle some of the Runner code to make it possible to inject Helm's fake kube client and its memory-based storage driver; after that it should be fairly easy to mock state and confirm the reconciliation outcome.

Besides this, I will make sure to rewrite some of the in-code doc blocks based on your comments and questions, as they will probably help others whenever they are trying to understand how it all works.
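
For context on what that injection usually looks like in Helm-based code, here is a minimal sketch (not this PR's eventual test code) of an action configuration built on Helm's fake kube client and in-memory storage driver:

```go
package main

import (
	"io"
	"log"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chartutil"
	kubefake "helm.sh/helm/v3/pkg/kube/fake"
	"helm.sh/helm/v3/pkg/storage"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// newTestActionConfig builds an action.Configuration backed entirely by fakes:
// releases are stored in memory and "applied" by a printing kube client, so a
// Runner wired to it can be exercised in plain Go tests without a cluster.
func newTestActionConfig() *action.Configuration {
	return &action.Configuration{
		Releases:     storage.Init(driver.NewMemory()),
		KubeClient:   &kubefake.PrintingKubeClient{Out: io.Discard},
		Capabilities: chartutil.DefaultCapabilities,
		Log:          func(format string, v ...interface{}) {},
	}
}

func main() {
	cfg := newTestActionConfig()
	// e.g. hand cfg to a Runner constructor that accepts an injected
	// action.Configuration (the point of this commit), then drive installs
	// and upgrades against the in-memory storage.
	log.Printf("in-memory release storage initialised: %v", cfg.Releases != nil)
}
```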

return ctrl.Result{RequeueAfter: hr.Spec.Interval.Duration}, nil
// We have exhausted our retries.
case remediation.RetriesExhausted(*hr):
v2.HelmReleaseNotReady(hr, meta.ReconciliationFailedReason, "exhausted release retries")
Contributor:

Previously we were reflecting the Released condition reason here, which is especially useful for tooling that looks specifically at the Ready reason and outputs its reason/message, such as kstatus and kpt.

This makes it possible to inject a custom Helm action configuration,
which is useful for tests where you want to be able to inject a (fake)
test client and/or different (memory) storage driver.

Signed-off-by: Hidde Beydals <[email protected]>
@hiddeco (Member Author) commented Dec 3, 2020

> the install/upgrade retry exhaustion now no longer applies to test and remediation actions

It still does? https://github.com/fluxcd/helm-controller/blob/release-refactor/controllers/helmrelease_controller_release.go#L276-L280

> Also when a release is not remediated, either intentionally (remediateLastFailure=false), or has a failed remediation, this can lead to infinite test retries as well, since LastSuccessfulReleaseRevision != LastReleaseRevision, which could lead to silently ignoring test failures when a subsequent one succeeds, which may be ok but one probably wants a way to prevent that.

I think this is actually all captured by the state observations we do on the release object itself? https://github.com/fluxcd/helm-controller/blob/release-refactor/controllers/helmrelease_controller_release.go#L183-L200

@seaneagan (Contributor) commented:

> the install/upgrade retry exhaustion now no longer applies to test and remediation actions

> It still does? https://github.com/fluxcd/helm-controller/blob/release-refactor/controllers/helmrelease_controller_release.go#L276-L280

Sorry, not sure how I missed that. One scenario I'm still not sure about: if we have, say, 5 upgrade retries enabled, the first upgrade fails, and then the rollback repeatedly fails, the upgrade retries will be skipped and the release revision will continue to rev to infinity for each new failed rollback. I assume solving this would mean adding a retry limit (whether configurable or not) and a failure count status field for rollbacks (see the sketch after this comment). I don't think uninstall has the same issue, since a failure to uninstall shouldn't rev the release revision, although perhaps it makes sense to stop retrying at some point anyway; not sure.

> Also when a release is not remediated, either intentionally (remediateLastFailure=false), or has a failed remediation, this can lead to infinite test retries as well, since LastSuccessfulReleaseRevision != LastReleaseRevision, which could lead to silently ignoring test failures when a subsequent one succeeds, which may be ok but one probably wants a way to prevent that.

> I think this is actually all captured by the state observations we do on the release object itself? https://github.com/fluxcd/helm-controller/blob/release-refactor/controllers/helmrelease_controller_release.go#L183-L200

Missed that as well, thanks.
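
Purely as an illustration of the retry-limit idea above (entirely hypothetical field names; the v2beta1 API in this PR does not define them), a sketch of how a rollback failure budget could be checked:

```go
package main

import "fmt"

// Hypothetical shapes only, mirroring the existing install/upgrade remediation API.
type rollbackRemediation struct {
	// Retries would cap how often a failed rollback is retried before the
	// HelmRelease is marked as having exhausted its retries.
	Retries int
}

type helmReleaseStatus struct {
	// RollbackFailures would count failed rollback attempts, mirroring the
	// existing install/upgrade failure counters.
	RollbackFailures int64
}

// rollbackRetriesExhausted checks the hypothetical budget in the same spirit
// as the existing remediation exhaustion check.
func rollbackRetriesExhausted(spec rollbackRemediation, status helmReleaseStatus) bool {
	return status.RollbackFailures > int64(spec.Retries)
}

func main() {
	spec := rollbackRemediation{Retries: 3}
	fmt.Println(rollbackRetriesExhausted(spec, helmReleaseStatus{RollbackFailures: 4})) // true
	fmt.Println(rollbackRetriesExhausted(spec, helmReleaseStatus{RollbackFailures: 2})) // false
}
```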

@hiddeco (Member Author) commented May 6, 2022

Closing in favor of the work started in #477. Thank you, the discussion has been food for thought.

@hiddeco hiddeco closed this May 6, 2022
@stefanprodan stefanprodan deleted the release-refactor branch May 8, 2024 08:23