
Refactor release reconciliation #166

Closed · wants to merge 4 commits
Conversation

@hiddeco (Member) commented Nov 26, 2020

No description provided.

@seaneagan (Contributor) left a comment:

Added some initial feedback and questions after a first pass.

controllers/helmrelease_controller_release.go (review threads)
makeRelease, remediation := run.Install, hr.Spec.GetInstall().GetRemediation()
successReason, failureReason := v2.InstallSucceededReason, v2.InstallFailedReason
if rls != nil {
	makeRelease, remediation = run.Upgrade, hr.Spec.GetUpgrade().GetRemediation()
Contributor:

In the case of a failed install which was not remediated (uninstalled), I believe this will allow an upgrade even though the install failed, which is not what we want.

@hiddeco (Member Author) Dec 8, 2020:

If we change it to rls != nil && hr.Status.LastSuccessfulReleaseRevision > 0, the install strategy is picked for HelmRelease resources that are "taking over" a release, and may result in an accidental wipe of all release data if that operation fails.
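
As an aside for readers following the thread, the trade-off being discussed can be summarised in a small, hedged Go sketch (hypothetical helpers, not the PR's actual code), comparing the check as written with the suggested extra LastSuccessfulReleaseRevision guard:

```go
package main

import "fmt"

// currentVariant mirrors the check in the diff above: any existing release
// selects the upgrade path, otherwise we install.
func currentVariant(releaseExists bool) string {
	if releaseExists {
		return "upgrade"
	}
	return "install"
}

// suggestedVariant adds the LastSuccessfulReleaseRevision guard from the
// review comment; note it sends a "taken over" release (one that exists in
// storage but was never successfully released by this controller) down the
// install path instead.
func suggestedVariant(releaseExists bool, lastSuccessfulRevision int) string {
	if releaseExists && lastSuccessfulRevision > 0 {
		return "upgrade"
	}
	return "install"
}

func main() {
	// A release previously managed by e.g. the Helm Operator, now taken over:
	fmt.Println(currentVariant(true))      // upgrade
	fmt.Println(suggestedVariant(true, 0)) // install; a failed install remediation could then wipe its history
}
```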

controllers/helmrelease_controller_release.go (review threads)
api/v2beta1/helmrelease_types.go (review thread)
v2.HelmReleaseNotReady(hr, meta.ReconciliationFailedReason, "exhausted release retries")
return ctrl.Result{RequeueAfter: hr.Spec.Interval.Duration}, nil
// Our previous remediation attempt failed, skip release to retry.
case hr.Status.LastSuccessfulReleaseRevision > 0 && hr.Status.LastReleaseRevision != hr.Status.LastSuccessfulReleaseRevision:
Contributor:

Suggested change:
-case hr.Status.LastSuccessfulReleaseRevision > 0 && hr.Status.LastReleaseRevision != hr.Status.LastSuccessfulReleaseRevision:
+case hr.Status.LastReleaseRevision != hr.Status.LastSuccessfulReleaseRevision:

This handles the case of a failed install and a subsequent failed uninstall remediation.

controllers/helmrelease_controller_release.go (review thread)
@@ -97,12 +110,13 @@ func (r *Runner) Rollback(hr v2.HelmRelease) error {
rollback.Force = hr.Spec.GetRollback().Force
rollback.Recreate = hr.Spec.GetRollback().Recreate
rollback.CleanupOnFail = hr.Spec.GetRollback().CleanupOnFail
rollback.Version = hr.Status.LastSuccessfulReleaseRevision
Contributor:

I think there are some cases where LastSuccessfulReleaseRevision will not be set, since the helm-controller did not create the release:

  1. User manually fixes a helm release, perhaps via a suspend/resume flow.
  2. There is an existing release, such as one previously managed by the Helm Operator, and Helm controller takes it over.

In these cases, I assume we would want to just rollback to the immediately previous revision, as is the default and what we were doing before.

Also, if a previous reconciliation had a failed upgrade which was not rolled back, then do we really want to rollback to a previous release which may have been made a long time ago and may be out of date? I would think it would be safer to just rollback to the immediately previous state in this case as well, and force the user to rollback in e.g. git if they want to rollback to some state earlier than the one which existed immediately before the failed upgrade.

Member Author:

> There is an existing release, such as one previously managed by the Helm Operator, and Helm controller takes it over.

If rollback.Version is set to 0, Helm will roll it back to the previous revision.

> User manually fixes a helm release, perhaps via a suspend/resume flow.

Given this creates "revision drift", it would always trigger a new release, and I would expect this to succeed because of the fixes performed by the user (and later reflected in the chart and/or HelmRelease before resuming).

> Also, if a previous reconciliation had a failed upgrade which was not rolled back, then do we really want to rollback to a previous release which may have been made a long time ago and may be out of date?

I am hesitant about this because of the following:

  1. We have little knowledge of what the contents of the previous revision are, or whether it has been tampered with.
  2. We have little knowledge about the state of the previous revision before the current revision was created (was it e.g. StatusFailed?), as any "previous" revision gets a StatusSuperseded from Helm.
  3. Our "revision bookkeeping" model would make less sense, as we would detect mutations but allow rollbacks to the previous state if an upgrade on top of this detection fails. Given we then mark the revision of the rolled back release as the new revision, the controller would basically be verifying state it did not create itself.


// Observe the last release. If this fails, we likely encountered a
// transient error and should return it to requeue a reconciliation.
rls, err := run.ObserveLastRelease(hr)
Contributor:

Does this still account for out-of-band helm storage updates e.g. via manual helm commands? It seems like it is now only aware of the releases made by the controller itself, in which case we can't be sure it's really the latest release.

Member Author:

ObserveLastRelease retrieves the last release from the Helm storage, and this may return a manually made release; the "cached" version is returned by GetLastPersistedRelease.

Contributor:

Ah, I got confused between ObserveLastRelease and GetLastObservedRelease, whose names make it seem like they are more closely related than they are.
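
For readers less familiar with the observer pattern in play, a hedged sketch of the idea (illustrative names, not the actual internal/storage/observer.go API in this PR): queries for the latest release still go to the Helm storage, while the wrapper remembers the last release it persisted itself.

```go
package releaseobserver

import (
	"helm.sh/helm/v3/pkg/release"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// observingDriver wraps a Helm storage driver: reads still hit the real
// storage (and therefore see out-of-band releases), while writes made through
// it are additionally remembered as the last persisted release.
type observingDriver struct {
	driver.Driver
	lastPersisted *release.Release
}

// Create persists the release through the wrapped driver and records it as
// the last release this process made.
func (o *observingDriver) Create(key string, rls *release.Release) error {
	if err := o.Driver.Create(key, rls); err != nil {
		return err
	}
	o.lastPersisted = rls
	return nil
}

// LastPersisted returns the cached copy recorded by Create; fetching the
// latest release from the Helm storage itself remains a separate query
// against the wrapped driver.
func (o *observingDriver) LastPersisted() *release.Release {
	return o.lastPersisted
}
```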

controllers/helmrelease_controller.go (review thread)

// Delete deletes a release or returns driver.ErrReleaseNotFound.
func (o *Observer) Delete(key string) (*release.Release, error) {
	return o.Driver.Delete(key)
Contributor:

Do we want to set o.release = nil here?

@hiddeco (Member Author) Dec 2, 2020:

Likely not, because that would result in garbage collection of the cached release when the max number of revisions is reached, if we do not somehow take the key into account. Sadly, the knowledge about the key format is private, which is why I was hesitant to utilize it.

What we may be able to do is store the o.release as a key/value pair; that may also be an enabler for #166 (comment) if we add an additional deleted flag, keeping the release in cache while maintaining knowledge about the current state in the Helm storage.
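
Purely as an illustration of the key/value idea floated above (hypothetical types, not the observer implementation in this PR), the cache could record the last observed release per storage key together with a deleted marker:

```go
package releasecache

import (
	"sync"

	"helm.sh/helm/v3/pkg/release"
)

// observedRelease is a hypothetical cache entry: the last release seen for a
// storage key, plus whether it has since been deleted from the Helm storage.
type observedRelease struct {
	release *release.Release
	deleted bool
}

// releaseCache sketches the key/value bookkeeping suggested in the comment above.
type releaseCache struct {
	mu      sync.Mutex
	entries map[string]observedRelease
}

func newReleaseCache() *releaseCache {
	return &releaseCache{entries: make(map[string]observedRelease)}
}

// Observe records the latest state seen for a key.
func (c *releaseCache) Observe(key string, rls *release.Release) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = observedRelease{release: rls}
}

// MarkDeleted keeps the cached release around, but flags that the Helm
// storage no longer holds it, instead of dropping the entry outright.
func (c *releaseCache) MarkDeleted(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok {
		e.deleted = true
		c.entries[key] = e
	}
}
```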

internal/storage/observer.go (review threads)
// If this release was already marked as successful,
// we have nothing to do.
if hr.Status.LastReleaseRevision == hr.Status.LastSuccessfulReleaseRevision {
	return ctrl.Result{}, nil
Contributor:

Shouldn't we always be setting requeueAfter: hr.Spec.Interval.Duration here and elsewhere when there is nothing to do, to ensure we don't end up using whatever the default value is? Or alternatively should we override to that duration after running each step if there were no errors and the requeueAfter is zero?

Member Author:

The latter has my preference, I think.
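
A minimal sketch of what that latter option could look like (assumed helper name, not what the PR implements): normalise every returned result so a successful step without an explicit requeue falls back to the HelmRelease interval.

```go
package main

import (
	"fmt"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// ensureRequeue is a hypothetical helper for the second option discussed
// above: if a step ran without error and did not ask for an explicit requeue,
// fall back to the HelmRelease interval instead of controller-runtime's default.
func ensureRequeue(res ctrl.Result, err error, interval time.Duration) (ctrl.Result, error) {
	if err == nil && !res.Requeue && res.RequeueAfter == 0 {
		res.RequeueAfter = interval
	}
	return res, err
}

func main() {
	res, _ := ensureRequeue(ctrl.Result{}, nil, 5*time.Minute)
	fmt.Println(res.RequeueAfter) // 5m0s
}
```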

Signed-off-by: Hidde Beydals <[email protected]>
@seaneagan (Contributor) left a comment:

Still going through this, but if I understand correctly (I didn't actually test), the install/upgrade retry exhaustion no longer applies to test and remediation actions. So we can now have infinite retries of remediation actions (uninstall/rollback), which could cause the release revision to rev to infinity in the case of repeatedly failed rollbacks. Also, when a release is not remediated, either intentionally (remediateLastFailure=false) or because of a failed remediation, this can lead to infinite test retries as well, since LastSuccessfulReleaseRevision != LastReleaseRevision. That could silently ignore test failures when a subsequent run succeeds, which may be OK, but one probably wants a way to prevent it, and it may also cause problems if one is not using the appropriate hook deletion policies on their Helm tests. It also seems like we could be testing an old release we rolled back to instead of the current desired state.

}

func (r *HelmReleaseReconciler) reconcileTest(ctx context.Context, log logr.Logger, run *runner.Runner, hr *v2.HelmRelease) (ctrl.Result, error) {
// If this release was already marked as successful,
Contributor:

Suggested change:
-// If this release was already marked as successful,
+// If the last release made by the controller was already marked as successful,

I think this makes it clearer that the release may have been from a previous reconciliation, not necessarily this one, since there may have been no install/upgrade needed (due to no changes or exhausted retries), or the install/upgrade may have failed without creating a release due to e.g. a chart rendering issue.

@hiddeco (Member Author) commented Dec 2, 2020

@seaneagan noted, and thank you for taking the time.

I am open to suggestions for improvements and/or alternative approaches, as the issues you have brought up do seem valid (but preventable) at this moment (although it is late here, and I need to reweigh them again tomorrow with fresh energy).

Given the goal of this PR is to improve testability, my next item on the list is to create Go tests for the reconciler actions so that we can properly test and cover the uncertainties you describe.

The first step to achieve this is to untangle some of the Runner code to make it possible to inject Helm's fake kube client and its memory-based storage driver; after that it should be fairly easy to mock state and confirm the reconciliation outcome.

Besides this, I will make sure to rewrite some of the in-code doc blocks based on your comments and questions, as they will probably help others whenever they are trying to understand how it all works.
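
For context on what that injection usually looks like in Helm-based code, here is a minimal sketch (not this PR's eventual test code) of an action configuration built on Helm's fake kube client and in-memory storage driver:

```go
package main

import (
	"io"
	"log"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chartutil"
	kubefake "helm.sh/helm/v3/pkg/kube/fake"
	"helm.sh/helm/v3/pkg/storage"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// newTestActionConfig builds an action.Configuration backed entirely by fakes:
// releases are stored in memory and "applied" by a printing kube client, so a
// Runner wired to it can be exercised in plain Go tests without a cluster.
func newTestActionConfig() *action.Configuration {
	return &action.Configuration{
		Releases:     storage.Init(driver.NewMemory()),
		KubeClient:   &kubefake.PrintingKubeClient{Out: io.Discard},
		Capabilities: chartutil.DefaultCapabilities,
		Log:          func(format string, v ...interface{}) {},
	}
}

func main() {
	cfg := newTestActionConfig()
	// e.g. hand cfg to a Runner constructor that accepts an injected
	// action.Configuration (the point of this commit), then drive installs
	// and upgrades against the in-memory storage.
	log.Printf("in-memory release storage initialised: %v", cfg.Releases != nil)
}
```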

return ctrl.Result{RequeueAfter: hr.Spec.Interval.Duration}, nil
// We have exhausted our retries.
case remediation.RetriesExhausted(*hr):
v2.HelmReleaseNotReady(hr, meta.ReconciliationFailedReason, "exhausted release retries")
Contributor:

Previously we were reflecting the Released condition reason here, which is especially useful for tooling that looks specifically at the Ready reason and outputs its reason/message, such as kstatus and kpt.

This makes it possible to inject a custom Helm action configuration,
which is useful for tests where you want to be able to inject a (fake)
test client and/or different (memory) storage driver.

Signed-off-by: Hidde Beydals <[email protected]>
@hiddeco (Member Author) commented Dec 3, 2020

> the install/upgrade retry exhaustion now no longer applies to test and remediation actions

It still does? https://github.com/fluxcd/helm-controller/blob/release-refactor/controllers/helmrelease_controller_release.go#L276-L280

> Also when a release is not remediated, either intentionally (remediateLastFailure=false), or has a failed remediation, this can lead to infinite test retries as well, since LastSuccessfulReleaseRevision != LastReleaseRevision, which could lead to silently ignoring test failures when a subsequent one succeeds, which may be ok but one probably wants a way to prevent that.

I think this is actually all captured by the state observations we do on the release object itself? https://github.com/fluxcd/helm-controller/blob/release-refactor/controllers/helmrelease_controller_release.go#L183-L200

@seaneagan (Contributor) commented:

> the install/upgrade retry exhaustion now no longer applies to test and remediation actions

> It still does? https://github.com/fluxcd/helm-controller/blob/release-refactor/controllers/helmrelease_controller_release.go#L276-L280

Sorry, not sure how I missed that. One scenario I'm still not sure about: if we have, say, 5 upgrade retries enabled, the first upgrade fails, and then the rollback repeatedly fails, the upgrade retries will be skipped and the release revision will continue to rev to infinity for each new failed rollback. I assume solving this would mean adding a retry limit (whether configurable or not) and a failure count status field for rollbacks (see the sketch after this comment). I don't think uninstall has the same issue, since a failure to uninstall shouldn't rev the release revision, although perhaps it makes sense to stop retrying at some point anyway; not sure.

> Also when a release is not remediated, either intentionally (remediateLastFailure=false), or has a failed remediation, this can lead to infinite test retries as well, since LastSuccessfulReleaseRevision != LastReleaseRevision, which could lead to silently ignoring test failures when a subsequent one succeeds, which may be ok but one probably wants a way to prevent that.

> I think this is actually all captured by the state observations we do on the release object itself? https://github.com/fluxcd/helm-controller/blob/release-refactor/controllers/helmrelease_controller_release.go#L183-L200

Missed that as well, thanks.
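
Purely as an illustration of the retry-limit idea above (entirely hypothetical field names; the v2beta1 API in this PR does not define them), a sketch of how a rollback failure budget could be checked:

```go
package main

import "fmt"

// Hypothetical shapes only, mirroring the existing install/upgrade remediation API.
type rollbackRemediation struct {
	// Retries would cap how often a failed rollback is retried before the
	// HelmRelease is marked as having exhausted its retries.
	Retries int
}

type helmReleaseStatus struct {
	// RollbackFailures would count failed rollback attempts, mirroring the
	// existing install/upgrade failure counters.
	RollbackFailures int64
}

// rollbackRetriesExhausted checks the hypothetical budget in the same spirit
// as the existing remediation exhaustion check.
func rollbackRetriesExhausted(spec rollbackRemediation, status helmReleaseStatus) bool {
	return status.RollbackFailures > int64(spec.Retries)
}

func main() {
	spec := rollbackRemediation{Retries: 3}
	fmt.Println(rollbackRetriesExhausted(spec, helmReleaseStatus{RollbackFailures: 4})) // true
	fmt.Println(rollbackRetriesExhausted(spec, helmReleaseStatus{RollbackFailures: 2})) // false
}
```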

@hiddeco (Member Author) commented May 6, 2022

Closing in favor of the work started in #477. Thank you, the discussion has been food for thought.

@hiddeco hiddeco closed this May 6, 2022
@stefanprodan stefanprodan deleted the release-refactor branch May 8, 2024 08:23