Refactor release reconciliation #166
Conversation
Force-pushed from 96a6345 to bd38919.
Added some initial feedback and questions after a first pass.
makeRelease, remediation := run.Install, hr.Spec.GetInstall().GetRemediation()
successReason, failureReason := v2.InstallSucceededReason, v2.InstallFailedReason
if rls != nil {
	makeRelease, remediation = run.Upgrade, hr.Spec.GetUpgrade().GetRemediation()
In the case of a failed install which was not remediated (uninstalled), I believe this will lead to allowing an upgrade even though the install failed, which is not what we want.
If we change it to `rls != nil && hr.Status.LastSuccessfulReleaseRevision > 0`, the install strategy is picked for `HelmRelease` resources that are "taking over" a release, and may result in an accidental wipe of all release data if that operation fails.
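For illustration, a small sketch of the two guard variants being weighed here (not the final implementation; `UpgradeSucceededReason`/`UpgradeFailedReason` are assumed to mirror the install reasons in the v2 API):

```go
// Variant currently in the PR: any existing release in Helm storage,
// including a failed install that was never remediated, selects the
// upgrade path.
useUpgrade := rls != nil

// Variant discussed above: additionally require a recorded successful
// revision. This keeps a failed install on the install path, but also
// forces an install for releases the controller is "taking over".
useUpgrade = rls != nil && hr.Status.LastSuccessfulReleaseRevision > 0

makeRelease, remediation := run.Install, hr.Spec.GetInstall().GetRemediation()
successReason, failureReason := v2.InstallSucceededReason, v2.InstallFailedReason
if useUpgrade {
	makeRelease, remediation = run.Upgrade, hr.Spec.GetUpgrade().GetRemediation()
	successReason, failureReason = v2.UpgradeSucceededReason, v2.UpgradeFailedReason
}
```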
Force-pushed from e53f377 to f1c11c0.
Force-pushed from f1c11c0 to 560eb6f.
	v2.HelmReleaseNotReady(hr, meta.ReconciliationFailedReason, "exhausted release retries")
	return ctrl.Result{RequeueAfter: hr.Spec.Interval.Duration}, nil
// Our previous remediation attempt failed, skip release to retry.
case hr.Status.LastSuccessfulReleaseRevision > 0 && hr.Status.LastReleaseRevision != hr.Status.LastSuccessfulReleaseRevision:
Suggested change:
- case hr.Status.LastSuccessfulReleaseRevision > 0 && hr.Status.LastReleaseRevision != hr.Status.LastSuccessfulReleaseRevision:
+ case hr.Status.LastReleaseRevision != hr.Status.LastSuccessfulReleaseRevision:
This handles the case of a failed install followed by a failed uninstall remediation.
@@ -97,12 +110,13 @@ func (r *Runner) Rollback(hr v2.HelmRelease) error {
	rollback.Force = hr.Spec.GetRollback().Force
	rollback.Recreate = hr.Spec.GetRollback().Recreate
	rollback.CleanupOnFail = hr.Spec.GetRollback().CleanupOnFail
	rollback.Version = hr.Status.LastSuccessfulReleaseRevision
I think there are some cases where `LastSuccessfulReleaseRevision` will not be set, since the helm-controller did not create the release:

- The user manually fixes a Helm release, perhaps via a suspend/resume flow.
- There is an existing release, such as one previously managed by the Helm Operator, which the helm-controller takes over.

In these cases, I assume we would want to just roll back to the immediately previous revision, as is the default and what we were doing before.

Also, if a previous reconciliation had a failed upgrade which was not rolled back, do we really want to roll back to a previous release which may have been made a long time ago and may be out of date? I would think it would be safer to just roll back to the immediately previous state in this case as well, and force the user to roll back in e.g. git if they want to return to some state earlier than the one which existed immediately before the failed upgrade.
> There is an existing release, such as one previously managed by the Helm Operator, and Helm controller takes it over.

If `rollback.Version` is set to `0`, Helm will roll it back to the previous revision.

> User manually fixes a helm release, perhaps via a suspend/resume flow.

Given this creates "revision drift", it would always trigger a new release, and I would expect that release to succeed because of the fixes performed by the user (and later reflected in the chart and/or `HelmRelease` before resuming).

> Also, if a previous reconciliation had a failed upgrade which was not rolled back, then do we really want to rollback to a previous release which may have been made a long time ago and may be out of date?

I am hesitant about this because of the following:

- We have little knowledge of what the contents of the previous revision are, or whether it has been tampered with.
- We have little knowledge about the state of the previous revision before the current revision was created (was it e.g. `StatusFailed`?), as any "previous" revision gets a `StatusSuperseded` from Helm.
- Our "revision bookkeeping" model would make less sense, as we would detect mutations but allow rollbacks to the previous state if an upgrade on top of this detection fails. Given we then mark the revision of the rolled-back release as the new revision, the controller would basically be verifying state it did not create itself.
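To make the trade-off concrete, here is a minimal sketch (assuming the `LastSuccessfulReleaseRevision` status field this PR introduces, and a hypothetical `configureRollback` helper) in which the rollback target is only pinned when the controller has a successful revision of its own on record; leaving `Version` at `0` falls back to Helm's default of the immediately previous revision:

```go
import (
	"helm.sh/helm/v3/pkg/action"

	v2 "github.com/fluxcd/helm-controller/api/v2beta1"
)

// configureRollback is a hypothetical helper: it only pins the rollback
// target when the controller has recorded a successful revision itself.
// A zero Version makes Helm roll back to the immediately previous revision,
// which preserves the old behaviour for "taken over" or manually fixed
// releases.
func configureRollback(rollback *action.Rollback, hr v2.HelmRelease) {
	rollback.Force = hr.Spec.GetRollback().Force
	rollback.Recreate = hr.Spec.GetRollback().Recreate
	rollback.CleanupOnFail = hr.Spec.GetRollback().CleanupOnFail
	if hr.Status.LastSuccessfulReleaseRevision > 0 {
		rollback.Version = hr.Status.LastSuccessfulReleaseRevision
	}
}
```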
// Observe the last release. If this fails, we likely encountered a
// transient error and should return it to requeue a reconciliation.
rls, err := run.ObserveLastRelease(hr)
Does this still account for out-of-band Helm storage updates, e.g. via manual `helm` commands? It seems like it is now only aware of the releases made by the controller itself, in which case we can't be sure it's really the latest release.
`ObserveLastRelease` retrieves the last release from the Helm storage, and this may return a manually made release; the "cached" version is returned by `GetLastPersistedRelease`.
Ah, I got confused between `ObserveLastRelease` and `GetLastObservedRelease`, whose names make it seem like they are more closely related than they are.
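To illustrate the distinction (a rough sketch with hypothetical field and method names, not the PR's actual implementation): queries against Helm storage always see the latest release, including out-of-band ones, while the "persisted" accessor only returns what the controller last wrote through its own driver:

```go
import (
	"helm.sh/helm/v3/pkg/release"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// Observer wraps a Helm storage driver. The embedded driver always reflects
// the actual contents of Helm storage (including releases made with manual
// helm commands); lastPersisted only tracks what was written through this
// wrapper.
type Observer struct {
	driver.Driver
	lastPersisted *release.Release
}

// Create writes the release to Helm storage and remembers it as the last
// release persisted by the controller.
func (o *Observer) Create(key string, rls *release.Release) error {
	if err := o.Driver.Create(key, rls); err != nil {
		return err
	}
	o.lastPersisted = rls
	return nil
}

// GetLastPersistedRelease returns the cached release, which may be older
// than the latest release in Helm storage.
func (o *Observer) GetLastPersistedRelease() *release.Release {
	return o.lastPersisted
}
```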
// Delete deletes a release or returns driver.ErrReleaseNotFound.
func (o *Observer) Delete(key string) (*release.Release, error) {
	return o.Driver.Delete(key)
Do we want to set `o.release = nil` here?
Likely not, because that would result in garbage collection of the cached release when the max number of revisions is reached, if we do not take the `key` into account somehow. Sadly, the knowledge about the key format is private, which is why I was hesitant to make use of it.

What we may be able to do is store `o.release` as a key/value pair; that may also be an enabler for #166 (comment), if we add an additional `deleted` flag to keep the release in cache while maintaining knowledge about the current state in the Helm storage.
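A minimal sketch of that key/value idea, extending the observer shape sketched earlier (type and field names are illustrative, not the PR's code):

```go
// observedRelease pairs the cached release with a flag recording whether it
// has since been removed from Helm storage.
type observedRelease struct {
	release *release.Release
	deleted bool
}

// Observer caches observed releases per storage key instead of keeping a
// single release pointer.
type Observer struct {
	driver.Driver
	releases map[string]observedRelease
}

// Delete removes the release from Helm storage, but keeps the cached copy
// and marks it as deleted, so the cache retains what the controller last
// observed without being mistaken for live state.
func (o *Observer) Delete(key string) (*release.Release, error) {
	rls, err := o.Driver.Delete(key)
	if err != nil {
		return nil, err
	}
	o.releases[key] = observedRelease{release: rls, deleted: true}
	return rls, nil
}
```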
// If this release was already marked as successful,
// we have nothing to do.
if hr.Status.LastReleaseRevision == hr.Status.LastSuccessfulReleaseRevision {
	return ctrl.Result{}, nil
Shouldn't we always be setting `requeueAfter: hr.Spec.Interval.Duration` here and elsewhere when there is nothing to do, to ensure we don't end up using whatever the default value is? Or, alternatively, should we override to that duration after running each step if there were no errors and the requeueAfter is zero?
The latter has my preference, I think.
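A minimal sketch of that second option (the helper name and its placement are made up for illustration): after each step, fall back to the interval when nothing else asked for a requeue:

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"

	v2 "github.com/fluxcd/helm-controller/api/v2beta1"
)

// ensureRequeueInterval is a hypothetical helper: if a reconciliation step
// succeeded but did not request an explicit requeue, default to requeueing
// after the HelmRelease interval instead of relying on controller defaults.
func ensureRequeueInterval(result ctrl.Result, err error, hr v2.HelmRelease) (ctrl.Result, error) {
	if err == nil && !result.Requeue && result.RequeueAfter == 0 {
		result.RequeueAfter = hr.Spec.Interval.Duration
	}
	return result, err
}
```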
Force-pushed from 560eb6f to c5c58a9.
Still going through this, but if I understand correctly (I didn't actually test), the install/upgrade retry exhaustion now no longer applies to test and remediation actions. So we can now have infinite retries of remediation actions (uninstall/rollback), which could cause the release revision to rev to infinity in the case of repeatedly failed rollbacks. Also, when a release is not remediated, either intentionally (`remediateLastFailure=false`) or because of a failed remediation, this can lead to infinite test retries as well, since `LastSuccessfulReleaseRevision != LastReleaseRevision`. That could lead to silently ignoring test failures when a subsequent run succeeds, which may be OK, but one probably wants a way to prevent it. It may also lead to problems if one is not using the appropriate hook deletion policies on their Helm tests. Finally, it seems like we could be testing an old release we rolled back to instead of the current desired state.
}

func (r *HelmReleaseReconciler) reconcileTest(ctx context.Context, log logr.Logger, run *runner.Runner, hr *v2.HelmRelease) (ctrl.Result, error) {
	// If this release was already marked as successful,
Suggested change:
- // If this release was already marked as successful,
+ // If the last release made by the controller was already marked as successful,
I think this makes it clearer that the release may have been from a previous reconciliation, not necessarily this one: there may have been no install/upgrade needed (due to no changes or exhausted retries), or the install/upgrade may have failed without creating a release due to e.g. a chart rendering issue.
@seaneagan noted, and thank you for taking the time. I am open to suggestions for improvements and/or alternative approaches, as the issues you have brought up do seem valid (but preventable) at this moment (although it is late here, and I need to reweigh them again tomorrow with fresh energy).

Given the goal of this PR is to improve testability, my next item on the list for this PR is to create Go tests for the reconciler actions so that we can properly test and cover the uncertainties you describe. The first step to achieve this is to untangle some of the

Besides this, I will make sure to rewrite some of the in-code doc blocks based on your comments and questions, as they will probably help others whenever they are trying to understand how it all works.
	return ctrl.Result{RequeueAfter: hr.Spec.Interval.Duration}, nil
// We have exhausted our retries.
case remediation.RetriesExhausted(*hr):
	v2.HelmReleaseNotReady(hr, meta.ReconciliationFailedReason, "exhausted release retries")
Previously we were reflecting the `Released` condition reason here, which is useful especially for tooling which looks specifically at the `Ready` reason and outputs its reason/message, specifically kstatus and kpt.
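For reference, a rough sketch of what reflecting the `Released` condition into `Ready` could look like (assuming `apimeta` refers to k8s.io/apimachinery's condition helpers, and that `v2.ReleasedCondition` and `v2.HelmReleaseNotReady` behave as elsewhere in this PR):

```go
// Prefer the reason/message of the Released condition, if present, so that
// kstatus/kpt-style tooling reading the Ready condition surfaces the
// underlying release failure instead of a generic reconciliation failure.
if c := apimeta.FindStatusCondition(hr.Status.Conditions, v2.ReleasedCondition); c != nil {
	v2.HelmReleaseNotReady(hr, c.Reason, c.Message)
} else {
	v2.HelmReleaseNotReady(hr, meta.ReconciliationFailedReason, "exhausted release retries")
}
```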
Force-pushed from e1cf906 to 29a8e9f.
This makes it possible to inject a custom Helm action configuration, which is useful for tests where you want to be able to inject a (fake) test client and/or different (memory) storage driver.

Signed-off-by: Hidde Beydals <[email protected]>
Force-pushed from 29a8e9f to 55febb2.
I think this is actually all captured by the state observations we do on the release object itself? https://github.com/fluxcd/helm-controller/blob/release-refactor/controllers/helmrelease_controller_release.go#L183-L200
Sorry, not sure how I missed that. One scenario I'm still not sure about: if we have, say, 5 upgrade retries enabled, the first upgrade fails, and then the rollback repeatedly fails, the upgrade retries will be skipped, and the release revision will continue to rev to infinity for each new failed rollback. I assume solving this would mean adding a retry limit (whether configurable or not) and a failure count status field for rollbacks. I don't think uninstall has the same issue, since a failure to uninstall shouldn't rev the release revision, although perhaps it makes sense to stop retrying at some point anyway; not sure.
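Purely illustrative, in case it helps frame the idea: a rollback retry limit would need a counter in status and a cap to compare against. None of the fields below exist today; the names are made up:

```go
// Illustrative types only; the real fields would live on the HelmRelease
// spec (remediation strategy) and status.
type rollbackRemediation struct {
	// Retries is the maximum number of rollback attempts; a negative value
	// could mean "retry forever", mirroring the install/upgrade remediation.
	Retries int
}

type releaseStatus struct {
	// RollbackFailures is incremented on every failed rollback.
	RollbackFailures int64
}

// retriesExhausted reports whether rollback remediation should stop,
// preventing the release revision from revving indefinitely.
func retriesExhausted(rem rollbackRemediation, status releaseStatus) bool {
	return rem.Retries >= 0 && status.RollbackFailures > int64(rem.Retries)
}
```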
Missed that as well, thanks.
Closing in favor of the work started in #477. Thank you, the discussion has been food for thought.