Avoiding `another operation (install/upgrade/rollback) is in progress` errors
#644
🔎 Background
There has been a long-standing issue with Helm releases hanging on `another operation (install/upgrade/rollback) is in progress` errors. This can occur when Helm locks a release in a `pending-*` state while performing an install, upgrade, or rollback, and the controller terminates before the lock is released. The termination can be caused by an OOM kill, or by the Pod the controller runs in being shut down while the timeout of the release exceeds the graceful termination duration the Pod is allowed to take.
🪄 Avoiding the issue
To help prevent this issue, we have made two changes in `v0.31.x`.
Helm action cancellation
The Helm action is now canceled as soon as a shutdown signal is received, which instructs Helm to stop the action as soon as it can. As a result, the release ends up in a `failed` state instead of a `pending-*` state.
Near OOM watcher
The controller can now be configured to detect when it is nearing an OOM kill. This is enabled by configuring the Deployment with `--feature-gates=OOMWatch=true`.
When enabled, the controller monitors its memory usage as reported by cgroup and, when it is nearing OOM, attempts to shut down gracefully. Releases that are currently being upgraded will be canceled (resulting in a `failed` release as opposed to a `pending-*` deadlock), and no new releases will be started.
For more information about enabling and configuring this feature, refer to the relevant section in the cheatsheet and the controller option flags.
A bit of advice
All of the above is best combined with a thoughtful configuration of remediation strategies on the `HelmRelease` resources, to ensure that the controller can recover from the failed release.
💡 Future enhancements
In a future release, we will add support for unlocking releases that are in a pending state, as a different approach to handling OOM situations. For this to be possible, we need a guarantee that the release we are unlocking is stuck due to an action we performed ourselves, which is part of a bigger set of changes planned for Q2 2023.
☎️ Request for Feedback
As the (near) OOM watcher is new, we are eager to hear about any issues or suggestions for improvement!