Description
We have identified an issue where an Alert can force itself into a Zombie state, where its underlying task aborts its retry logic and remains in a failed state forever.
The underlying reason is that Alerting doesn't currently use Task Manager's schedule (we have an issue for it #46001) and instead uses its own internal schedule, meaning Task Manager doesn't classify the Alerting Task as a recurring task. As things stand, only recurring tasks are allowed to retry indefinitely in Task Manager.
As I see it, there are two options for addressing this:
1. Address #46001 (Convert alerts to use task manager intervals), which is now somewhat more feasible thanks to the work we did to support `runNow`. We could allow Task Manager to claim a task for the sole purpose of updating its schedule, which would sidestep the issue we had previously encountered, but could result in long update requests, as Task Manager might have to wait for tasks to become free for claiming. This is still not an ideal solution (I have toyed with other ideas, such as spawning a Task whose job is to update another task once it becomes free, but this is not simple, as there's potential for multiples of these... taskpocalypse waiting to happen 🤣), but it is a feasible one.
2. We could allow a TaskType to have an infinite number of tries, meaning that when the task fails it will keep retrying until it succeeds again. The nice thing is that we already have the mechanism in place in Task Manager to space these attempts out further and further as the task continues to fail (it adds 5 minutes per failure). This spacing is reset once the task succeeds, so it is a relatively graceful way of handling this failure mode.
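To make Option 2 concrete, here is a rough sketch of the retry decision as I understand it. The type and function names (`TaskSketch`, `TaskTypeSketch`, `retryDelayMs`, `onSuccess`) are hypothetical and are not Task Manager's actual API. It also assumes the delay is simply the number of failed attempts multiplied by 5 minutes, which may not match the real implementation exactly.

```typescript
// Hypothetical shapes for illustration only; these are not Task Manager's real types.
interface TaskTypeSketch {
  // Number of attempts allowed; Infinity models the "infinite tries" proposed in Option 2.
  maxAttempts: number;
}

interface TaskSketch {
  attempts: number; // consecutive failed attempts so far
  interval?: string; // set only for tasks that Task Manager treats as recurring
}

// 5 minutes added per failure (assumed here to scale linearly with the attempt count).
const RETRY_STEP_MS = 5 * 60 * 1000;

// Returns the delay before the next attempt, or null if the task gives up and stays failed.
function retryDelayMs(task: TaskSketch, type: TaskTypeSketch): number | null {
  const isRecurring = task.interval !== undefined;
  const mayRetry = isRecurring || task.attempts < type.maxAttempts;
  return mayRetry ? task.attempts * RETRY_STEP_MS : null;
}

// On success the failure count is reset, which also resets the spacing between retries.
function onSuccess(task: TaskSketch): TaskSketch {
  return { ...task, attempts: 0 };
}
```

With this sketch, an Alerting task that has no `interval` and has used up a finite `maxAttempts` gets `null` (the Zombie state), whereas the same task with `maxAttempts: Infinity` keeps being rescheduled with a growing delay.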
My instinct, given the looming 7.6 release, is to go with Option 2 and then prioritise Option 1 for 7.7.
I'm not sure we can figure out Option 1 in time for 7.6. But as things stand we have two different implementations of scheduling for things run by TM (one in TM itself and another in Alerting), which makes things complicated and harder to maintain, so we would still want to address this at some point.
Any thoughts?