[Discuss] Resurrect Zombie Alert tasks #53603

We have identified an issue where an Alert can force itself into a Zombie state, where its underlying task aborts its retry logic and remains in a failed state forever.

The underlying reason is that Alerting doesn't currently use Task Manager's `schedule` (we have an issue for it: https://github.com/elastic/kibana/issues/46001) and instead uses its own internal `schedule`, meaning Task Manager doesn't classify the Alerting Task as a recurring task. As things stand, only recurring tasks are allowed to retry indefinitely in Task Manager.

As I see it, there are two options for addressing this:

1. Address issue https://github.com/elastic/kibana/issues/46001, which is now somewhat more feasible thanks to the work we did to support `runNow`. We could allow TaskManager to claim a task for the sole purpose of updating its `schedule`, which would sidestep the issue we had previously encountered, but could result in long update requests, as Task Manager might have to wait for tasks to become free for claiming. This is still not an ideal solution, but it is a feasible one. (I have toyed with other ideas, such as spawning a Task whose job is to update another task once it becomes free, but this is not simple, as there's potential for multiples of these... taskpocalypse waiting to happen 🤣)


2. We could allow a TaskType to have an infinite number of tries, which would mean that when the task fails, it keeps retrying until it succeeds again. The nice thing is that we already have the mechanism in place in Task Manager to space these attempts out further and further as it continues to fail (it adds 5 minutes per failure). This spacing is reset once the task succeeds, so it is a relatively graceful way of handling this failure mode.
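
To make the retry behaviour in _Option 2_ concrete, here is a minimal sketch of how an unlimited attempt count could combine with the "5 minutes per failure" spacing. All the names here (`TaskTypeSketch`, `TaskStateSketch`, `nextRetryAt`, `recordResult`) are hypothetical and are not Task Manager's actual API, and the exact formula is my assumption based on the description above.

```typescript
// Hypothetical shapes for illustration only; not Task Manager's real types.
interface TaskTypeSketch {
  // Assumption: Infinity stands for "retry forever".
  maxAttempts: number;
}

interface TaskStateSketch {
  // Number of consecutive failures since the last success.
  attempts: number;
}

// Assumed spacing: each consecutive failure adds 5 minutes to the delay.
const RETRY_STEP_MS = 5 * 60 * 1000;

// Returns when the task should next run, or null if it has used up its attempts.
function nextRetryAt(
  type: TaskTypeSketch,
  task: TaskStateSketch,
  now: Date
): Date | null {
  if (task.attempts >= type.maxAttempts) {
    return null; // the task would be left in a failed ("zombie") state
  }
  return new Date(now.getTime() + task.attempts * RETRY_STEP_MS);
}

// A success resets the failure count, and with it the spacing between retries.
function recordResult(task: TaskStateSketch, succeeded: boolean): TaskStateSketch {
  return { attempts: succeeded ? 0 : task.attempts + 1 };
}
```

With `maxAttempts` set to `Infinity`, the first check in `nextRetryAt` never triggers, so under these assumptions the task keeps being rescheduled at growing intervals instead of being abandoned.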

My instinct, due to the looming 7.6 release, is to go with _Option 2_ and then prioritise _Option 1_ for 7.7.
The reason is that I'm not sure we can figure out _Option 1_ in time for 7.6. As things stand, though, we have two different implementations of scheduling for things being run by TM (one in TM itself and another in Alerting), which makes things complicated and harder to maintain, so we would still want to address this at some point.

Any thoughts?
