When a client immediately cancels a task it just submitted, the scheduler will sometimes return a TaskStatus.NotFound message, and log an error message such as:
The bug will systematically occur if the scheduler does not have any worker registered. It can also occur with registered workers, but with a lower probability.
The scheduler maintains two collections for tasks, _unassigned and _running.
Because assign_task_to_worker() might block (if there is no worker) or yield to another async task, the task being scheduled can be removed from _unassigned before it is added to _running. A TaskCancel message arriving in that window raises a task not found error, because the task is in neither of these collections.
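For illustration, here is a minimal asyncio sketch of that window. It is a toy model, not the actual scaler code: the ToyTaskManager class, the _task_ids and _workers queues, and the string statuses are all assumptions made for the example.

```python
import asyncio


class ToyTaskManager:
    """Toy model of the race; names and structure are illustrative only."""

    def __init__(self) -> None:
        self._unassigned = {}             # task_id -> payload, not yet routed
        self._running = {}                # task_id -> worker
        self._task_ids = asyncio.Queue()  # hypothetical queue of submitted ids
        self._workers = asyncio.Queue()   # hypothetical queue of free workers

    async def submit(self, task_id, payload) -> None:
        self._unassigned[task_id] = payload
        await self._task_ids.put(task_id)

    async def routing_step(self) -> None:
        task_id = await self._task_ids.get()
        if task_id not in self._unassigned:
            return  # canceled while still queued
        self._unassigned.pop(task_id)       # the task leaves _unassigned here,
        worker = await self._workers.get()  # this blocks while no worker is free,
        self._running[task_id] = worker     # and only now does it enter _running

    def cancel(self, task_id) -> str:
        if task_id in self._unassigned:
            del self._unassigned[task_id]
            return "Canceled"
        if task_id in self._running:
            del self._running[task_id]
            return "Canceled"
        return "NotFound"  # stand-in for TaskStatus.NotFound


async def demo() -> str:
    manager = ToyTaskManager()
    router = asyncio.create_task(manager.routing_step())
    await manager.submit("task-1", b"payload")
    await asyncio.sleep(0)  # let the router dequeue the task and block on the worker queue
    status = manager.cancel("task-1")
    router.cancel()
    await asyncio.gather(router, return_exceptions=True)
    return status


if __name__ == "__main__":
    print(asyncio.run(demo()))  # NotFound
```

With no worker registered, the router is parked on the worker queue while the task is in neither collection, so the cancel deterministically misses it.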
Possible fix
Adding the task to _running before assigning it to a worker does not work, as it breaks task cancellation: the scheduler does not know to which worker to route the TaskCancel message.
Instead, the scheduler could, in this order:
1. acquire an available worker;
2. wait on the task queue for the next task;
3. immediately add the task to _running;
4. send the Task object to the acquired worker.
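The four steps above could look roughly like the sketch below. It is a toy model under stated assumptions: ReorderedTaskManager, the _task_ids and _workers queues, _send(), and the tuple statuses are hypothetical, and worker release after task completion is left out.

```python
import asyncio


class ReorderedTaskManager:
    """Toy model of the proposed ordering; not the actual scaler code."""

    def __init__(self) -> None:
        self._unassigned = {}             # task_id -> payload
        self._running = {}                # task_id -> worker
        self._task_ids = asyncio.Queue()  # hypothetical queue of submitted ids
        self._workers = asyncio.Queue()   # hypothetical queue of free workers
        self.sent = []                    # records sends instead of doing network I/O

    async def add_worker(self, worker) -> None:
        await self._workers.put(worker)

    async def submit(self, task_id, payload) -> None:
        self._unassigned[task_id] = payload
        await self._task_ids.put(task_id)

    async def _send(self, worker, task_id, payload) -> None:
        self.sent.append((worker, task_id))

    async def run(self) -> None:
        while True:
            worker = await self._workers.get()     # 1. acquire an available worker
            task_id = await self._task_ids.get()   # 2. wait for the next task
            while task_id not in self._unassigned:  # skip tasks canceled while queued
                task_id = await self._task_ids.get()
            payload = self._unassigned.pop(task_id)
            self._running[task_id] = worker        # 3. no await between pop and insert
            await self._send(worker, task_id, payload)  # 4. send to the acquired worker

    def cancel(self, task_id):
        if task_id in self._unassigned:
            del self._unassigned[task_id]
            return ("Canceled", None)
        if task_id in self._running:
            # The real scheduler would route TaskCancel to this worker.
            return ("Canceled", self._running.pop(task_id))
        return ("NotFound", None)


async def demo():
    manager = ReorderedTaskManager()
    router = asyncio.create_task(manager.run())
    await manager.submit("task-1", b"x")
    await asyncio.sleep(0)
    no_worker = manager.cancel("task-1")    # still in _unassigned: no worker yet
    await manager.add_worker("worker-A")
    await manager.submit("task-2", b"y")
    for _ in range(5):
        await asyncio.sleep(0)              # give the router time to route task-2
    with_worker = manager.cancel("task-2")  # now in _running, worker is known
    router.cancel()
    await asyncio.gather(router, return_exceptions=True)
    return no_worker, with_worker
```

Because the task only leaves _unassigned once a worker is already in hand, and is inserted into _running without an intervening await, a cancel always finds it in one of the two collections.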
The problem is due to the scheduler waiting on two blocking queues: the task queue and the worker queue.
@sharpener6 As I said, I'd like to make the worker queue non-blocking while implementing task tagging (#32). This will remove this bug, and will simplify task scheduling.
This requires a small behavior change: when no worker is connected to the scheduler, the client will receive a TaskStatus.NoWorker error, while currently the task is queued until a worker connects.
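A rough sketch of what the non-blocking variant might look like, assuming a simple in-memory list of free workers. NonBlockingTaskManager, the "Running" status, and the tuple return values are made up for the example; only the NoWorker and NotFound outcomes come from the discussion above.

```python
import collections


class NonBlockingTaskManager:
    """Toy model: worker lookup never blocks, so routing has no await points."""

    def __init__(self) -> None:
        self._workers = collections.deque()  # free workers (hypothetical structure)
        self._running = {}                   # task_id -> worker

    def add_worker(self, worker) -> None:
        self._workers.append(worker)

    def submit(self, task_id):
        if not self._workers:
            return ("NoWorker", None)  # stand-in for TaskStatus.NoWorker
        worker = self._workers.popleft()
        self._running[task_id] = worker  # assigned immediately, no intermediate state
        return ("Running", worker)

    def cancel(self, task_id):
        worker = self._running.pop(task_id, None)
        if worker is None:
            return ("NotFound", None)
        return ("Canceled", worker)  # TaskCancel would be routed to this worker


manager = NonBlockingTaskManager()
first = manager.submit("task-1")           # no worker connected yet
manager.add_worker("worker-A")
second = manager.submit("task-2")
canceled = manager.cancel("task-2")
```

In this shape a submitted task is either rejected right away or already in _running, so there is no state in which a cancel can miss it.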
Cause
The bug is in the implementation of the scheduler's task manager routine:
scaler/scaler/scheduler/task_manager.py
Lines 55 to 67 in f4b4daa
The reordering proposed under "Possible fix" requires some refactoring of the scheduler.
EDIT: see #45 (comment) for a more appropriate fix.