Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler incorrectly returns a task not found error #45

Open
rafa-be opened this issue Nov 20, 2024 · 1 comment
Open

Scheduler incorrectly returns a task not found error #45

rafa-be opened this issue Nov 20, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@rafa-be
Copy link
Collaborator

rafa-be commented Nov 20, 2024

When a client immediately cancels a task it just submitted, the scheduler will sometimes return a TaskStatus.NotFound message, and log a error message such as:

[EROR]2024-11-20 22:47:58+0100: cannot find task_id=1ad3d9caa78911ef99be965046454255 in task workers

How to reproduce

The bug will systematically occur if the scheduler does not have any worker registered. It can also occur with registered workers, but with a lower probability.

with Client("tcp://127.0.0.1:60338") as client:
    client.submit(math.sqrt, 16).cancel()

Cause

The bug is in the implementation of the scheduler's task manager routine:

async def routine(self):
task_id = await self._unassigned.get()
if not await self._worker_manager.assign_task_to_worker(self._task_id_to_task[task_id]):
await self._unassigned.put(task_id)
return
self._running.add(task_id)
await self.__send_monitor(
task_id,
self._object_manager.get_object_name(self._task_id_to_task[task_id].func_object_id),
TaskStatus.Running,
)

The scheduler maintains two collections for tasks, _unassigned and _running.

As the assign_task_to_worker() might be blocking (if there is no worker), or yield to another async task, there is a possibility that the task being scheduled is removed from _unassigned but not yet added to _running. A TaskCancel message will thus raise a task not found error as the task is in neither of these collections.

Possible fix

Adding the task to _running before assigning it to a worker does not work, as it breaks task cancellation: the scheduler does not know to which worker to route the TaskCancel message.

Instead, the scheduler could in this order:

1. acquire an available worker;
2. wait on the task queue for the next task;
3. immediately add the task to _running;
4. send the Task object to the acquired worker.

This requires some refactoring of the scheduler.

EDIT: see #45 (comment) for a more appropriate fix.

@rafa-be rafa-be added the bug Something isn't working label Nov 20, 2024
@rafa-be
Copy link
Collaborator Author

rafa-be commented Nov 26, 2024

The problem is due to the scheduler waiting on two blocking queues: the task queue and the worker queue.

@sharpener6 As I said, I'd like to make the worker queue non-blocking while implementing task tagging (#32). This will remove this bug, and will simplify task scheduling.

This requires a small behavior change: when no worker is connected to the scheduler, the client will receive a TaskStatus.NoWorker error, while currently the task is queued until a worker connects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant