
Unit-tests occasionally hang indefinitely #33

Closed
rafa-be opened this issue Oct 11, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@rafa-be
Collaborator

rafa-be commented Oct 11, 2024

The unit-tests occasionally hang indefinitely, such as in:

Looking at the logs, it seems like TestNestedTask.test_multiple_recursive_task hangs when trying to connect a client to a previously shut down cluster:

2024-10-10T16:35:30.0739865Z [INFO]2024-10-10 16:35:30+0000: SchedulerClusterCombo: shutdown
2024-10-10T16:35:30.0742365Z [INFO]2024-10-10 16:35:30+0000: Cluster: received signal, shutting down
2024-10-10T16:35:30.0743449Z [INFO]2024-10-10 16:35:30+0000: Cluster: shutting down worker[3556]
2024-10-10T16:35:30.0744563Z [INFO]2024-10-10 16:35:30+0000: Cluster: shutting down worker[3558]
2024-10-10T16:35:30.0745707Z [INFO]2024-10-10 16:35:30+0000: Cluster: shutting down worker[3562]
[...]
2024-10-10T16:45:29.8598683Z [EROR]2024-10-10 16:45:29+0000: ClientAgent: client timeout when connecting to tcp://127.0.0.1:23456
2024-10-10T16:45:29.8600502Z [INFO]2024-10-10 16:45:29+0000: AsyncConnector: exited
2024-10-10T16:45:29.8601398Z [INFO]2024-10-10 16:45:29+0000: AsyncConnector: exited
2024-10-10T16:45:29.8612268Z [EROR]2024-10-10 16:45:29+0000: ClientAgent: client timeout when connecting to tcp://127.0.0.1:23456
2024-10-10T16:45:29.8613466Z [INFO]2024-10-10 16:45:29+0000: AsyncConnector: exited
2024-10-10T16:45:29.8614078Z [INFO]2024-10-10 16:45:29+0000: AsyncConnector: exited
2024-10-10T16:45:30.0387195Z [EROR]2024-10-10 16:45:30+0000: ClientAgent: client timeout when connecting to tcp://127.0.0.1:23456
2024-10-10T16:45:30.0388439Z [INFO]2024-10-10 16:45:30+0000: AsyncConnector: exited
2024-10-10T16:45:30.0389342Z [INFO]2024-10-10 16:45:30+0000: AsyncConnector: exited
2024-10-10T16:45:30.4484871Z [EROR]2024-10-10 16:45:30+0000: ClientAgent: client timeout when connecting to tcp://127.0.0.1:23456
2024-10-10T16:45:30.4486125Z [INFO]2024-10-10 16:45:30+0000: AsyncConnector: exited
2024-10-10T16:45:30.4487236Z [INFO]2024-10-10 16:45:30+0000: AsyncConnector: exited

The test case computes a recursive Fibonacci sequence (with n = 8), which places a moderately heavy load on Scaler (67 tasks, with up to 7 levels of nested tasks).
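
For reference, the nested-task structure of that test looks roughly like this (an illustrative sketch only, with a ThreadPoolExecutor standing in for the Scaler client; in the real test each submit() becomes a nested Scaler task):

from concurrent.futures import ThreadPoolExecutor

# Enough workers so that the blocking parent tasks cannot starve the leaf tasks.
pool = ThreadPoolExecutor(max_workers=70)

def fibonacci(n: int) -> int:
    if n <= 1:
        return n
    a = pool.submit(fibonacci, n - 1)  # child task, one nesting level deeper
    b = pool.submit(fibonacci, n - 2)
    return a.result() + b.result()     # the parent blocks until both children finish

print(fibonacci(8))  # 21; the call tree has 67 nodes and up to 7 levels of nesting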

It's not clear what the problem is, but I suspect a few possible causes:

  • A concurrency bug in the nested task mechanism;
  • Clients running inside the processors might not terminate properly, preventing the cluster from shutting down;
  • An unexpected, buggy interaction between Scaler's balancing and nested task mechanisms;
  • The OS or the GitHub Actions runtime triggering a premature shutdown of the cluster by sending a SIGINT signal.
@rafa-be rafa-be added the bug Something isn't working label Oct 11, 2024
@rafa-be rafa-be self-assigned this Oct 11, 2024
@sharpener6
Collaborator

sharpener6 commented Oct 12, 2024

Yes, I experienced the same thing today. It doesn't happen every time, but that's a bug for sure.

@rafa-be
Collaborator Author

rafa-be commented Oct 14, 2024

Yes, I'm investigating this further.

@rafa-be
Collaborator Author

rafa-be commented Oct 15, 2024

I can reproduce the error locally. During the cluster shutdown, this assertion fails:

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/rapha/scaler/scaler/worker/worker.py", line 78, in run
    self.__run_forever()
  File "/Users/rapha/scaler/scaler/worker/worker.py", line 181, in __run_forever
    self._loop.run_until_complete(self._task)
  File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/rapha/scaler/scaler/worker/worker.py", line 161, in __get_loops
    await asyncio.gather(
  File "/Users/rapha/scaler/scaler/utility/event_loop.py", line 39, in loop
    await routine()
  File "/Users/rapha/scaler/scaler/worker/agent/processor_manager.py", line 93, in routine
    await self._binder_internal.routine()
  File "/Users/rapha/scaler/scaler/io/async_binder.py", line 52, in routine
    await self._callback(source, message)
  File "/Users/rapha/scaler/scaler/worker/agent/processor_manager.py", line 312, in __on_receive_internal
    await self.__on_internal_task_result(processor_id, message)
  File "/Users/rapha/scaler/scaler/worker/agent/processor_manager.py", line 360, in __on_internal_task_result
    await self._task_manager.on_task_result(
  File "/Users/rapha/scaler/scaler/worker/agent/task_manager.py", line 74, in on_task_result
    self._queued_task_ids.remove(result.task_id)
  File "/Users/rapha/scaler/scaler/utility/queues/async_sorted_priority_queue.py", line 45, in remove
    self._queue.remove((item_id, data))
  File "/Users/rapha/scaler/scaler/utility/queues/async_priority_queue.py", line 42, in remove
    assert heapq.heappop(self._queue) == item
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
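
One way that assert can fail (a simplified illustration, not the actual queue code): heapq.heappop() always returns the current head of the heap, so it only matches the item passed to remove() if that item happens to be the highest-priority entry, e.g. when removing a suspended task that sits behind the head:

import heapq

queue = []
heapq.heappush(queue, (0, "running task"))    # heap head: smallest priority value
heapq.heappush(queue, (1, "suspended task"))  # queued behind the head

item_to_remove = (1, "suspended task")
head = heapq.heappop(queue)    # returns (0, "running task"), not the requested item
print(head == item_to_remove)  # False -> the assert in remove() above would fail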

@sharpener6
Collaborator

@rafa-be I think this PR still hangs

@rafa-be
Collaborator Author

rafa-be commented Oct 15, 2024 via email

@rafa-be
Collaborator Author

rafa-be commented Oct 21, 2024

I've made progress on the issue.

The problem occurs when a not-yet-initialized processor gets suspended.

When this occurs, the current implementation does not produce any error, but it no longer guarantees that the higher-priority task is executed first, which under some very specific conditions produces a deadlock.
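
Roughly, the ordering requirement looks like this (an illustrative sketch with made-up priorities, not Scaler's internals): a child task that a parent is waiting on must be scheduled ahead of that parent, otherwise the parent can end up waiting forever on a child that never gets to run:

import heapq

ready = []
heapq.heappush(ready, (1, "parent fib(8), waiting on fib(7)"))  # lower priority
heapq.heappush(ready, (0, "child fib(7)"))                      # must run first

priority, task = heapq.heappop(ready)
print(task)  # child fib(7) -- the ordering guarantee that is lost when a
             # not-yet-initialized processor gets suspended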

I'm working on the fix ATM.

rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 21, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 29, 2024
sharpener6 pushed a commit to sharpener6/scaler that referenced this issue Dec 16, 2024
…ynchronization issue on Linux (Citi#40)

* Fixes a bug in the async priority queue when trying to remove a suspended task.

Signed-off-by: rafa-be <[email protected]>

* Fixes a worker agent crash when trying to profile a zombie process.

Signed-off-by: rafa-be <[email protected]>

* Fixes Citi#33: processors can be suspended during the initialization phase.

Signed-off-by: rafa-be <[email protected]>

* The worker's heart-beat manager watches all worker processes, not only the active one.

Signed-off-by: rafa-be <[email protected]>

* Task priorities are now positive numbers.

Signed-off-by: rafa-be <[email protected]>

---------

Signed-off-by: rafa-be <[email protected]>
sharpener6 pushed a commit to sharpener6/scaler that referenced this issue Dec 20, 2024