
Unit-tests occasionally hang indefinitely #33

Closed
rafa-be opened this issue Oct 11, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@rafa-be
Collaborator

rafa-be commented Oct 11, 2024

The unit-tests occasionally hang indefinitely, such as in:

Looking at the logs, it seems like TestNestedTask.test_multiple_recursive_task hangs when trying to connect a client to a previously shut down cluster:

2024-10-10T16:35:30.0739865Z [INFO]2024-10-10 16:35:30+0000: SchedulerClusterCombo: shutdown
2024-10-10T16:35:30.0742365Z [INFO]2024-10-10 16:35:30+0000: Cluster: received signal, shutting down
2024-10-10T16:35:30.0743449Z [INFO]2024-10-10 16:35:30+0000: Cluster: shutting down worker[3556]
2024-10-10T16:35:30.0744563Z [INFO]2024-10-10 16:35:30+0000: Cluster: shutting down worker[3558]
2024-10-10T16:35:30.0745707Z [INFO]2024-10-10 16:35:30+0000: Cluster: shutting down worker[3562]
[...]
2024-10-10T16:45:29.8598683Z [EROR]2024-10-10 16:45:29+0000: ClientAgent: client timeout when connecting to tcp://127.0.0.1:23456
2024-10-10T16:45:29.8600502Z [INFO]2024-10-10 16:45:29+0000: AsyncConnector: exited
2024-10-10T16:45:29.8601398Z [INFO]2024-10-10 16:45:29+0000: AsyncConnector: exited
2024-10-10T16:45:29.8612268Z [EROR]2024-10-10 16:45:29+0000: ClientAgent: client timeout when connecting to tcp://127.0.0.1:23456
2024-10-10T16:45:29.8613466Z [INFO]2024-10-10 16:45:29+0000: AsyncConnector: exited
2024-10-10T16:45:29.8614078Z [INFO]2024-10-10 16:45:29+0000: AsyncConnector: exited
2024-10-10T16:45:30.0387195Z [EROR]2024-10-10 16:45:30+0000: ClientAgent: client timeout when connecting to tcp://127.0.0.1:23456
2024-10-10T16:45:30.0388439Z [INFO]2024-10-10 16:45:30+0000: AsyncConnector: exited
2024-10-10T16:45:30.0389342Z [INFO]2024-10-10 16:45:30+0000: AsyncConnector: exited
2024-10-10T16:45:30.4484871Z [EROR]2024-10-10 16:45:30+0000: ClientAgent: client timeout when connecting to tcp://127.0.0.1:23456
2024-10-10T16:45:30.4486125Z [INFO]2024-10-10 16:45:30+0000: AsyncConnector: exited
2024-10-10T16:45:30.4487236Z [INFO]2024-10-10 16:45:30+0000: AsyncConnector: exited

The test case computes a recursive Fibonacci sequence (with n = 8), which places a moderately heavy load on Scaler (67 tasks, with up to 7 levels of nested tasks).
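
For reference, the nested-task structure of that test looks roughly like this (an illustrative sketch only, with a ThreadPoolExecutor standing in for the Scaler client; in the real test each submit() becomes a nested Scaler task):

from concurrent.futures import ThreadPoolExecutor

# Enough workers so that the blocking parent tasks cannot starve the leaf tasks.
pool = ThreadPoolExecutor(max_workers=70)

def fibonacci(n: int) -> int:
    if n <= 1:
        return n
    a = pool.submit(fibonacci, n - 1)  # child task, one nesting level deeper
    b = pool.submit(fibonacci, n - 2)
    return a.result() + b.result()     # the parent blocks until both children finish

print(fibonacci(8))  # 21; the call tree has 67 nodes and up to 7 levels of nesting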

It's not clear what the problem is, but I suspect a few possible causes:

  • A concurrency bug in the nested task mechanism;
  • Clients running inside the processors might not terminate properly, preventing the cluster from shutting down;
  • An unexpected, buggy interaction between Scaler's balancing and nested task mechanisms;
  • The OS or the GitHub Actions runtime triggering a premature shutdown of the cluster by sending a SIGINT signal.
@rafa-be rafa-be added the bug Something isn't working label Oct 11, 2024
@rafa-be rafa-be self-assigned this Oct 11, 2024
@sharpener6
Collaborator

sharpener6 commented Oct 12, 2024

Yes, I experienced the same thing today. It doesn't happen every time, but that's a bug for sure.

@rafa-be
Collaborator Author

rafa-be commented Oct 14, 2024

Yes, I'm investigating this further.

@rafa-be
Collaborator Author

rafa-be commented Oct 15, 2024

I can reproduce the error locally. During the cluster shutdown, this assertion fails:

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/rapha/scaler/scaler/worker/worker.py", line 78, in run
    self.__run_forever()
  File "/Users/rapha/scaler/scaler/worker/worker.py", line 181, in __run_forever
    self._loop.run_until_complete(self._task)
  File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/rapha/scaler/scaler/worker/worker.py", line 161, in __get_loops
    await asyncio.gather(
  File "/Users/rapha/scaler/scaler/utility/event_loop.py", line 39, in loop
    await routine()
  File "/Users/rapha/scaler/scaler/worker/agent/processor_manager.py", line 93, in routine
    await self._binder_internal.routine()
  File "/Users/rapha/scaler/scaler/io/async_binder.py", line 52, in routine
    await self._callback(source, message)
  File "/Users/rapha/scaler/scaler/worker/agent/processor_manager.py", line 312, in __on_receive_internal
    await self.__on_internal_task_result(processor_id, message)
  File "/Users/rapha/scaler/scaler/worker/agent/processor_manager.py", line 360, in __on_internal_task_result
    await self._task_manager.on_task_result(
  File "/Users/rapha/scaler/scaler/worker/agent/task_manager.py", line 74, in on_task_result
    self._queued_task_ids.remove(result.task_id)
  File "/Users/rapha/scaler/scaler/utility/queues/async_sorted_priority_queue.py", line 45, in remove
    self._queue.remove((item_id, data))
  File "/Users/rapha/scaler/scaler/utility/queues/async_priority_queue.py", line 42, in remove
    assert heapq.heappop(self._queue) == item
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
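
One way that assert can fail (a simplified illustration, not the actual queue code): heapq.heappop() always returns the current head of the heap, so it only matches the item passed to remove() if that item happens to be the highest-priority entry, e.g. when removing a suspended task that sits behind the head:

import heapq

queue = []
heapq.heappush(queue, (0, "running task"))    # heap head: smallest priority value
heapq.heappush(queue, (1, "suspended task"))  # queued behind the head

item_to_remove = (1, "suspended task")
head = heapq.heappop(queue)    # returns (0, "running task"), not the requested item
print(head == item_to_remove)  # False -> the assert in remove() above would fail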

@sharpener6
Collaborator

@rafa-be I think this PR still hangs

@rafa-be
Collaborator Author

rafa-be commented Oct 15, 2024 via email

@rafa-be
Collaborator Author

rafa-be commented Oct 21, 2024

I've made progress on the issue.

The problem occurs when a not-yet-initialized processor gets suspended.

When this occurs, the current implementation does not produce any error, but it no longer guarantees that the higher-priority task is executed first, which under some very specific conditions produces a deadlock.
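
Roughly, the ordering requirement looks like this (an illustrative sketch with made-up priorities, not Scaler's internals): a child task that a parent is waiting on must be scheduled ahead of that parent, otherwise the parent can end up waiting forever on a child that never gets to run:

import heapq

ready = []
heapq.heappush(ready, (1, "parent fib(8), waiting on fib(7)"))  # lower priority
heapq.heappush(ready, (0, "child fib(7)"))                      # must run first

priority, task = heapq.heappop(ready)
print(task)  # child fib(7) -- the ordering guarantee that is lost when a
             # not-yet-initialized processor gets suspended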

I'm working on the fix ATM.

rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 21, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 25, 2024
rafa-be added a commit to rafa-be/scaler that referenced this issue Oct 29, 2024
sharpener6 pushed a commit to sharpener6/scaler that referenced this issue Dec 16, 2024
…ynchronization issue on Linux (Citi#40)

* Fixes a bug in the async priority queue when trying to remove a suspended task.

Signed-off-by: rafa-be <[email protected]>

* Fixes a worker agent crash when trying to profile a zombie process.

Signed-off-by: rafa-be <[email protected]>

* Fixes Citi#33: processors can be suspended during the initialization phase.

Signed-off-by: rafa-be <[email protected]>

* The worker's heart-beat manager watches all worker processes, not only the active one.

Signed-off-by: rafa-be <[email protected]>

* Task priorities are now positive numbers.

Signed-off-by: rafa-be <[email protected]>

---------

Signed-off-by: rafa-be <[email protected]>
sharpener6 pushed a commit to sharpener6/scaler that referenced this issue Dec 20, 2024