Unit-tests occasionally hang indefinitely #33
Comments
Yes, I experienced the same today. It doesn't happen every time, but it's a bug for sure.
Yes, I'm investigating this further.
I can reproduce the error locally. During the cluster shutdown, this assertion fails:
Traceback (most recent call last):
File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/Users/rapha/scaler/scaler/worker/worker.py", line 78, in run
self.__run_forever()
File "/Users/rapha/scaler/scaler/worker/worker.py", line 181, in __run_forever
self._loop.run_until_complete(self._task)
File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/Users/rapha/scaler/scaler/worker/worker.py", line 161, in __get_loops
await asyncio.gather(
File "/Users/rapha/scaler/scaler/utility/event_loop.py", line 39, in loop
await routine()
File "/Users/rapha/scaler/scaler/worker/agent/processor_manager.py", line 93, in routine
await self._binder_internal.routine()
File "/Users/rapha/scaler/scaler/io/async_binder.py", line 52, in routine
await self._callback(source, message)
File "/Users/rapha/scaler/scaler/worker/agent/processor_manager.py", line 312, in __on_receive_internal
await self.__on_internal_task_result(processor_id, message)
File "/Users/rapha/scaler/scaler/worker/agent/processor_manager.py", line 360, in __on_internal_task_result
await self._task_manager.on_task_result(
File "/Users/rapha/scaler/scaler/worker/agent/task_manager.py", line 74, in on_task_result
self._queued_task_ids.remove(result.task_id)
File "/Users/rapha/scaler/scaler/utility/queues/async_sorted_priority_queue.py", line 45, in remove
self._queue.remove((item_id, data))
File "/Users/rapha/scaler/scaler/utility/queues/async_priority_queue.py", line 42, in remove
assert heapq.heappop(self._queue) == item
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
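For context, the assertion above lives in a heap-backed priority queue whose remove() appears to assume that the element being removed currently sits at the top of the heap. Below is a minimal standalone sketch of that failure mode (the SimplePriorityQueue class is hypothetical, not Scaler's actual code), plus one common workaround that removes the element wherever it is and re-heapifies. The trade-off of the workaround is that list.remove() plus heapify is O(n), which is usually acceptable for the rare removal of a suspended task.

```python
import heapq

# Hypothetical illustration only; this is not Scaler's actual queue class.
class SimplePriorityQueue:
    def __init__(self):
        self._queue = []

    def put(self, item):
        heapq.heappush(self._queue, item)

    def remove_top_only(self, item):
        # Mirrors the failing pattern: it only works if `item` is the smallest
        # element, i.e. currently at the top of the heap.
        assert heapq.heappop(self._queue) == item

    def remove_anywhere(self, item):
        # One common workaround: remove the item wherever it sits, then
        # restore the heap invariant.
        self._queue.remove(item)
        heapq.heapify(self._queue)


q = SimplePriorityQueue()
q.put((1, "high-priority task"))
q.put((5, "suspended task"))

try:
    # (5, ...) is not at the top of the heap, so the pop-and-compare fails,
    # just like the AssertionError in the traceback above.
    q.remove_top_only((5, "suspended task"))
except AssertionError:
    print("AssertionError: item was not at the top of the heap")

q.put((1, "high-priority task"))  # put back the element lost by the failed pop
q.remove_anywhere((5, "suspended task"))
print(q._queue)  # [(1, 'high-priority task')]
```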
@rafa-be I think this PR still hangs
Indeed it does... It definitely solves one of the problems, but there is more.
I've made progress on the issue. The problem occurs when a not-yet-initialized processor gets suspended. When this happens, the current implementation does not raise any error, but it also does not guarantee that the higher-priority task will be executed first, which under some very specific conditions produces a deadlock. I'm working on the fix at the moment.
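To make the "no error, just a hang" symptom concrete, here is a standalone, Unix-only sketch (not Scaler code): a child process is suspended with SIGSTOP before it has finished initializing, so the parent waits for a readiness message that never arrives. Nothing raises; the handshake simply stalls. The priority-ordering aspect of the bug is Scaler-specific and is not modeled here.

```python
import os
import select
import signal
import subprocess
import sys

# Standalone illustration (not Scaler code): suspend a child process before it
# finishes initializing and the parent's readiness handshake silently stalls.

child_src = (
    "import sys, time\n"
    "time.sleep(0.5)\n"                # pretend initialization takes a moment
    "sys.stdout.write('ready\\n')\n"   # announce that initialization finished
    "sys.stdout.flush()\n"
    "time.sleep(30)\n"
)

proc = subprocess.Popen([sys.executable, "-c", child_src], stdout=subprocess.PIPE)

# Suspend the child *before* it had a chance to report readiness.
os.kill(proc.pid, signal.SIGSTOP)

# Wait up to 2 seconds for the readiness message; an agent without a timeout
# would block here forever, with no error raised anywhere.
readable, _, _ = select.select([proc.stdout], [], [], 2.0)
print("child initialized" if readable else "hang: child suspended before it initialized")

# Clean up: resume and terminate the child.
os.kill(proc.pid, signal.SIGCONT)
proc.kill()
proc.wait()
```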
…ynchronization issue on Linux (Citi#40)

* Fixes a bug in the async priority queue when trying to remove a suspended task.
* Fixes a worker agent crash when trying to profile a zombie process.
* Fixes Citi#33: processors can be suspended during the initialization phase.
* The worker's heart-beat manager watches all worker processes, not only the active one.
* Task priorities are now positive numbers.

Signed-off-by: rafa-be <[email protected]>
The unit tests occasionally hang indefinitely, such as in:
Looking at the logs, it seems like TestNestedTask.test_multiple_recursive_task hangs when trying to connect a client to a previously shut down cluster. The test case computes a recursive Fibonacci sequence (with n = 8), which puts a moderately heavy load on Scaler (67 tasks, with up to 7 levels of nested tasks).
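As a quick sanity check on those figures (assuming the test's recursion bottoms out at n <= 1, which is what the quoted numbers imply), counting the calls and the nesting depth of a naive recursive Fibonacci gives exactly 67 tasks and 7 levels:

```python
# Back-of-the-envelope check of the task count and nesting depth quoted above,
# assuming a naive recursion that bottoms out at n <= 1.

def count_tasks(n: int, depth: int = 0) -> tuple[int, int]:
    """Return (number of calls, maximum nesting depth below the root) for fib(n)."""
    if n <= 1:
        return 1, depth
    calls_left, depth_left = count_tasks(n - 1, depth + 1)
    calls_right, depth_right = count_tasks(n - 2, depth + 1)
    return 1 + calls_left + calls_right, max(depth_left, depth_right)

print(count_tasks(8))  # (67, 7) -> 67 tasks, up to 7 levels of nested tasks
```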
It's not clear what the problem is, but I'm suspecting a few possible things, one of them involving the SIGINT signal.