[core][rdt] Asyncio / out of order actor RDT support #58575
Conversation
Signed-off-by: dayshah <[email protected]>
Code Review
This pull request effectively resolves a potential deadlock in Ray's Direct Transport (RDT) with asyncio actors by introducing a dedicated concurrency group, _ray_system_rdt_metadata, for metadata fetching tasks. This prevents them from blocking receive operations. The renaming of the error-related concurrency group to _ray_system_rdt_error also improves clarity. The addition of the new test file, test_rdt_all_transports.py, is excellent as it specifically covers the chain of async actor calls that could trigger this deadlock. I've found one minor issue in the new test file that should be addressed.
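For background on the mechanism the fix relies on: Ray actor concurrency groups give each group its own executor, so a task blocked in one group cannot starve tasks in another. Below is a minimal sketch of that isolation property; the group and method names are illustrative, not the internal `_ray_system_rdt_*` groups used by RDT.

```python
import asyncio
import ray

ray.init()

# Illustrative only: two user-defined concurrency groups. A method blocked in
# the "metadata" group does not prevent methods in the "transfer" group from
# running -- the property a dedicated RDT metadata group provides internally.
@ray.remote(concurrency_groups={"metadata": 1, "transfer": 1})
class Worker:
    @ray.method(concurrency_group="metadata")
    async def fetch_metadata(self):
        await asyncio.sleep(30)  # stand-in for waiting on an object that isn't ready yet
        return "meta"

    @ray.method(concurrency_group="transfer")
    async def recv(self):
        return "received"

w = Worker.remote()
meta_ref = w.fetch_metadata.remote()         # blocks inside its own group
print(ray.get(w.recv.remote(), timeout=10))  # still completes promptly
```

The PR applies the same idea internally by placing the metadata-fetching task in its own `_ray_system_rdt_metadata` group so it cannot block receives.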
Signed-off-by: dayshah <[email protected]>
Hmm, I'm confused by this explanation - shouldn't `__ray_recv__` and `__ray_get_tensor_transport_metadata__` execute in order of submission since they're on the same concurrency group? Will it get fixed as long as we ensure that `__ray_recv__` will execute before `__ray_get_tensor_transport_metadata__`? I'm wondering about that because if there's no ordering guarantee, it will be hard to guarantee no deadlock in other situations.
Qiaolin-Yu left a comment
I think it looks good to me, but will this introduce new race conditions? Have you tested high concurrency test cases?
Once the actor is considered out of order (async / threaded), it uses the OutOfOrderSchedulingQueue, and we throw out any ordering semantics and just post the first task that has ready arguments to its appropriate concurrency group / fiber.
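To illustrate the lack of ordering guarantees (a minimal, self-contained sketch, not code from this PR): on an async actor, two calls submitted in order can complete in the opposite order.

```python
import asyncio
import ray

ray.init()

@ray.remote
class AsyncActor:
    async def slow(self):
        await asyncio.sleep(2)  # submitted first, but yields while sleeping
        return "slow"

    async def fast(self):
        return "fast"           # submitted second, finishes immediately

a = AsyncActor.remote()
slow_ref = a.slow.remote()
fast_ref = a.fast.remote()

# The second submission finishes first: async actors do not enforce
# submission-order execution, so RDT's system tasks cannot rely on ordering.
ready, _ = ray.wait([slow_ref, fast_ref], num_returns=1)
print(ray.get(ready[0]))  # "fast"
```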
Ya, also scared of other deadlocking situations due to `__ray_get_tensor_transport_metadata__`. I think that's why just having the separate concurrency group for [...]. And since `__ray_get_tensor_transport_metadata__` is always guaranteed to finish, all sends and recvs will always eventually be unblocked and start running, since that's their only obj ref arg. The one concerning thing is deadlocks between sends and recvs, since you could have two transfers between the same two actors and you could end up with something where [...]
Like high-concurrency actors, or just lots of interleaving transfers between 3 actors?
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Closing in favor of a different approach to support out-of-order actors fully for one-sided transports.
Description
Before this change, a simple script like this would hang forever on asyncio actors, or on any actor that doesn't enforce submission-order task execution.

This is because the intermediate actor would be a receiver for the 1st obj and a sender for the 2nd obj. This means we'll submit a `__ray_recv__` for obj 1 and a `__ray_get_tensor_transport_metadata__` for obj 2. Both tasks execute on the same `_ray_system` concurrency group and can therefore block each other from executing. On an out-of-order actor, `__ray_get_tensor_transport_metadata__` for obj 2 can start executing before the recv, and it will wait until obj 2 arrives in the GPU object store. And the creation of obj 2 relies on obj 1 being received, so it'll just wait forever.

The solution here is to simply put `__ray_get_tensor_transport_metadata__` on a separate concurrency group so it doesn't block any recvs.

Note that the nixl abort PR #56783 actually makes it so that I can't repro the hang reliably anymore, because the recv usually starts executing before the metadata get now that we pass `tensor_transport_meta` as a list of refs instead of just the ref. Passing `tensor_transport_meta` directly to recv will make it consistently hang on master.

ray/python/ray/experimental/gpu_object_manager/gpu_object_manager.py (Lines 507 to 509 in df65225)
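The repro script referenced in the description isn't reproduced above. Below is a rough sketch of the kind of workload it describes (a chain of async actors where the middle actor receives obj 1 and produces obj 2); the `tensor_transport` annotation and transport name follow Ray's experimental RDT API as best recalled and should be treated as assumptions, not the PR's actual test code.

```python
import ray
import torch

# Sketch only: three async RDT actors chained A -> B -> C. Actor B is the
# receiver for obj 1 and the sender for obj 2, the combination that could
# previously deadlock on out-of-order (asyncio) actors.
@ray.remote(num_gpus=1)
class Stage:
    @ray.method(tensor_transport="nixl")  # assumed RDT annotation
    async def produce(self):
        return torch.ones(1024, device="cuda")

    @ray.method(tensor_transport="nixl")  # assumed RDT annotation
    async def forward(self, t):
        return t * 2

    async def consume(self, t):
        return float(t.sum())

a, b, c = Stage.remote(), Stage.remote(), Stage.remote()
obj1 = a.produce.remote()
obj2 = b.forward.remote(obj1)    # B must __ray_recv__ obj 1 ...
result = c.consume.remote(obj2)  # ... while also serving metadata for obj 2
print(ray.get(result))
```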
Related issues
#56398