
Conversation

@dayshah (Contributor) commented Nov 12, 2025

Description

Before this change, a simple script like the one below would hang forever on asyncio actors, or on any actor that doesn't enforce submission-order task execution.

import ray
import torch
# Import path for the RDT collective-group helper; assumed to be the experimental API.
from ray.experimental.collective import create_collective_group

@ray.remote(num_gpus=1)
class TestActor:
    @ray.method(tensor_transport="nccl")
    def send(self, data):
        return data.to("cuda")

    @ray.method(tensor_transport="nccl")
    def intermediate(self, gpu_data):
        return gpu_data

    async def recv(self, gpu_data):
        pass

actors = [TestActor.remote() for _ in range(3)]
create_collective_group(actors, backend="nccl")

data = torch.tensor([1, 2, 3])
send_ref = actors[0].send.remote(data)
int_ref = actors[1].intermediate.remote(send_ref)
recv_ref = actors[2].recv.remote(int_ref)
print("done", ray.get(recv_ref))

This is because the intermediate actor is a receiver for the first object and a sender for the second. That means we submit a __ray_recv__ for obj 1 and a __ray_get_tensor_transport_metadata__ for obj 2. Both tasks execute on the same _ray_system concurrency group, so either one can block the other. On an out-of-order actor, __ray_get_tensor_transport_metadata__ for obj 2 can start executing before the recv, and it will wait until obj 2 arrives in the GPU object store. But creating obj 2 depends on obj 1 being received first, so it waits forever.

The solution here is simply to put __ray_get_tensor_transport_metadata__ on a separate concurrency group so it can never block a recv.
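
For intuition, here is a minimal sketch of that idea using Ray's public concurrency-group API. The actual fix applies to Ray's internal _ray_system groups; the class, method, and group names below are illustrative only, and the sketch assumes concurrency groups behave for sync-method (threaded) actors the way they do for async actors.

import threading

import ray

@ray.remote(concurrency_groups={"transfer": 1, "metadata": 1})
class ReceiverModel:
    def __init__(self):
        self.obj_ready = threading.Event()

    @ray.method(concurrency_group="transfer")
    def recv(self):
        # Stand-in for __ray_recv__: makes the object available locally.
        self.obj_ready.set()

    @ray.method(concurrency_group="metadata")
    def get_metadata(self):
        # Stand-in for __ray_get_tensor_transport_metadata__: waits for the
        # object. If this shared the size-1 "transfer" group and got scheduled
        # first, recv() could never run and this wait would never return.
        self.obj_ready.wait()
        return "metadata"

actor = ReceiverModel.remote()
meta_ref = actor.get_metadata.remote()  # may be scheduled before recv()
recv_ref = actor.recv.remote()
print(ray.get([meta_ref, recv_ref]))    # completes because the groups are separate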

Note that the NIXL abort PR #56783 actually makes it so that I can't repro the hang reliably anymore, because the recv usually starts executing before the metadata get now that we pass tensor_transport_meta as a list of refs instead of just the ref (the snippet below). Passing tensor_transport_meta directly to recv will make it consistently hang on master.

__ray_recv__,
obj_id,
[tensor_transport_meta],

Related issues

#56398

@dayshah dayshah requested a review from a team as a code owner November 12, 2025 22:19
@dayshah dayshah added the go (add ONLY when ready to merge, run all tests) label Nov 12, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively resolves a potential deadlock in Ray's Direct Transport (RDT) with asyncio actors by introducing a dedicated concurrency group, _ray_system_rdt_metadata, for metadata fetching tasks. This prevents them from blocking receive operations. The renaming of the error-related concurrency group to _ray_system_rdt_error also improves clarity. The addition of the new test file, test_rdt_all_transports.py, is excellent as it specifically covers the chain of async actor calls that could trigger this deadlock. I've found one minor issue in the new test file that should be addressed.

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and gpu (GPU related issues) labels Nov 13, 2025
@stephanie-wang (Contributor) commented:

This is because the intermediate actor is a receiver for the first object and a sender for the second. That means we submit a __ray_recv__ for obj 1 and a __ray_get_tensor_transport_metadata__ for obj 2. Both tasks execute on the same _ray_system concurrency group, so either one can block the other. On an out-of-order actor, __ray_get_tensor_transport_metadata__ for obj 2 can start executing before the recv, and it will wait until obj 2 arrives in the GPU object store. But creating obj 2 depends on obj 1 being received first, so it waits forever.

Hmm I'm confused by this explanation - shouldn't __ray_recv__ and __ray_get_tensor_transport_metadata__ execute in order of submission since they're on the same concurrency group? Will it get fixed as long as we ensure that __ray_recv__ will execute before __ray_get_tensor_transport_metadata__? I'm wondering about that because if there's no ordering guarantee it will be hard to guarantee no deadlock in other situations.

@Qiaolin-Yu (Member) left a comment

I think it looks good to me, but will this introduce new race conditions? Have you tested high concurrency test cases?

@dayshah (Contributor, Author) commented Nov 13, 2025

Hmm I'm confused by this explanation - shouldn't __ray_recv__ and __ray_get_tensor_transport_metadata__ execute in order of submission since they're on the same concurrency group?

Once the actor is considered out-of-order (async / threaded) it uses the OutOfOrderSchedulingQueue, and we throw out any ordering semantics: we just post the first task that has ready arguments to its appropriate concurrency group / fiber.
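
A simplified, illustrative model of that behavior (not Ray's actual C++ scheduling queue, which the class name below only loosely mirrors) might look like this: tasks are posted to their concurrency group as soon as their arguments are ready, with no submission-order guarantee.

from collections import defaultdict

class OutOfOrderQueueModel:
    """Toy model of out-of-order dispatch; the real logic lives in the C++ core worker."""

    def __init__(self):
        self.waiting = []                      # (task_fn, deps_ready_fn, group)
        self.group_queues = defaultdict(list)  # group name -> runnable tasks

    def submit(self, task_fn, deps_ready_fn, group):
        self.waiting.append((task_fn, deps_ready_fn, group))
        self._dispatch_ready()

    def on_dependency_resolved(self):
        self._dispatch_ready()

    def _dispatch_ready(self):
        still_waiting = []
        for task_fn, deps_ready_fn, group in self.waiting:
            if deps_ready_fn():
                # Posted as soon as its args are ready; submission order is
                # ignored, so a later-submitted task can run before an earlier one.
                self.group_queues[group].append(task_fn)
            else:
                still_waiting.append((task_fn, deps_ready_fn, group))
        self.waiting = still_waiting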

Will it get fixed as long as we ensure that __ray_recv__ will execute before __ray_get_tensor_transport_metadata__? I'm wondering about that because if there's no ordering guarantee it will be hard to guarantee no deadlock in other situations.

Ya, I'm also worried about other deadlock situations due to __ray_get_tensor_transport_metadata__. I think that's why just having the separate concurrency group for __ray_get_tensor_transport_metadata__ is a better solution. It will never block on starting up since it has no object-ref args; it only relies on the task on the source actor finishing, which runs on the main concurrency group.

And since __ray_get_tensor_transport_metadata__ is always guaranteed to finish, all sends and recvs will always eventually be unblocked and start running, since that's their only object-ref arg. The one concerning thing is deadlocks between sends and recvs: with two transfers between the same two actors you could end up with something like the following (see the sketch after this list):

  1. send actor blocked on the send for obj 1
  2. recv actor blocked on the recv for obj 2, because it started executing before the recv for obj 1
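
To make that scenario concrete, here is a hypothetical interleaving that reuses the TestActor and create_collective_group from the repro script above (GPUs and the nccl group are still required; the schedule in the comments is one possible out-of-order ordering, not a guaranteed one):

s, r = TestActor.remote(), TestActor.remote()
create_collective_group([s, r], backend="nccl")

ref1 = s.send.remote(torch.tensor([1]))  # obj 1 lives on s
ref2 = s.send.remote(torch.tensor([2]))  # obj 2 lives on s
out1 = r.recv.remote(ref1)               # __ray_send__(obj 1) on s, __ray_recv__(obj 1) on r
out2 = r.recv.remote(ref2)               # __ray_send__(obj 2) on s, __ray_recv__(obj 2) on r

# One possible out-of-order schedule (each side has a single transfer slot):
#   s: __ray_send__(obj 1) starts and blocks until r posts the matching recv
#   r: __ray_recv__(obj 2) starts first and blocks until s posts the matching send
# Each side's only slot is occupied, so neither transfer can make progress.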

@dayshah (Contributor, Author) commented Nov 14, 2025

I think it looks good to me, but will this introduce new race conditions? Have you tested high concurrency test cases?

Do you mean high-concurrency actors, or just lots of interleaved transfers between 3 actors?

@github-actions (bot) commented:

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale (The issue is stale. It will be closed within 7 days unless there is further conversation) label Nov 29, 2025
@dayshah (Contributor, Author) commented Dec 4, 2025

Closing in favor of a different approach that fully supports out-of-order actors for one-sided transports.

@dayshah dayshah closed this Dec 4, 2025
@dayshah dayshah deleted the rdt-asyncio branch December 4, 2025 08:53