
Conversation

@dayshah (Contributor) commented Nov 12, 2025

Description

Before this change, a simple script like the one below would hang forever on asyncio actors, or on any actor that doesn't enforce submission-order task execution.

import ray
import torch
# Import path for the RDT collective-group helper; assumed to be the experimental API.
from ray.experimental.collective import create_collective_group

@ray.remote(num_gpus=1)
class TestActor:
    @ray.method(tensor_transport="nccl")
    def send(self, data):
        return data.to("cuda")

    @ray.method(tensor_transport="nccl")
    def intermediate(self, gpu_data):
        return gpu_data

    async def recv(self, gpu_data):
        pass

actors = [TestActor.remote() for _ in range(3)]
create_collective_group(actors, backend="nccl")

data = torch.tensor([1, 2, 3])
send_ref = actors[0].send.remote(data)
int_ref = actors[1].intermediate.remote(send_ref)
recv_ref = actors[2].recv.remote(int_ref)
print("done", ray.get(recv_ref))

This is because the intermediate actor is a receiver for the first object and a sender for the second. That means we submit a __ray_recv__ for obj 1 and a __ray_get_tensor_transport_metadata__ for obj 2. Both tasks execute on the same _ray_system concurrency group, so either one can block the other. On an out-of-order actor, __ray_get_tensor_transport_metadata__ for obj 2 can start executing before the recv, and it will wait until obj 2 arrives in the GPU object store. But creating obj 2 depends on obj 1 being received first, so it waits forever.

The solution here is simply to put __ray_get_tensor_transport_metadata__ on a separate concurrency group so it can never block a recv.
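
For intuition, here is a minimal sketch of that idea using Ray's public concurrency-group API. The actual fix applies to Ray's internal _ray_system groups; the class, method, and group names below are illustrative only, and the sketch assumes concurrency groups behave for sync-method (threaded) actors the way they do for async actors.

import threading

import ray

@ray.remote(concurrency_groups={"transfer": 1, "metadata": 1})
class ReceiverModel:
    def __init__(self):
        self.obj_ready = threading.Event()

    @ray.method(concurrency_group="transfer")
    def recv(self):
        # Stand-in for __ray_recv__: makes the object available locally.
        self.obj_ready.set()

    @ray.method(concurrency_group="metadata")
    def get_metadata(self):
        # Stand-in for __ray_get_tensor_transport_metadata__: waits for the
        # object. If this shared the size-1 "transfer" group and got scheduled
        # first, recv() could never run and this wait would never return.
        self.obj_ready.wait()
        return "metadata"

actor = ReceiverModel.remote()
meta_ref = actor.get_metadata.remote()  # may be scheduled before recv()
recv_ref = actor.recv.remote()
print(ray.get([meta_ref, recv_ref]))    # completes because the groups are separate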

Note that the NIXL abort PR #56783 actually makes it so that I can't repro the hang reliably anymore, because the recv usually starts executing before the metadata get now that we pass tensor_transport_meta as a list of refs instead of just the ref (the snippet below). Passing tensor_transport_meta directly to recv will make it consistently hang on master.

__ray_recv__,
obj_id,
[tensor_transport_meta],

Related issues

#56398

@dayshah dayshah requested a review from a team as a code owner November 12, 2025 22:19
@dayshah dayshah added the go (add ONLY when ready to merge, run all tests) label Nov 12, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively resolves a potential deadlock in Ray's Direct Transport (RDT) with asyncio actors by introducing a dedicated concurrency group, _ray_system_rdt_metadata, for metadata fetching tasks. This prevents them from blocking receive operations. The renaming of the error-related concurrency group to _ray_system_rdt_error also improves clarity. The addition of the new test file, test_rdt_all_transports.py, is excellent as it specifically covers the chain of async actor calls that could trigger this deadlock. I've found one minor issue in the new test file that should be addressed.

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and gpu (GPU related issues) labels Nov 13, 2025
@stephanie-wang (Contributor) commented:

This is because the intermediate actor is a receiver for the first object and a sender for the second. That means we submit a __ray_recv__ for obj 1 and a __ray_get_tensor_transport_metadata__ for obj 2. Both tasks execute on the same _ray_system concurrency group, so either one can block the other. On an out-of-order actor, __ray_get_tensor_transport_metadata__ for obj 2 can start executing before the recv, and it will wait until obj 2 arrives in the GPU object store. But creating obj 2 depends on obj 1 being received first, so it waits forever.

Hmm I'm confused by this explanation - shouldn't __ray_recv__ and __ray_get_tensor_transport_metadata__ execute in order of submission since they're on the same concurrency group? Will it get fixed as long as we ensure that __ray_recv__ will execute before __ray_get_tensor_transport_metadata__? I'm wondering about that because if there's no ordering guarantee it will be hard to guarantee no deadlock in other situations.

@Qiaolin-Yu (Member) left a comment

I think it looks good to me, but will this introduce new race conditions? Have you tested high concurrency test cases?

@dayshah (Contributor, Author) commented Nov 13, 2025

Hmm I'm confused by this explanation - shouldn't __ray_recv__ and __ray_get_tensor_transport_metadata__ execute in order of submission since they're on the same concurrency group?

Once the actor is considered out-of-order (async / threaded) it uses the OutOfOrderSchedulingQueue, and we throw out any ordering semantics: we just post the first task that has ready arguments to its appropriate concurrency group / fiber.
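
A simplified, illustrative model of that behavior (not Ray's actual C++ scheduling queue, which the class name below only loosely mirrors) might look like this: tasks are posted to their concurrency group as soon as their arguments are ready, with no submission-order guarantee.

from collections import defaultdict

class OutOfOrderQueueModel:
    """Toy model of out-of-order dispatch; the real logic lives in the C++ core worker."""

    def __init__(self):
        self.waiting = []                      # (task_fn, deps_ready_fn, group)
        self.group_queues = defaultdict(list)  # group name -> runnable tasks

    def submit(self, task_fn, deps_ready_fn, group):
        self.waiting.append((task_fn, deps_ready_fn, group))
        self._dispatch_ready()

    def on_dependency_resolved(self):
        self._dispatch_ready()

    def _dispatch_ready(self):
        still_waiting = []
        for task_fn, deps_ready_fn, group in self.waiting:
            if deps_ready_fn():
                # Posted as soon as its args are ready; submission order is
                # ignored, so a later-submitted task can run before an earlier one.
                self.group_queues[group].append(task_fn)
            else:
                still_waiting.append((task_fn, deps_ready_fn, group))
        self.waiting = still_waiting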

Will it get fixed as long as we ensure that __ray_recv__ will execute before __ray_get_tensor_transport_metadata__? I'm wondering about that because if there's no ordering guarantee it will be hard to guarantee no deadlock in other situations.

Ya, I'm also worried about other deadlock situations due to __ray_get_tensor_transport_metadata__. I think that's why just having the separate concurrency group for __ray_get_tensor_transport_metadata__ is a better solution. It will never block on starting up since it has no object-ref args; it only relies on the task on the source actor finishing, which runs on the main concurrency group.

And since __ray_get_tensor_transport_metadata__ is always guaranteed to finish, all sends and recvs will always eventually be unblocked and start running, since that's their only object-ref arg. The one concerning thing is deadlocks between sends and recvs: with two transfers between the same two actors you could end up with something like the following (see the sketch after this list):

  1. send actor blocked on the send for obj 1
  2. recv actor blocked on the recv for obj 2, because it started executing before the recv for obj 1
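
To make that scenario concrete, here is a hypothetical interleaving that reuses the TestActor and create_collective_group from the repro script above (GPUs and the nccl group are still required; the schedule in the comments is one possible out-of-order ordering, not a guaranteed one):

s, r = TestActor.remote(), TestActor.remote()
create_collective_group([s, r], backend="nccl")

ref1 = s.send.remote(torch.tensor([1]))  # obj 1 lives on s
ref2 = s.send.remote(torch.tensor([2]))  # obj 2 lives on s
out1 = r.recv.remote(ref1)               # __ray_send__(obj 1) on s, __ray_recv__(obj 1) on r
out2 = r.recv.remote(ref2)               # __ray_send__(obj 2) on s, __ray_recv__(obj 2) on r

# One possible out-of-order schedule (each side has a single transfer slot):
#   s: __ray_send__(obj 1) starts and blocks until r posts the matching recv
#   r: __ray_recv__(obj 2) starts first and blocks until s posts the matching send
# Each side's only slot is occupied, so neither transfer can make progress.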

@dayshah (Contributor, Author) commented Nov 14, 2025

I think it looks good to me, but will this introduce new race conditions? Have you tested high concurrency test cases?

Do you mean high-concurrency actors, or just lots of interleaved transfers between 3 actors?

@github-actions (bot) commented:

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale (The issue is stale. It will be closed within 7 days unless there is further conversation) label Nov 29, 2025
@dayshah (Contributor, Author) commented Dec 4, 2025

Closing in favor of a different approach that fully supports out-of-order actors for one-sided transports.

@dayshah dayshah closed this Dec 4, 2025
@dayshah dayshah deleted the rdt-asyncio branch December 4, 2025 08:53