
[core] Deserialize torch.Tensors to the correct device #50134

Open
stephanie-wang opened this issue Jan 29, 2025 · 5 comments
Labels
core (Issues that should be addressed in Ray Core), enhancement (Request for new feature and/or capability), gpu (GPU related issues), P0 (Issues that should be fixed in short order)

Comments

@stephanie-wang
Contributor

Description

Ray currently serializes torch.Tensors into the object store and then deserializes them using torch's default deserialization method. This can place the tensor on the wrong device. Ideally, deserialization should place the tensor directly on the correct device. We currently do this in Ray Compiled Graphs, but we could also support it for all Ray programs (although we cannot eliminate unnecessary copies).
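
As a rough, user-level illustration of what "place the tensor directly on the correct device" could mean (this is not Ray's internal implementation and not what Compiled Graphs does; the device-selection policy below is just a placeholder), a custom serializer registered with ray.util.register_serializer could route tensors through device-agnostic bytes and pick the device with map_location on the way out:

import io

import ray
import torch

def _serialize_tensor(t: torch.Tensor) -> bytes:
    # Detach and move to CPU so the stored bytes are device-agnostic.
    buf = io.BytesIO()
    torch.save(t.detach().cpu(), buf)
    return buf.getvalue()

def _deserialize_tensor(data: bytes) -> torch.Tensor:
    # Placeholder policy: put the tensor on the local GPU if one is visible,
    # otherwise keep it on CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return torch.load(io.BytesIO(data), map_location=device)

ray.util.register_serializer(
    torch.Tensor,
    serializer=_serialize_tensor,
    deserializer=_deserialize_tensor,
)

A real implementation would also need the opt-outs discussed below, since this policy unconditionally moves every tensor to the local GPU.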

Some questions to consider:

  • We need a way to disable this behavior, both globally and for individual tensors. If a task/actor returns a tensor whose device doesn't match its default device, we can disable the custom deserializer, but we probably also need an explicit way to override the behavior, e.g. when a CPU actor passes a CPU tensor to a GPU actor and the tensor should remain on the CPU. (A hypothetical sketch of these knobs follows this list.)
  • We need a way to set the default torch.device context in Ray. In Compiled Graphs, this is currently done in the ChannelContext.
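
A purely hypothetical sketch of what those knobs could look like; none of these names exist in Ray today, they only make the open questions concrete:

from dataclasses import dataclass
from typing import Optional

import torch

@dataclass
class TorchTensorDeserializationConfig:
    # Hypothetical global switch: if False, fall back to torch's default
    # deserialization behavior.
    place_on_local_device: bool = True
    # Hypothetical process-wide default device, analogous to what Compiled
    # Graphs keeps in its ChannelContext.
    default_device: Optional[torch.device] = None

@dataclass
class KeepDevice:
    # Hypothetical per-tensor wrapper: a deserializer could leave the wrapped
    # tensor on whatever device it was serialized from (e.g. CPU).
    tensor: torch.Tensor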

Example:

import ray
import torch

@ray.remote(num_gpus=1)
class Actor:
    def alloc(self):
        return torch.randn(1000, device="cuda")

    def read(self, tensor):
        return tensor.device

# GPU 0
a = Actor.remote()
# GPU 1
b = Actor.remote()

t = a.alloc.remote()
# Returns GPU 0.
ray.get(a.read.remote(t))
# Returns GPU 1.
ray.get(b.read.remote(t))
# Driver has no GPUs. Returns CPU.
ray.get(t).device

Use case

No response

stephanie-wang added the enhancement, triage, and core labels on Jan 29, 2025
jjyao added the P0 and gpu labels and removed the triage label on Jan 30, 2025
@edoakes
Contributor

edoakes commented Jan 30, 2025

@stephanie-wang can you explain more about your thinking re: "we cannot eliminate unnecessary copies" (you've thought about this a lot more than me).

One direction I was thinking is to use the storage._shared_cuda() API to get a CUDA IPC handle for the tensor, then pass it to the downstream actor and reconstruct the tensor pointing at the same device memory. The tricky part is keeping the reference to the tensor alive in the process that produces it, but we should be able to plug into the existing reference counting implementation for that.
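
For illustration only (not a proposal for Ray's internals), here is roughly what that direction looks like using torch.multiprocessing's reduction helpers rather than the raw storage API; it assumes CUDA is available, both processes are on the same node, and the producer keeps the original tensor alive:

import pickle

import torch
from torch.multiprocessing.reductions import reduce_tensor

# Producer process: export a CUDA IPC handle instead of copying the data.
t = torch.randn(1000, device="cuda")
rebuild_fn, rebuild_args = reduce_tensor(t)
payload = pickle.dumps((rebuild_fn, rebuild_args))  # small; device memory stays put

# Consumer process (same node): reconstruct a tensor that aliases the same
# device memory, with no host round trip. The producer must keep `t` alive
# for as long as this view is in use.
rebuild_fn, rebuild_args = pickle.loads(payload)
shared_view = rebuild_fn(*rebuild_args)

Plugging this into Ray would mean tying the producer-side lifetime of t to the object's reference count, which is exactly the tricky part mentioned above.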

@edoakes
Contributor

edoakes commented Jan 30, 2025

I think we should also introduce a warning when this happens as it's likely unexpected to the user. Something like "serializing GPU tensor foobar, a CPU copy will be made, to avoid this you can do ..."
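
One possible shape for that warning, assuming a hook on the serialization path like the sketch in the issue description; the message text is illustrative, not an existing Ray log:

import warnings

import torch

def _warn_on_gpu_tensor_copy(t: torch.Tensor) -> None:
    # Hypothetical check run when a tensor is about to enter the object store.
    if t.device.type == "cuda":
        warnings.warn(
            f"Serializing GPU tensor with shape {tuple(t.shape)} on {t.device}: "
            "a CPU copy will be made in the object store. See the docs for how "
            "to avoid this copy.",
            UserWarning,
        )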

@sven1977
Contributor

Awesome! Thanks for opening this issue @stephanie-wang.

I think we should also introduce a warning when this happens as it's likely unexpected to the user. Something like "serializing GPU tensor foobar, a CPU copy will be made, to avoid this you can do ..."

Yes, I think the warning makes sense. When I tried it the first time with just one GPU, I thought the tensor would NOT be copied to CPU in between, because it magically came right out of the object store on the correct device. That's why my intuition was that direct GPU-tensor handover between actors sharing the same GPU was already implemented.

@stephanie-wang
Contributor Author

The tricky part is keeping the reference to the tensor alive in the process that produces it, but we should be able to plug into the existing reference counting implementation for that.

That's basically the idea behind the proposed GPU support for the Ray Core API :) But I want to avoid doing this without the associated API changes, because it brings up questions of how we should manage the GPU data on the sending actor:

  • The user may modify the tensor after returning it from the task. In contrast, the object store approach serializes an immutable copy before returning control to the actor. There isn't much we can do to get around this even with the new API, but at least requiring the user to annotate functions would make them more aware of the problem.
  • We need to buffer GPU data on the sending and receiving actors. If we buffer too much data, we will run out of GPU memory.

I think it would be better to have a dumb, kind of slow, but fully reliable approach for the normal Ray Core API, and then we can improve on it with the GPU-native API. Anyway, it is probably a good idea to have both options for the future since the latter may take some time to stabilize.

@edoakes
Contributor

edoakes commented Jan 31, 2025

Agree with all of the above. I'd propose let's:

  1. Fix this with a minimal change for "dumb but correct"
  2. Add the IPC solution to the scope of the GPU<>GPU API proposal
  3. Once (2) is supported, add a warning message for (1) that points at the docs on how to do it.
