
[core] Deserialize torch.Tensors to the correct device #50134

Open
stephanie-wang opened this issue Jan 29, 2025 · 5 comments
Labels
core (Issues that should be addressed in Ray Core), enhancement (Request for new feature and/or capability), gpu (GPU related issues), P0 (Issues that should be fixed in short order)

Comments

@stephanie-wang
Contributor

Description

Ray currently serializes torch.Tensors into the object store and then deserializes them using torch's default deserialization method. This can place the tensor on the wrong device. Ideally, deserialization should place the tensor directly on the correct device. We currently do this in Ray Compiled Graphs, but we could also support it for all Ray programs (although we cannot eliminate unnecessary copies).
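
As a rough, user-level illustration of what "place the tensor directly on the correct device" could mean (this is not Ray's internal implementation and not what Compiled Graphs does; the device-selection policy below is just a placeholder), a custom serializer registered with ray.util.register_serializer could route tensors through device-agnostic bytes and pick the device with map_location on the way out:

import io

import ray
import torch

def _serialize_tensor(t: torch.Tensor) -> bytes:
    # Detach and move to CPU so the stored bytes are device-agnostic.
    buf = io.BytesIO()
    torch.save(t.detach().cpu(), buf)
    return buf.getvalue()

def _deserialize_tensor(data: bytes) -> torch.Tensor:
    # Placeholder policy: put the tensor on the local GPU if one is visible,
    # otherwise keep it on CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return torch.load(io.BytesIO(data), map_location=device)

ray.util.register_serializer(
    torch.Tensor,
    serializer=_serialize_tensor,
    deserializer=_deserialize_tensor,
)

A real implementation would also need the opt-outs discussed below, since this policy unconditionally moves every tensor to the local GPU.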

Some questions to consider:

  • We need a way to disable this behavior, both globally and for individual tensors. If a task/actor returns a tensor whose device doesn't match its default device, we can disable the custom deserializer, but we probably also need an explicit way to override the behavior, e.g. when a CPU actor passes a CPU tensor to a GPU actor and the tensor should remain on the CPU. (A hypothetical sketch of these knobs follows this list.)
  • We need a way to set the default torch.device context in Ray. In Compiled Graphs, this is currently done in the ChannelContext.
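
A purely hypothetical sketch of what those knobs could look like; none of these names exist in Ray today, they only make the open questions concrete:

from dataclasses import dataclass
from typing import Optional

import torch

@dataclass
class TorchTensorDeserializationConfig:
    # Hypothetical global switch: if False, fall back to torch's default
    # deserialization behavior.
    place_on_local_device: bool = True
    # Hypothetical process-wide default device, analogous to what Compiled
    # Graphs keeps in its ChannelContext.
    default_device: Optional[torch.device] = None

@dataclass
class KeepDevice:
    # Hypothetical per-tensor wrapper: a deserializer could leave the wrapped
    # tensor on whatever device it was serialized from (e.g. CPU).
    tensor: torch.Tensor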

Example:

import ray
import torch

@ray.remote(num_gpus=1)
class Actor:
    def alloc(self):
        return torch.randn(1000, device="cuda")

    def read(self, tensor):
        return tensor.device

# GPU 0
a = Actor.remote()
# GPU 1
b = Actor.remote()

t = a.alloc.remote()
# Returns GPU 0.
ray.get(a.read.remote(t))
# Returns GPU 1.
ray.get(b.read.remote(t))
# Driver has no GPUs. Returns CPU.
ray.get(t).device

Use case

No response

stephanie-wang added the enhancement, triage, and core labels on Jan 29, 2025
jjyao added the P0 and gpu labels and removed the triage label on Jan 30, 2025
@edoakes
Contributor

edoakes commented Jan 30, 2025

@stephanie-wang can you explain more about your thinking re: "we cannot eliminate unnecessary copies" (you've thought about this a lot more than me).

One direction I was thinking is to use the storage._shared_cuda() API to get a CUDA IPC handle for the tensor, then pass it to the downstream actor and reconstruct the tensor pointing at the same device memory. The tricky part is keeping the reference to the tensor alive in the process that produces it, but we should be able to plug into the existing reference counting implementation for that.
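
For illustration only (not a proposal for Ray's internals), here is roughly what that direction looks like using torch.multiprocessing's reduction helpers rather than the raw storage API; it assumes CUDA is available, both processes are on the same node, and the producer keeps the original tensor alive:

import pickle

import torch
from torch.multiprocessing.reductions import reduce_tensor

# Producer process: export a CUDA IPC handle instead of copying the data.
t = torch.randn(1000, device="cuda")
rebuild_fn, rebuild_args = reduce_tensor(t)
payload = pickle.dumps((rebuild_fn, rebuild_args))  # small; device memory stays put

# Consumer process (same node): reconstruct a tensor that aliases the same
# device memory, with no host round trip. The producer must keep `t` alive
# for as long as this view is in use.
rebuild_fn, rebuild_args = pickle.loads(payload)
shared_view = rebuild_fn(*rebuild_args)

Plugging this into Ray would mean tying the producer-side lifetime of t to the object's reference count, which is exactly the tricky part mentioned above.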

@edoakes
Contributor

edoakes commented Jan 30, 2025

I think we should also introduce a warning when this happens as it's likely unexpected to the user. Something like "serializing GPU tensor foobar, a CPU copy will be made, to avoid this you can do ..."
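
One possible shape for that warning, assuming a hook on the serialization path like the sketch in the issue description; the message text is illustrative, not an existing Ray log:

import warnings

import torch

def _warn_on_gpu_tensor_copy(t: torch.Tensor) -> None:
    # Hypothetical check run when a tensor is about to enter the object store.
    if t.device.type == "cuda":
        warnings.warn(
            f"Serializing GPU tensor with shape {tuple(t.shape)} on {t.device}: "
            "a CPU copy will be made in the object store. See the docs for how "
            "to avoid this copy.",
            UserWarning,
        )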

@sven1977
Contributor

Awesome! Thanks for opening this issue @stephanie-wang.

I think we should also introduce a warning when this happens as it's likely unexpected to the user. Something like "serializing GPU tensor foobar, a CPU copy will be made, to avoid this you can do ..."

Yes, I think the warning makes sense. When I tried it the first time with just one GPU, I thought the tensor would NOT be copied to CPU in between, because it magically came right out of the object store on the correct device. That's why my intuition was that direct GPU-tensor handover between actors sharing the same GPU was already implemented.

@stephanie-wang
Contributor Author

The tricky part is keeping the reference to the tensor alive in the process that produces it, but we should be able to plug into the existing reference counting implementation for that.

That's basically the idea behind the proposed GPU support for the Ray Core API :) But I want to avoid doing this without the associated API changes, because it brings up questions of how we should manage the GPU data on the sending actor:

  • The user may modify the tensor after returning it from the task. In contrast, the object store approach serializes an immutable copy before returning control to the actor. There isn't much we can do to get around this even with the new API, but at least requiring the user to annotate functions would make them more aware of the problem.
  • We need to buffer GPU data on the sending and receiving actors. If we buffer too much data, we will run out of GPU memory.

I think it would be better to have a dumb, kind of slow, but fully reliable approach for the normal Ray Core API, and then we can improve on it with the GPU-native API. Anyway, it is probably a good idea to have both options for the future since the latter may take some time to stabilize.

@edoakes
Contributor

edoakes commented Jan 31, 2025

Agree with all of the above. I'd propose let's:

  1. Fix this with a minimal change for "dumb but correct"
  2. Add the IPC solution to the scope of the GPU<>GPU API proposal
  3. Once (2) is supported, add a warning message for (1) that points at the docs on how to do it.
