[RFC][Ray Core] Support zero-copy Pytorch tensor in Ray #26229

Open
jiaodong opened this issue Jun 30, 2022 · 6 comments
Labels
  • core: Issues that should be addressed in Ray Core
  • core-object-store
  • enhancement: Request for new feature and/or capability
  • P1: Issue that should be fixed within a few weeks
  • RFC: RFC issues

Comments

@jiaodong
Member

jiaodong commented Jun 30, 2022

Description

Previous work by @suquark that was reverted: #12344

Currently Ray supports zero-copy reads for numpy arrays, but not for PyTorch tensors. This feature has been requested by multiple folks we interacted with on the PyTorch side (TorchX, PyTorch Geometric, etc.), and we anticipate similar needs within Ray libraries down the road, such as AIR (training data ingest) and Serve (ModelMesh, cc: @sihanwang41).

From our chat with Yaroslav:

I’d love to have fast interoperability between Ray and PyTorch. Making training efficient on cheap/preemptible instances needs some experimentation, and Ray already has the right abstractions for it (imagine implementing a sync parameter server with backup workers: easy with Ray actors, hard with an RPC interface). In an ideal world, one would be able to use Ray to quickly:

1) send PyTorch tensor to another machine
2) receive PyTorch Tensor 
3) (if possible) receive it in “pinned” memory, so that the CPU-GPU transfer could happen without CPU involvement (i.e., what DataLoader does)
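As a concrete reference point for items 1) and 2), here is a minimal sketch with today's API (the actor and all names are illustrative, and the tensor currently gets copied during serialization rather than shared):

import ray
import torch

ray.init()

@ray.remote
class Receiver:
    def consume(self, t: torch.Tensor) -> float:
        # The tensor is deserialized on this worker; today this involves a copy.
        return float(t.sum())

receiver = Receiver.remote()
tensor = torch.ones(4, 8)
print(ray.get(receiver.consume.remote(tensor)))  # 32.0
# For item 3, tensor.pin_memory() returns a pinned copy, not a zero-copy view.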

As I discussed with @suquark, the previous implementation was less than ideal: it required importing torch, registering a custom serializer, and suppressing warnings from PyTorch, because when we deserialize a PyTorch tensor as an immutable object, PyTorch raises a warning saying that a torch tensor cannot be read-only.
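For reference, that style of hook looks roughly like the sketch below, built on ray.util.register_serializer (the reverted implementation differed in its details; this is only illustrative):

import ray
import torch

def serialize_tensor(t: torch.Tensor):
    # Hand Ray a numpy view (CPU tensors only) so the bytes land in the object store.
    return t.detach().numpy()

def deserialize_tensor(arr):
    # arr comes back as a read-only zero-copy view, so from_numpy triggers
    # PyTorch's "non-writable tensor" warning, which the old code suppressed.
    return torch.from_numpy(arr)

ray.util.register_serializer(
    torch.Tensor, serializer=serialize_tensor, deserializer=deserialize_tensor
)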

Our suggested approach is to enable "numpy read-only" support for PyTorch tensors and make fast Ray-PyTorch interoperability cleaner. Related issues on the PyTorch GitHub:

pytorch/pytorch#32868
pytorch/pytorch#44027
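Both issues trace back to the same warning; here is a minimal repro, independent of Ray, that mimics the read-only buffer the object store hands back:

import numpy as np
import torch

arr = np.zeros(4)
arr.flags.writeable = False  # mimic Ray's immutable object-store buffer
t = torch.from_numpy(arr)    # UserWarning: the given NumPy array is not writable,
                             # and PyTorch does not support non-writable tensors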

A bigger related RFC in PyTorch is TensorStore: pytorch/pytorch#64932

cc: @yaroslavvb @msaroufim

Use case

AIR training data ingest
Ray Serve

@jiaodong jiaodong added enhancement Request for new feature and/or capability RFC RFC issues labels Jun 30, 2022
@jiaodong jiaodong self-assigned this Jun 30, 2022
@stale

stale bot commented Oct 29, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public Slack channel.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 29, 2022
@jiaodong jiaodong removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 30, 2022
@AndyBug0

It'll benefit us a lot if Ray supports this.

@nickchomey

I'd also like this very much

@HuangLED
Contributor

+1 to supporting this.

@idthanm
Contributor

idthanm commented Jul 17, 2023

Any progress on this feature?

@alialamiidrissi

alialamiidrissi commented Jan 26, 2024

This issue is quite old but still very relevant. I found a workaround that I thought might help others.

import numpy as np
import ray
import torch

array_np = np.zeros((23, 40))
obj_ref = ray.put(array_np)
array_np = ray.get(obj_ref)               # zero-copy, read-only view into the object store
array_torch = torch.from_numpy(array_np)  # shares that buffer, no extra copy

Explanation
Ray already supports zero-copy reads for numpy arrays. Also, according to the PyTorch documentation, a tensor created with torch.from_numpy shares its memory with the source numpy array. This means it reuses the buffer already allocated by the object store instead of creating a new one.
Notes

  • When running this code, PyTorch will complain that the numpy array is read-only, but it still seems to work fine. We can make the numpy array writable with array_np.flags.writeable = True; however, this would break the immutability assumption about objects in the Ray object store (a less invasive option is to silence just the warning, as sketched after this list).
  • One can verify that the PyTorch tensors share the zero-copy memory with the following code:
array_np, array_np_2 = ray.get([obj_ref] * 2)
array_torch, array_torch_2 = torch.from_numpy(array_np), torch.from_numpy(array_np_2)
assert array_torch.data_ptr() == array_torch_2.data_ptr()  # same underlying buffer
  • There is an old merged PR that appears to solve the same problem, but the code it introduced seems to have been overwritten since: I manually ran the unit test added in that PR and it failed.
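To silence only the read-only warning without touching the writeable flag, something like the following should work (the exact message prefix is an assumption and may vary across PyTorch versions):

import warnings

import torch

with warnings.catch_warnings():
    # Match only PyTorch's non-writable-array warning; leave other warnings intact.
    warnings.filterwarnings("ignore", message="The given NumPy array is not writable")
    array_torch = torch.from_numpy(array_np)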

@jjyao jjyao added the core Issues that should be addressed in Ray Core label Mar 13, 2024
@jjyao jjyao added P1 Issue that should be fixed within a few weeks core-object-store labels Mar 13, 2024