
Avoid unnecessary copy in TensorSource #8849

Merged: 9 commits into master on Mar 26, 2025

Conversation

@lsy323 (Collaborator) commented Mar 18, 2025

Avoid at::Tensor copy in TensorSource if it's not necessary.

The copy operation is needed in two cases (see the sketch below):

  1. On the XLA:GPU path, if the tensor is on a CUDA device, it needs to be copied to CPU and then passed to the PJRT runtime, which transfers it to the GPU. This is because @ysiraichi found that passing a CUDA tensor to the PJRT runtime doesn't work, so the roundtrip is used as a workaround.
  2. On the XLA:TPU path, if the tensor is not contiguous, the copy is needed to make the memory contiguous, because PJRT takes a raw data pointer and expects the data to be contiguous.

The copy operation needs to be blocking, since the transfer operation depends on the copied tensor.
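
For illustration, a minimal sketch of the check described above (not this PR's exact diff, which is quoted in the review below; the names tensor and target_torch_type follow the existing TensorSource code):

// Only copy when the tensor cannot be handed to PJRT as-is: PJRT reads a
// raw, contiguous host pointer of the target dtype.
bool needs_copy = tensor.device() != at::kCPU || !tensor.is_contiguous() ||
                  tensor.dtype() != target_torch_type;
if (needs_copy) {
  // Blocking copy: the PJRT transfer reads this buffer, so it must be fully
  // materialized before the transfer starts.
  tensor = tensor.to(
      at::TensorOptions().device(at::kCPU).dtype(target_torch_type),
      /*non_blocking=*/false,
      /*copy=*/true, at::MemoryFormat::Contiguous);
}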

@lsy323 changed the title from "avoid unnecessary copy in tensorsource" to "Avoid unnecessary copy in TensorSource" on Mar 19, 2025
@lsy323 force-pushed the lsiyuan/avoid-blocking-copy-tensorsource branch 2 times, most recently from 2b3b31f to 636a787 on March 19, 2025 17:24
@lsy323 force-pushed the lsiyuan/avoid-blocking-copy-tensorsource branch from 636a787 to e483f51 on March 19, 2025 17:25
@lsy323 marked this pull request as ready for review on March 19, 2025 17:26
@lsy323 requested review from ysiraichi and yaochengji on March 19, 2025 17:26
@lsy323 (Collaborator, Author) commented Mar 19, 2025

Hi @ysiraichi, just following up on our offline discussion about the copy operation. PTAL at the PR, thanks!

@ysiraichi (Collaborator) left a comment

As a side note, we can use the DLPack machinery for doing the CUDA to XLA:CUDA transfer (that wasn't implemented at the time I worked on this). I will open an issue for this.

Comment on lines 57 to 73
// The purposes of copy are:
// 1. Ensure the memory is contiguous, which is expected by PJRT.
// 2. Move CUDA tensor to CPU since we cannot pass CUDA memory to PJRT now.
// 3. Cast data type.
// We can avoid the copy if it is not needed.
if (tensor.device() == at::kCPU && tensor.is_contiguous() &&
    tensor.dtype() == target_torch_type) {
  tensor_ = std::move(tensor);
} else {
  // TODO(ysiraichi): check, first, if tensor lives in a device that the
  // current PjRt client has access. If so, we don't need to go through the
  // CPU.
  tensor_ = std::move(tensor.to(
      at::TensorOptions().device(at::kCPU).dtype(target_torch_type),
      /*non_blocking=*/false,
      /*copy=*/true, at::MemoryFormat::Contiguous));
}
@ysiraichi (Collaborator)

As far as I understand it, tensor.to(...) (without the copy argument) already checks whether it should actually copy or not. So, what do you think of reverting to the old tensor.to(...) usage, but removing the copy argument, instead?

@lsy323 (Collaborator, Author)

Hi @ysiraichi, I didn't find a tensor.to(...) overload without the copy arg in C++; is it only in Python?

@ysiraichi (Collaborator)

You are right. But I think we can just pass /*copy=*/false.

@ysiraichi (Collaborator)

In the tensor.to(...) implementation, we can see that if we keep the old implementation but pass copy=false, the function itself checks whether the tensor actually needs to be copied. No need to do it ourselves.
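
For illustration, a minimal sketch of that behavior (plain ATen usage, not code from this PR): with /*copy=*/false, Tensor::to returns the input tensor itself when no dtype, device, or layout change is required, and only materializes a new tensor when a conversion is actually needed.

#include <ATen/ATen.h>

void to_copy_false_demo() {
  // Contiguous float32 CPU tensor.
  at::Tensor t = at::rand({8, 8});

  // Already matches the requested options: with /*copy=*/false, to() is
  // expected to return `t` itself (same storage, no extra copy).
  at::Tensor same = t.to(
      at::TensorOptions().device(at::kCPU).dtype(at::kFloat),
      /*non_blocking=*/false, /*copy=*/false, at::MemoryFormat::Contiguous);

  // Dtype differs: to() sees the mismatch and materializes a new float32
  // tensor even though copy=false.
  at::Tensor d = at::rand({8, 8}, at::kDouble);
  at::Tensor converted = d.to(
      at::TensorOptions().device(at::kCPU).dtype(at::kFloat),
      /*non_blocking=*/false, /*copy=*/false, at::MemoryFormat::Contiguous);
}

If that holds, the hand-written if/else in the diff becomes redundant, which is what the suggestion above relies on.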

@pgmoka (Collaborator) left a comment

Left one small NIT. Otherwise, LGTM. If you could address the NIT before submission, that might be nice.

  tensor_ = std::move(tensor);
} else {
  TORCH_LAZY_COUNTER("AtenSourceTensorCopy", 1);
  // TODO(ysiraichi): check, first, if tensor lives in a device that the
@pgmoka (Collaborator)

NIT: I personally prefer to have TODOs linked to issues, and then have those issues assigned to people. That way, things can be more easily followed up if a contributor is no longer active.

@lsy323 (Collaborator, Author) commented Mar 24, 2025

@ysiraichi @tengyifei Actually, I gave this a second thought. Skipping the copy seems unsafe if the underlying PJRT transfer is async:

def gen_rand_on_xla():
  t = torch.rand(100000)
  t_xla = t.to('xla') # Async transfer start
  return t_xla

my_rand_on_xla = gen_rand_on_xla()
# After the line above, the async transfer may not be done yet, but the CPU
# tensor `t` is already out of scope and may be gc'ed.

@ysiraichi (Collaborator)

Hmm... I don't think this is the case. That's because PyTorch tensors are ref-counted on the C++ side (unless otherwise specified). So, if we hold the data through a C++ Tensor, it won't be deleted.
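
For illustration, a minimal sketch of that point (illustrative only, not code from this PR): at::Tensor is a reference-counted handle, so a copy of the handle held by TensorSource keeps the underlying storage alive even after the caller's handle goes out of scope.

#include <ATen/ATen.h>

// Illustrative only: holding an at::Tensor member keeps its storage alive.
struct HoldsTensor {
  at::Tensor held;  // copying the handle bumps the TensorImpl refcount
};

HoldsTensor make_holder() {
  at::Tensor cpu = at::rand({100000});
  HoldsTensor h{cpu};  // refcount is now 2 (cpu and h.held)
  return h;            // `cpu` goes out of scope here; the data is not freed
                       // while `h.held` (or any other handle) still exists.
}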

@ysiraichi (Collaborator)

By the way, I still don't think you need the if-else there. I believe you can just leave the old tensor.to(...), while setting copy=false. PyTorch should verify whether a copy is needed for us.

@ysiraichi (Collaborator) left a comment

LGTM.

@lsy323 merged commit 8dc5b49 into master on Mar 26, 2025
23 checks passed
@lsy323 (Collaborator, Author) commented Mar 26, 2025

> By the way, I still don't think you need the if-else there. I believe you can just leave the old tensor.to(...), while setting copy=false. PyTorch should verify whether a copy is needed for us.

Thank you so much @ysiraichi for the suggestion! Updated accordingly.
