Introduce CUDA OpenXLA fallback. #7318

Merged · 27 commits · Jul 3, 2024
Conversation

ysiraichi
Collaborator

This PR introduces the OpenXLA fallback on PyTorch GPU eager mode. Instead of running fallback operations (i.e. whenever an operation has no XLA lowering implemented) on CPU, we now make it possible to run them on GPU. This makes sense especially when using XLA:CUDA devices.

In summary, this PR introduces the following changes:

  • Rename xla_cpu_fallback to xla_fallback
    • Change every call site that manually invokes the fallback
  • Implement the cuda_fallback function
    • A version of at::native::cpu_fallback with a few changes (called out before each function)
    • Ideally, it would be better to generalize the at::native::cpu_fallback implementation inside PyTorch, though
  • Add the XLA_FALLBACK_CUDA flag for enabling this feature (see the usage sketch after this list)
  • Add tests for fallback operations that are found in torchbench
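For reference, a minimal usage sketch (not part of this PR's diff) of how the flag is meant to be used; it assumes a working XLA:CUDA setup:

# Enable the CUDA OpenXLA fallback introduced by this PR (off by default)
# before importing torch_xla.
import os
os.environ["XLA_FALLBACK_CUDA"] = "1"
os.environ["PJRT_DEVICE"] = "CUDA"

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(4, 4, device=device)
# Operations with an XLA lowering run through XLA as usual; any operation
# without a lowering now falls back to CUDA eager instead of CPU.
y = x @ x
print(y.cpu())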

cc @miladm @JackCaoG @vanbasten23

@ysiraichi
Collaborator Author

I'm still running torchbench. Will report back when it is over.

test/test_ops.py Outdated
@dataclass
class AllowedFallbackOpInfoEntry(AllowedOpInfoEntry):
  fallback_ops: List[str] = field(default_factory=list)
  allow_sample: Optional[Callable[[SampleInput], bool]] = None
Collaborator

What does allow_sample mean?

Collaborator Author

It filters the sample list, looking for a specific one. I will leave a comment there.

Collaborator

Is SampleInput the sample list that you are referring to?

Collaborator Author

SampleInput is the class that represents one sample.
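For illustration, a hypothetical entry (the op name, fallback op list, and predicate below are made up, not taken from this PR's test list) showing how allow_sample filters individual SampleInput objects:

# Hypothetical usage of AllowedFallbackOpInfoEntry: allow_sample is a predicate
# over a single SampleInput, used to keep only the samples that should be
# exercised through the fallback path (here: 2D inputs only).
entry = AllowedFallbackOpInfoEntry(
    "linalg.lstsq",                       # made-up op name
    fallback_ops=["aten::linalg_lstsq"],  # made-up fallback op list
    allow_sample=lambda sample: sample.input.dim() == 2,
)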

@@ -211,18 +213,6 @@ cc_library(
],
)

ptxla_cc_library(
Collaborator

What's this change for?

Collaborator Author

Now, aten_cpu_fallback.cpp needs functions from:

  • aten_xla_bridge.cpp
  • xla_graph_executor.cpp
  • dl_convertor.cpp

So, I thought it would be easier to merge it into the main library.

return runtime::sys_util::GetEnvBool("XLA_FALLBACK_CUDA", false);
}

// Change: use of std::any_of instead of iterating with a for-loop.
Collaborator

nit: the comment is stale now?

Collaborator Author

No. This is the change that I applied to that function.

Collaborator

nit: maybe rephrase Change: to Change made:? Change: sounds like it is something we want to change next lol

}

// Synchronizes the CUDA device being used by PyTorch.
static void torch_cuda_synchronize(at::DeviceIndex common_device) {
Collaborator

What does the param common_device mean? The device to be synchronized?

Collaborator Author

Yes.

// 1. Track the device index being used. Rationale: we synchronize the device
// before crossing device borders for correctness.
//
void cuda_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack,
Collaborator

I found similar implementations in PyTorch, one in pytorch/aten/src/ATen/native/CPUFallback.cpp and the other one in pytorch/torch/csrc/lazy/ts_backend/ts_eager_fallback.cpp. How is this implementation different from those?

Collaborator Author

The main difference is that (a) we are falling back to CUDA. Here are more details regarding these 3 implementations:

  • CPUFallback.cpp (b) looks more up to date than ts_eager_fallback.cpp (c), e.g. it handles mutation in tensor lists
  • Device support: even though (c) does support other devices, its conversion uses tensor.to(device), which copies the tensor. In contrast, in (a) we simply share the storage of the tensors
  • Device synchronization: (a) needs extra device synchronization, since we are not calling the tensor.to method

With all that said, I do agree the functions are all similar to each other. A better approach would be to generalize (b), adding support for some customization.
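To make the copy vs. storage-sharing distinction concrete, here is a small PyTorch-only illustration (it uses plain CUDA tensors and DLPack, not the actual XLA<->CUDA conversion path added in this PR, and it assumes a CUDA-enabled build):

import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

a = torch.randn(3, device="cuda")
copied = a.to("cpu")                # tensor.to() across devices copies the data
shared = from_dlpack(to_dlpack(a))  # a DLPack round trip shares a's storage
assert shared.data_ptr() == a.data_ptr()
assert copied.data_ptr() != a.data_ptr()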

@ysiraichi
Collaborator Author

The CI error is a bit tricky to solve.

Problem: I'm using some CUDA functions defined inside PyTorch, which requires linking libc10_cuda.so to the test binaries. However, since (in CI) PyTorch isn't being compiled with CUDA support, that won't work.

While I could gate compilation of that code behind C++ macros (e.g. using the XLA_CUDA definition), that would mean we never compile that code in CI, since PyTorch/XLA is compiled without that flag set and PyTorch is compiled without CUDA support (in that specific CI action).

Possible Solution: create a phony implementation for the CUDA functions I'm using, and compile it to another library. Then, if we don't find the library libc10_cuda.so, we link this other library.

Notice that this is only needed for the test binaries.

@JackCaoG @vanbasten23 @lezcano
What do you think?

@lezcano
Collaborator

lezcano commented Jun 21, 2024

We could also always compile PyTorch with CUDA support in CI.

@vanbasten23
Collaborator

(quoting @ysiraichi's comment above about the CI linking problem)

If it's only the test binary that requires PyTorch built with CUDA, there is a way to achieve it. In our CI, there is a workflow that builds PyTorch with CUDA, builds torch_xla with CUDA, and runs only those tests that require PyTorch with CUDA:
[screenshot of the CI workflow]
You can add your tests to:

PJRT_DEVICE=CUDA python pytorch/xla/test/dynamo/test_dynamo.py -v

@@ -31,3 +31,4 @@
WORLD_SIZE = 'WORLD_SIZE'
LOCAL_WORLD_SIZE = 'LOCAL_WORLD_SIZE'
ZERO_COPY_ENABLED = 'ZERO_COPY_ENABLED'
XLA_FALLBACK_CUDA = 'XLA_FALLBACK_CUDA'
Collaborator

I suggest we define a plan to make this feature the default (e.g. for the 2.5 release) and remove the env variable. Wdyt @ysiraichi?

Environment variables need a description here: https://github.com/pytorch/xla/blob/master/configuration.yaml

Collaborator Author

I think it makes sense to default XLA:CUDA executions to the CUDA fallback, while XLA:CPU and XLA:TPU remain on the CPU fallback.

Collaborator

As discussed offline, more specifically, after this PR, it would be good to:

Step 1: Reverse the meaning of XLA_FALLBACK_CUDA so that the CUDA fallback is enabled by default (e.g. define DISABLE_XLA_FALLBACK_CUDA).

Step 2: Define a mechanism to remove the env variable and have a native solution that just works and avoids user-experience hiccups.
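A rough sketch (not code from this PR) of what the step-1 default could look like on the Python side, assuming xm.xla_device_hw() is used to detect the device type:

import os
import torch_xla.core.xla_model as xm

def use_cuda_fallback() -> bool:
  # Proposed default: CUDA fallback on XLA:CUDA devices, CPU fallback elsewhere.
  # DISABLE_XLA_FALLBACK_CUDA is the proposed opt-out name, not yet implemented.
  if os.environ.get("DISABLE_XLA_FALLBACK_CUDA", "0") == "1":
    return False
  hw = xm.xla_device_hw(xm.xla_device())
  return hw in ("CUDA", "GPU")  # reported name depends on the runtime version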

@vanbasten23
Collaborator

For problem 1 ("Problem 1: C++ test binaries need all references to be resolved"), you mentioned the "Solution: Create a fallback implementation of the CUDA functions". Could you point me to where the fallback implementation of the CUDA functions is?

@miladm
Collaborator

miladm commented Jun 24, 2024

@zpcore to upgrade the XLA:GPU benchmarking to adopt CUDA fallback setting after this PR lands.

cc @will-cromar for visibility re: comment #7318 (comment)

Comment on lines 1 to 20
#include "torch_xla/csrc/aten_cuda_functions.h"

#include <c10/util/Exception.h>

static void fail(const char* name) {
  TORCH_CHECK(false, "Could not call the CUDA function: ", name,
              ". PyTorch was compiled without CUDA support.");
}

namespace c10::cuda {

c10::DeviceIndex current_device() noexcept { return -1; }

void set_device(c10::DeviceIndex) { fail("c10::cuda::set_device()"); }

void device_synchronize() { fail("c10::cuda::device_synchronize()"); }

} // namespace c10::cuda
Collaborator Author

@vanbasten23 this is the fallback implementation.

@@ -137,6 +142,7 @@ ptxla_cc_test(
":torch_xla_test",
"//torch_xla/csrc/runtime:metrics",
"//torch_xla/csrc:tensor",
"//torch_xla/csrc:aten_cuda_functions",
Collaborator Author

I'm using the fallback implementation to resolve the undefined references in the C++ tests. I think this is reasonable, since we don't test the fallback in C++ tests.

@ysiraichi
Collaborator Author

@will-cromar I'm having a hard time figuring out how to make this PR work with CI. Specifically: compiling and running the fallback operations tests (in test_ops.py).

Context: I'm calling a few PyTorch CUDA functions inside a function in aten_cpu_fallback.cpp. The implementations of these functions live in libc10_cuda.so.

Problem: In the CI action where we compile PyTorch/XLA, we actually compile both PyTorch and PyTorch/XLA without CUDA support. In other words, libc10_cuda.so is not created.

  • When we try to import torch_xla in that same CI action, it fails because _XLAC has undefined references to the CUDA functions
  • We can't conditionally compile (i.e. #ifdef XLA_CUDA) the CUDA functions, since it would mean CUDA OpenXLA fallback never gets compiled

Proposed Solution: have 2 libraries: _XLAC_cpu (no CUDA OpenXLA fallback) and _XLAC_cuda.

  • Conditionally import either of them, depending on whether PyTorch was compiled with support for CUDA
  • Create an alias like so: import _XLAC_cpu as _XLAC for backwards compatibility

I know this is not a pretty solution, so do you have any suggestions?

@will-cromar
Collaborator

Hey @ysiraichi, I'll spend some more time going over this PR tomorrow to try to understand it better.

We were just preparing to remove the separate GPU variant of the main torch_xla package by moving the GPU runtime implementation to a PJRT plugin. PyPI doesn't support any sort of platform tag that would let us release separate stable TPU and GPU variants of the main package. We need to figure out how to build one variant of the torch_xla package so everyone can just pip install torch_xla.

Most of the team that is building from source is doing so on TPUs realistically, so it is a nice convenience to not have to build the CUDA version of PyTorch first. Obviously adding the CUDA torch build to the critical path on the CI will be a significant overhead as well. But if we can use a pre-built PyTorch package somehow, I actually don't mind if we use the regular CUDA torch package as a build dependency, since my main concern is how slow that build is. cc @JackCaoG since we've talked about this possibility a few times but never had a good enough reason to add this option

I don't fully understand after skimming the PR why we need libc10_cuda at build time. Can that be dynamically loaded as needed?

@ysiraichi
Collaborator Author

I don't fully understand after skimming the PR why we need libc10_cuda at build time. Can that be dynamically loaded as needed?

It can be loaded at runtime. However, it can't be loaded conditionally. At least, not like this.

Loading conditionally ("as needed") was, in fact, the solution that I was proposing. We could have a separate library with a phony implementation of these CUDA functions. Then, import it only if we are in an environment where PyTorch has no CUDA support.

Let me put together a first implementation. We can remove it if that's not what we want.

@ysiraichi
Collaborator Author

I have worked on this for a while now, trying a bunch of things. Unfortunately, none of them worked. Here's the current state of things:

What I tried:

  • Created a new Python library _XLAC_cuda_functions.so that holds the definition for the c10::cuda functions I need
    • Idea: introduce a definition to those c10::cuda functions whenever PyTorch doesn't have CUDA support
  • Modified the torch_xla/__init__.py so that we import this library if not torch.cuda.is_available()
    • If CUDA is available, we rely on import torch to load libc10_cuda.so, which brings definition to c10::cuda functions

What is happening:

  • Even though I'm able to import the new _XLAC_cuda_functions.so library, I'm still getting undefined reference for c10::cuda functions when import torch_xla is called

I'm not sure why this is not working given that:

$ nm -CD _XLAC.cpython-310-x86_64-linux-gnu.so | grep c10::cuda
                 U c10::cuda::set_device(signed char)
                 U c10::cuda::current_device()
                 U c10::cuda::device_synchronize()

$ nm -CD _XLAC_cuda_functions.cpython-310-x86_64-linux-gnu.so | grep c10::cuda
000000000002c0b1 T c10::cuda::set_device(signed char)
000000000002c09e T c10::cuda::current_device()
000000000002c0cd T c10::cuda::device_synchronize()
# This works!
$ LD_PRELOAD=./_XLAC_cuda_functions.cpython-310-x86_64-linux-gnu.so python -c "import torch_xla"

# This doesn't work...
$ python -c "import _XLAC_cuda_functions; import torch_xla"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "xla/torch_xla/__init__.py", line 11, in <module>
    import _XLAC
ImportError: xla/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14current_deviceEv

@JackCaoG @vanbasten23 @lezcano @will-cromar
Any thoughts?

@isuruf

isuruf commented Jun 27, 2024

Python imports the libraries with RTLD_LOCAL, which means the symbols from _XLAC_cuda_functions are not added to the global symbol table. You need to set RTLD_GLOBAL before importing _XLAC_cuda_functions.

import sys, os
prev = sys.getdlopenflags()
sys.setdlopenflags(prev | os.RTLD_GLOBAL)
import _XLAC_cuda_functions
sys.setdlopenflags(prev)

import torch_xla


namespace c10::cuda {

c10::DeviceIndex current_device() { fail("c10::cuda::current_device()"); }
Collaborator

This means, if PyTorch is built without CUDA, then these phony definitions will be used. Without them, doing import torch_xla would fail with something like ImportError: xla/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14current_deviceEv?

Collaborator Author

Exactly.

cc_binary(
name = "_XLAC_cuda_functions.so",
copts = [
"-fopenmp",
Collaborator

I wonder how the copts and linkopts are determined

Collaborator Author

To be honest, I just copied them from _XLAC. I guess I could get rid of them, though. What do you think?

Collaborator

Well, if it works now, feel free to keep it :p

if not torch.cuda.is_available():
  # Load _XLAC_cuda_functions to RTLD_GLOBAL, so that it can be used by _XLAC.
  flags = sys.getdlopenflags()
  sys.setdlopenflags(flags | os.RTLD_NOW | os.RTLD_GLOBAL)
Collaborator

Why use os.RTLD_NOW to perform all necessary relocations when dlopen is called? I don't see it in isuruf's example.

Collaborator Author

Internally we discussed that it would be better, just to be safe. That's because dlopen needs one of RTLD_NOW or RTLD_LAZY.

import tempfile
import warnings

import torch

if not torch.cuda.is_available():
  # Load _XLAC_cuda_functions to RTLD_GLOBAL, so that it can be used by _XLAC.
Collaborator

can you point me to the place where _XLAC_cuda_functions is used by _XLAC?

Collaborator

Oh, I guess you mean the phony definitions will be made available when we do import _XLAC below.

Collaborator Author

Yes. Exactly.

  import _XLAC_cuda_functions

  # Then, restore the original flags.
  sys.setdlopenflags(flags)
Collaborator

Do you know why we need to restore the original flags?

Collaborator Author

I'd guess that, in general, we don't really want to load things and make them available globally (i.e. some encapsulation for loaded functions).
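If we ever want to make that scoping more explicit, one option (a sketch, not what this PR does) is a small context manager that restores the flags even if the import raises:

import contextlib
import os
import sys

@contextlib.contextmanager
def dlopen_flags(extra):
  # Temporarily OR `extra` into the interpreter's dlopen flags, then restore.
  prev = sys.getdlopenflags()
  sys.setdlopenflags(prev | extra)
  try:
    yield
  finally:
    sys.setdlopenflags(prev)

# Expose the phony CUDA symbols globally only for this one import.
with dlopen_flags(os.RTLD_NOW | os.RTLD_GLOBAL):
  import _XLAC_cuda_functions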

// device.
//
// This variable is updated over the course of 'to_cuda' calls.
c10::DeviceIndex common_device = -1;
Collaborator

I'm confused about the common_device. Do we ever change the value after this line?

Collaborator

Would the original tgt_device be easier to understand?

Collaborator Author

We do change it inside the to_cuda function. There, we check the device of every XLA tensor, so that we are able to synchronize the computation later.

Would the original tgt_device be easier to understand?

Since they have different types and are used in different ways, I thought it would be a bit confusing to name it tgt_device.

Collaborator (vanbasten23, Jul 2, 2024)

I can see your point. Maybe add some comment such as: common_device refers to the device that all tensors should be on; Ideally, all the tensors should be on the same device. Wdyt?

Collaborator Author

Just did that.

  // Common device for all XLA tensors.
  //
  // CUDA OpenXLA fallback is supported only when all XLA tensors live in
  // the same XLA device. This field should be updated and checked every
  // time we convert an XLA tensor argument into a CUDA tensor.
  c10::Device common_device;

opt_tensors[idx] = cuda_tensors[i];
}
(*stack)[arguments_begin + idx] = c10::IValue(opt_tensors);
}
Collaborator

In the cpu implementation, there is a

else if (ivalue.isDevice()) {
  tgt_device = ivalue.toDevice();
  (*stack)[arguments_begin + idx] = c10::IValue(c10::Device(kCPU));
}

why don't we need it?

Collaborator Author

Right. I thought we didn't need it, since we would always be on XLA. But, I guess it's important to have it just to be safe.

Collaborator Author

Done. Let me know what you think.

// If any input tensors are mutable aliases, we need to
// directly copy the updated data on the CUDA tensors back to the original
// inputs.
for (const auto i : c10::irange(tensor_args_indices.size())) {
Collaborator

Does torch_xla have a concept of mutable aliases?
Also, do you know if there are specific test cases for step 3?

Collaborator Author

It doesn't. The functional layer abstracts it for PyTorch/XLA. That's why we have to propagate the results back to the input arguments that were mutated.

Collaborator Author

Also, do you know if there are specific test cases for step 3?

Not sure we have those in PyTorch/XLA. For that, we would need an operation that mutates at least one of the inputs, e.g. operations that have the out parameter.
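For reference, a hypothetical shape such a test could take (op_without_lowering is a placeholder for an operation that has no XLA lowering and therefore hits the fallback; this test is not part of the PR):

import torch
import torch_xla.core.xla_model as xm

def check_fallback_out_mutation(op_without_lowering):
  # Step 3 check: a fallback op that writes into `out=` should propagate the
  # mutation back to the original XLA tensor.
  device = xm.xla_device()
  a = torch.randn(8, device=device)
  b = torch.randn(8, device=device)
  out = torch.empty(8, device=device)
  op_without_lowering(a, b, out=out)
  expected = op_without_lowering(a.cpu(), b.cpu())
  torch.testing.assert_close(out.cpu(), expected)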

@vanbasten23
Collaborator

Mostly LGTM with minor comments.

Amazing work!

Comment on lines 58 to 88
struct DeviceInfo {
  DeviceInfo(c10::Device device, c10::DeviceIndex i = -1)
      : common_device(device), index(i) {}

  // Synchronizes the CUDA device being used by PyTorch.
  void synchronize() {
    TORCH_CHECK(index != -1, "No defined XLA tensors found for CUDA fallback: ",
                op.operator_name());

    // Save the current PyTorch device, in case it's not the same as the
    // recorded tensor device.
    c10::DeviceIndex current = c10::cuda::current_device();
    c10::cuda::set_device(index);
    c10::cuda::device_synchronize();
    c10::cuda::set_device(current);
  }

  // Common device for all XLA tensors.
  //
  // CUDA OpenXLA fallback is supported only when all XLA tensors live in
  // the same XLA device. This field should be updated and checked every
  // time we convert an XLA tensor argument into a CUDA tensor.
  c10::Device common_device;

  // CUDA device index where the tensors live in.
  //
  // This is used for synchronizing the device where the fallback operation
  // was called. This should ensure completion of the CUDA computation, in
  // order to be used by another XLA computation.
  c10::DeviceIndex index;
};
Collaborator Author

This struct helps make sure only one device is used in the CUDA OpenXLA fallback.

  • XLA tensor devices are checked against common_device
  • CUDA device index is also checked just to be safe

@ysiraichi
Collaborator Author

Running TorchBench with the --verify flag showed no new accuracy problems on inference (the flag doesn't work with training). Therefore, I will go ahead and merge this PR.

@ysiraichi ysiraichi merged commit c782e0d into master Jul 3, 2024
23 checks passed