Nightly CUDA builds are failing #8868

Open
tengyifei opened this issue Mar 20, 2025 · 1 comment
Labels: bug, build, CI, xla:gpu

tengyifei (Collaborator) commented:

Examples: https://github.com/pytorch/xla/runs/39089457932, https://github.com/pytorch/xla/runs/39089458228

Error excerpt from Google Cloud Build:

      Loading: 1 packages loaded
      Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (2 packages loaded, 0 targets configured)
      WARNING: Download from https://mirror.bazel.build/github.com/bazelbuild/platforms/releases/download/0.0.9/platforms-0.0.7.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
      Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (36 packages loaded, 9 targets configured)
      Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (36 packages loaded, 9 targets configured)
      Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (36 packages loaded, 9 targets configured)
      Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (100 packages loaded, 730 targets configured)
      Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (202 packages loaded, 7959 targets configured)
      Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (236 packages loaded, 18188 targets configured)
      INFO: Analyzed target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (239 packages loaded, 20620 targets configured).
      INFO: Found 1 target...
      [0 / 951] [Prepa] BazelWorkspaceStatusAction stable-status.txt ... (10 actions, 0 running)
      [2,169 / 7,504] Compiling llvm/lib/Support/VersionTuple.cpp [for tool]; 0s remote-cache ... (31 actions, 0 running)
      [5,563 / 10,195] Compiling xla/service/gpu/kernels/topk_kernel_bfloat16.cu.cc; 0s remote-cache ... (31 actions, 0 running)
      [6,304 / 12,193] Compiling xla/service/cpu/runtime_single_threaded_matmul_f16.cc; 0s local, remote-cache ... (46 actions, 35 running)
      [6,711 / 12,239] Compiling xla/service/cpu/runtime_single_threaded_matmul_f16.cc; 1s local, remote-cache ... (56 actions, 48 running)
      [6,974 / 12,467] Compiling xla/service/cpu/runtime_single_threaded_matmul_f16.cc; 2s local, remote-cache ... (58 actions, 52 running)
      ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/xla/xla/stream_executor/cuda/BUILD:1896:11: Compiling xla/stream_executor/cuda/tma_util.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (from target @xla//xla/stream_executor/cuda:tma_util) external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF bazel-out/k8-opt/bin/external/xla/xla/stream_executor/cuda/_objs/tma_util/tma_util.pic.d ... (remaining 158 arguments skipped)
      In file included from external/xla/xla/stream_executor/cuda/tma_util.cc:16:
      external/xla/xla/stream_executor/cuda/tma_util.h:25:16: error: ‘CUtensorMapDataType’ was not declared in this scope
         25 | absl::StatusOr<CUtensorMapDataType> GetTensorMapDataType(int element_size);
            |                ^~~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.h:25:35: error: template argument 1 is invalid
         25 | absl::StatusOr<CUtensorMapDataType> GetTensorMapDataType(int element_size);
            |                                   ^
      external/xla/xla/stream_executor/cuda/tma_util.h:27:1: error: ‘CUtensorMapSwizzle’ does not name a type
         27 | CUtensorMapSwizzle GetTensorMapSwizzle(TmaDescriptor::TmaSwizzle swizzle);
            | ^~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.h:29:1: error: ‘CUtensorMapL2promotion’ does not name a type
         29 | CUtensorMapL2promotion GetTensorMapL2Promotion(
            | ^~~~~~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.h:32:1: error: ‘CUtensorMapFloatOOBfill’ does not name a type
         32 | CUtensorMapFloatOOBfill GetTensorMapFloatOOBFill(
            | ^~~~~~~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.h:35:1: error: ‘CUtensorMapInterleave’ does not name a type
         35 | CUtensorMapInterleave GetTensorMapInterleave(
            | ^~~~~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.cc:26:16: error: ‘CUtensorMapDataType’ was not declared in this scope; did you mean ‘GetTensorMapDataType’?
         26 | absl::StatusOr<CUtensorMapDataType> GetTensorMapDataType(int element_size) {
            |                ^~~~~~~~~~~~~~~~~~~
            |                GetTensorMapDataType
      external/xla/xla/stream_executor/cuda/tma_util.cc:26:35: error: template argument 1 is invalid
         26 | absl::StatusOr<CUtensorMapDataType> GetTensorMapDataType(int element_size) {
            |                                   ^
      external/xla/xla/stream_executor/cuda/tma_util.cc: In function ‘int stream_executor::gpu::GetTensorMapDataType(int)’:
      external/xla/xla/stream_executor/cuda/tma_util.cc:29:14: error: ‘CU_TENSOR_MAP_DATA_TYPE_UINT8’ was not declared in this scope
         29 |       return CU_TENSOR_MAP_DATA_TYPE_UINT8;
            |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.cc:31:14: error: ‘CU_TENSOR_MAP_DATA_TYPE_UINT16’ was not declared in this scope
         31 |       return CU_TENSOR_MAP_DATA_TYPE_UINT16;
            |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.cc:33:14: error: ‘CU_TENSOR_MAP_DATA_TYPE_UINT32’ was not declared in this scope
         33 |       return CU_TENSOR_MAP_DATA_TYPE_UINT32;
            |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.cc:35:14: error: ‘CU_TENSOR_MAP_DATA_TYPE_UINT64’ was not declared in this scope
         35 |       return CU_TENSOR_MAP_DATA_TYPE_UINT64;
            |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.cc:37:40: error: cannot convert ‘absl::lts_20230802::Status’ to ‘int’ in return
         37 |       return absl::InvalidArgumentError(
            |              ~~~~~~~~~~~~~~~~~~~~~~~~~~^
            |                                        |
            |                                        absl::lts_20230802::Status
         38 |           absl::StrFormat("unsupported element size: %d", element_size));
            |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.cc: At global scope:
      external/xla/xla/stream_executor/cuda/tma_util.cc:42:1: error: ‘CUtensorMapSwizzle’ does not name a type
         42 | CUtensorMapSwizzle GetTensorMapSwizzle(TmaDescriptor::TmaSwizzle swizzle) {
            | ^~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.cc:55:1: error: ‘CUtensorMapL2promotion’ does not name a type
         55 | CUtensorMapL2promotion GetTensorMapL2Promotion(
            | ^~~~~~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.cc:69:1: error: ‘CUtensorMapFloatOOBfill’ does not name a type
         69 | CUtensorMapFloatOOBfill GetTensorMapFloatOOBFill(
            | ^~~~~~~~~~~~~~~~~~~~~~~
      external/xla/xla/stream_executor/cuda/tma_util.cc:79:1: error: ‘CUtensorMapInterleave’ does not name a type
         79 | CUtensorMapInterleave GetTensorMapInterleave(
            | ^~~~~~~~~~~~~~~~~~~~~
      Target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so failed to build
      Use --verbose_failures to see the command lines of failed build steps.
      INFO: Elapsed time: 45.902s, Critical Path: 4.05s
      INFO: 7037 processes: 2405 remote cache hit, 4611 internal, 21 local.
      FAILED: Build did NOT complete successfully
      INFO: Streaming build results to: https://source.cloud.google.com/results/invocations/7d4b02f3-cb2b-4ea6-85bd-b93f1ac969dd
      Traceback (most recent call last):
        File "/usr/local/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/usr/local/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/local/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-yehbnrwl/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 334, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=[])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-yehbnrwl/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 304, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-yehbnrwl/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 320, in run_setup
          exec(code, locals())
        File "<string>", line 11, in <module>
        File "/src/pytorch/xla/plugins/cuda/../../build_util.py", line 67, in bazel_build
          subprocess.check_call(bazel_argv, stdout=sys.stdout, stderr=sys.stderr)
        File "/usr/local/lib/python3.11/subprocess.py", line 413, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['bazel', 'build', '@xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so', '--symlink_prefix=/src/pytorch/xla/plugins/cuda/bazel-', '--config=remote_cache', '--config=cuda', '--remote_default_exec_properties=cache-silo-key=cache-silo-amd64-cuda-17']' returned non-zero exit status 1.
      error: subprocess-exited-with-error
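
The failing identifiers (CUtensorMapDataType, CUtensorMapSwizzle, CUtensorMapL2promotion, CUtensorMapFloatOOBfill, CUtensorMapInterleave) are CUDA driver API enums that, to my knowledge, first appear in the CUDA 12.0 headers, so the errors are consistent with tma_util.h being compiled against an older cuda.h. A minimal sketch (not the actual XLA code, just an illustration of the dependency) that would turn the failure into an explicit version error:

      // Sketch only: CUtensorMap* enums come from the CUDA driver API header
      // <cuda.h> and are absent from pre-12.0 toolkits, which reproduces the
      // "does not name a type" errors above.
      #include <cuda.h>  // defines CUDA_VERSION, e.g. 12080 for CUDA 12.8

      #if CUDA_VERSION < 12000
      #error "CUtensorMap APIs require CUDA 12.0 or newer"
      #endif

      #include "absl/status/statusor.h"

      // With a new enough toolkit, the declaration from tma_util.h:25 compiles:
      absl::StatusOr<CUtensorMapDataType> GetTensorMapDataType(int element_size);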
  
ysiraichi added the xla:gpu and CI labels on Mar 20, 2025

ysiraichi (Collaborator) commented:

This looks like a CUDA version problem. Maybe we need to update OpenXLA to a version that supports CUDA 12.8.
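
To confirm which toolkit the build image actually provides (and, where a driver is installed, what it supports), a hypothetical stand-alone check, not part of the build, could look like this:

      // Hypothetical diagnostic: print the compile-time toolkit version and
      // the CUDA version supported by the installed driver, if any.
      // Build with the CUDA include path and link against -lcuda.
      #include <cstdio>
      #include <cuda.h>

      int main() {
        // CUDA_VERSION reflects the toolkit headers used at compile time,
        // e.g. 12080 for CUDA 12.8.
        std::printf("compile-time CUDA_VERSION: %d\n", CUDA_VERSION);

        int driver_version = 0;
        // cuDriverGetVersion works without cuInit(); it reports the latest
        // CUDA version the installed driver supports.
        if (cuDriverGetVersion(&driver_version) == CUDA_SUCCESS) {
          std::printf("driver supports CUDA up to: %d\n", driver_version);
        } else {
          std::printf("cuDriverGetVersion failed (no usable driver?)\n");
        }
        return 0;
      }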

ysiraichi added the bug and build labels on Mar 27, 2025