Clang cannot detect hermetic cuda version #16877

Open
tchatow opened this issue Sep 6, 2024 · 10 comments · May be fixed by #16882

Comments

@tchatow

tchatow commented Sep 6, 2024

When using the hermetic CUDA toolchain with Clang, Clang does not actually detect the CUDA version and instead defaults to the latest toolkit version it knows about. Clang searches for external/cuda_nvcc/include/cuda.h, and when that file is not found, the version detection logic falls back to CudaVersion::NEW.

For instance, in Clang 18.1.8, CudaVersion::NEW is 12.3. When we set HERMETIC_CUDA_VERSION="12.2.0", the failed detection and the resulting mismatch cause Clang to target PTX ISA 8.3, which is incompatible with CUDA 12.2.0.

The easiest fix is probably to copy external/cuda_cudart/include/cuda.h into external/cuda_nvcc/include/cuda.h.
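
The failure mode and the proposed copy can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not Clang's actual implementation: it assumes the version is read from the `#define CUDA_VERSION` line in `include/cuda.h` (encoded as major * 1000 + minor * 10) and that a missing header means "undetected". The helper names `detect_cuda_version` and `apply_workaround` are hypothetical.

```python
import re
import shutil
from pathlib import Path
from typing import Optional, Tuple

# Matches e.g. "#define CUDA_VERSION 12020" (major * 1000 + minor * 10).
_CUDA_VERSION_RE = re.compile(
    r"^\s*#\s*define\s+CUDA_VERSION\s+(\d+)", re.MULTILINE
)


def detect_cuda_version(install_root: Path) -> Optional[Tuple[int, int]]:
    """Return (major, minor) parsed from include/cuda.h, or None.

    None models the fallback described in the issue (CudaVersion::NEW).
    """
    header = install_root / "include" / "cuda.h"
    if not header.is_file():
        return None
    match = _CUDA_VERSION_RE.search(header.read_text(errors="replace"))
    if match is None:
        return None
    encoded = int(match.group(1))
    return encoded // 1000, (encoded % 1000) // 10


def apply_workaround(cudart_root: Path, nvcc_root: Path) -> Path:
    """Copy cuda.h from the cudart repository into the nvcc repository."""
    src = cudart_root / "include" / "cuda.h"
    dst = nvcc_root / "include" / "cuda.h"
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    return dst
```

Under these assumptions, calling `detect_cuda_version(Path("external/cuda_nvcc"))` before the copy returns `None`, and after `apply_workaround(Path("external/cuda_cudart"), Path("external/cuda_nvcc"))` it returns the version encoded in the copied header.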

@johnnynunez

johnnynunez commented Sep 7, 2024

I have this error:

third_party/gpus/cuda/include/cuda.h
ts '-std=c++17' -c external/xla/xla/stream_executor/cuda/cuda_status.cc -o bazel-out/aarch64-opt/bin/external/xla/xla/stream_executor/cuda/_objs/cuda_status_cuda_only/cuda_status.pic.o)
# Configuration: d9608b0aa616855e3fabfa1f8c73e3eec1e37022bef94165bef09db9202f5654
# Execution platform: @local_execution_config_platform//:platform
In file included from external/xla/xla/stream_executor/cuda/cuda_status.cc:16:
external/xla/xla/stream_executor/cuda/cuda_status.h:22:10: fatal error: 'third_party/gpus/cuda/include/cuda.h' file not found

@johnnynunez

external/xla/xla/stream_executor/cuda/cuda_status.h:22:10: fatal error: 'third_party/gpus/cuda/include/cuda.h' file not found
#include "third_party/gpus/cuda/include/cuda.h"
python3 build/build.py --enable_cuda --cuda_compute_capabilities=sm_87 --bazel_options=--repo_env=LOCAL_CUDA_PATH="/usr/local/cuda-12.2" --bazel_options=--repo_env=LOCAL_CUDNN_PATH="/usr/lib/aarch64-linux-gnu"

System info (python version, jaxlib version, accelerator, etc.)

Jetson AGX Orin 22.04 cuda 12.2

@ybaturina

Hi @johnnynunez, it was an architectural decision to define CUDA_VERSION explicitly rather than look it up implicitly, as the old non-hermetic CUDA rules did.

So, if you are using a local source of CUDA/CUDNN redistributions (which is not recommended), you still need to pass the correct HERMETIC_CUDA_VERSION and HERMETIC_CUDNN_VERSION as parameters to the Python script.
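
As a rough sketch of what passing the versions explicitly might look like, the helper below assembles `--bazel_options=--repo_env=...` arguments in the same form as the build command posted earlier in this thread. The function name `hermetic_cuda_args` is hypothetical, and the version strings in the usage note are placeholders to replace with the versions actually installed.

```python
from typing import List, Optional


def hermetic_cuda_args(
    cuda_version: str,
    cudnn_version: str,
    local_cuda_path: Optional[str] = None,
    local_cudnn_path: Optional[str] = None,
) -> List[str]:
    """Build --bazel_options arguments that set the CUDA repo_env variables."""
    env = {
        "HERMETIC_CUDA_VERSION": cuda_version,
        "HERMETIC_CUDNN_VERSION": cudnn_version,
    }
    # Local paths are optional; they only apply to a local redistribution.
    if local_cuda_path is not None:
        env["LOCAL_CUDA_PATH"] = local_cuda_path
    if local_cudnn_path is not None:
        env["LOCAL_CUDNN_PATH"] = local_cudnn_path
    return [f"--bazel_options=--repo_env={key}={value}" for key, value in env.items()]
```

For example, `hermetic_cuda_args("12.2.0", "9.4.0", local_cuda_path="/usr/local/cuda-12.2")` yields three arguments that could be appended to the `python3 build/build.py --enable_cuda ...` invocation.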

Also, please make sure that the folder structure for CUDA, CUDNN and NCCL is exactly the same as described in the instructions. This structure matches that of the redistributions that can be downloaded from the NVIDIA site.

If you absolutely need the repository rule to discover the locally installed CUDA version, you can use the deprecated method documented here.

@johnnynunez

johnnynunez commented Sep 11, 2024

> Hi @johnnynunez it was an architectural decision to make CUDA_VERSION defined explicitly instead of looking it up implicitly in the old non-hermetic CUDA rules.

Look at the JSONs internally:
SBSA is for arm64 servers.
Tegra is for edge devices.

Both are aarch64.
#16905

> So, if you are using a local source of CUDA/CUDNN redistributions (which is not recommended), you still need to pass the correct HERMETIC_CUDA_VERSION and HERMETIC_CUDNN_VERSION in the parameters of Python script.
>
> Also please make sure that the structure of the folders with CUDA, CUDNN and NCCL is exactly the same as described in the instructions. This structure is in line with the structure of redistributions which can be downloaded from NVIDIA site.
>
> If you absolutely need repository rule to discover the CUDA version installed locally, you can use the deprecated method documented here.

Yes, I totally agree. But you, or rather XLA, make the mistake of treating SBSA as the only AARCH64 target. Jetson is a Tegra chip and also uses aarch64, but it needs different packages.

@ybaturina

ybaturina commented Sep 11, 2024

> Yes, I totally agree. But you have or rather XLA has the failure to consider SBSA as AARCH64. When the jetson is tegra chip and uses aarch64 but they are other packages.

Thank you for the clarification, I understand the issue now.
I asked about linux-aarch64 packages a while ago, and I was told that I can use linux-sbsa instead. Also I noticed that linux-sbsa had newer versions than linux-aarch64.
Is there any other indication that a Jetson platform is in use, apart from the environment variable JETSON_PLATFORM?

@johnnynunez

johnnynunez commented Sep 11, 2024

> Thank you for the clarification, I understand the issue now.
> I asked about linux-aarch64 packages a while ago, and I was told that I can use linux-sbsa instead. Also I noticed that linux-sbsa had newer versions than linux-aarch64.
> Is there any other indication that Jetson platform is used, apart from the environment variable JETSON_PLATFORM?

Hello,
Jetson now has state-of-the-art packages; it is like a PC with an RTX card. They are moving fast because Jetson Thor, based on Blackwell, is also coming at the end of the year.

Is there a JETSON_PLATFORM variable?
I ask because I didn't see it in the list of packages.

I've tried to differentiate by reading the board ID, as jetson-containers does: https://github.com/dusty-nv/jetson-containers/blob/master/jetson_containers/l4t_version.py

Jetson doesn't have NCCL.
Jetson has:
CUDA 12.6.1
cuDNN 9.4.0
TensorRT 10.4.0

Example:
[Screenshot, 2024-09-11 at 23:22:38]

@ybaturina

Hi @johnnynunez, can we use the L4T_VERSION environment variable to determine whether linux-aarch64 packages should be downloaded instead of linux-sbsa?

@johnnynunez

> Hi @johnnynunez , can we use L4T_VERSION environment variable to determine if linux-aarch64 packages should be downloaded instead of linux-sbsa?

My idea was to detect it automatically:
https://github.com/openxla/xla/pull/16905/files

@ybaturina

Do you mean the line is_jetson = repository_ctx.os.environ.get("JETSON_PLATFORM", None)? Is there a guarantee that the JETSON_PLATFORM environment variable is always present in such builds?
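
One way to depend less on a single variable would be to check several signals in turn. The sketch below is plain Python for illustration only (an actual repository rule would be written in Starlark) and is not the code from the pull request. It assumes the host is already known to be aarch64, checks the two environment variables mentioned in this thread, and then falls back to a marker file. The marker path `/etc/nv_tegra_release` and the function name `select_cuda_redist_arch` are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def select_cuda_redist_arch(
    environ: Optional[Mapping[str, str]] = None,
    tegra_marker: Path = Path("/etc/nv_tegra_release"),  # assumed marker file
) -> str:
    """Pick the redistribution architecture name for an aarch64 host."""
    env = os.environ if environ is None else environ
    # Signal 1: environment variables discussed in this thread.
    if env.get("JETSON_PLATFORM") or env.get("L4T_VERSION"):
        return "linux-aarch64"
    # Signal 2: a file assumed to exist only on Jetson (L4T) systems.
    if tegra_marker.is_file():
        return "linux-aarch64"
    # Default for aarch64 servers.
    return "linux-sbsa"
```

Taking the environment and marker path as parameters keeps the function testable on machines that are not Jetson devices.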

@ybaturina

I've posted a workaround here.
