
Error compile: --bazel_options=--repo_env=LOCAL_CUDA_PATH="${CUDA_HOME}" issues with clang & gcc #23575

Open
benkirk opened this issue Sep 11, 2024 · 10 comments
Labels
bug Something isn't working

Comments


benkirk commented Sep 11, 2024

Description

I'm attempting to build jaxlib with a local CUDA, cuDNN, and NCCL. I'm running into different issues with gcc and clang. Any ideas?

Build command:

python build/build.py \
       --build_gpu_plugin --gpu_plugin_cuda_version=12 \
       --verbose \
       --enable_mkl_dnn \
       --enable_nccl \
       --enable_cuda \
       --cuda_compute_capabilities 8.0 \
       --target_cpu_features release \
       --bazel_options=--repo_env=LOCAL_CUDA_PATH="${CUDA_HOME}" \
       --bazel_options=--repo_env=LOCAL_CUDNN_PATH="${NCAR_ROOT_CUDNN}" \
       --bazel_options=--repo_env=LOCAL_NCCL_PATH="${PREFIX}"

clang error:

external/tsl/tsl/profiler/lib/nvtx_utils.cc:32:10: fatal error: 'third_party/gpus/cuda/include/cuda.h' file not found

gcc error:

# Configuration: d3d6c18c79c5128461901902331e6ad5ab5bc83fb9ca1bc29bc506f7fe919c16
# Execution platform: @local_execution_config_platform//:platform
gcc: error: unrecognized command-line option '--cuda-path=external/cuda_nvcc'

System info (python version, jaxlib version, accelerator, etc.)

jax:    0.4.31
jaxlib: 0.4.31
numpy:  2.1.1
python: 3.11.10 | packaged by conda-forge | (main, Sep 10 2024, 11:01:28) [GCC 13.3.0]
jax.devices (2 total, 2 local): [CudaDevice(id=0) CudaDevice(id=1)]
process_count: 1
platform: uname_result(system='Linux', node='derecho7', release='5.14.21-150400.24.18-default', version='#1 SMP PREEMPT_DYNAMIC Thu Aug 4 14:17:48 UTC 2022 (e9f7bfc)', machine='x86_64')


$ nvidia-smi
Wed Sep 11 12:37:51 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:03:00.0 Off |                    0 |
| N/A   51C    P0              68W / 300W |    429MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off | 00000000:C3:00.0 Off |                    0 |
| N/A   53C    P0              75W / 300W |    429MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     54137      C   python                                      416MiB |
|    1   N/A  N/A     54137      C   python                                      416MiB |
+---------------------------------------------------------------------------------------+
@benkirk benkirk added the bug Something isn't working label Sep 11, 2024
@johnnynunez

Unfortunately, you have to specify the CUDA and cuDNN versions explicitly; Clang does not detect them automatically.
If you are using this setup, it may be better to use JAX Toolbox:
https://github.com/NVIDIA/JAX-Toolbox
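As a sketch of the version pinning suggested above: XLA's hermetic CUDA rules document `HERMETIC_CUDA_VERSION` and `HERMETIC_CUDNN_VERSION` repo environment variables, which let Bazel download pinned versions instead of probing a local install. The version numbers below are illustrative examples, not recommendations:

```shell
# Hedged sketch: pin CUDA/cuDNN versions for XLA's hermetic CUDA rules and
# let Bazel fetch them, instead of pointing at a local installation.
# Version numbers are examples only.
python build/build.py \
       --enable_cuda \
       --cuda_compute_capabilities 8.0 \
       --bazel_options=--repo_env=HERMETIC_CUDA_VERSION="12.2.1" \
       --bazel_options=--repo_env=HERMETIC_CUDNN_VERSION="9.2.0"
```

Note that mixing hermetic version pins with `LOCAL_CUDA_PATH`-style overrides may behave differently; the XLA hermetic CUDA instructions cover the precedence.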

@ybaturina
Contributor

Hi @benkirk I'm going to update JAX docs with the link to XLA instructions.

From your command, I see that you provided environment variables:

 --bazel_options=--repo_env=LOCAL_CUDA_PATH="${CUDA_HOME}" \
       --bazel_options=--repo_env=LOCAL_CUDNN_PATH="${NCAR_ROOT_CUDNN}" \
       --bazel_options=--repo_env=LOCAL_NCCL_PATH="${PREFIX}"

Would you provide values of ${CUDA_HOME}, ${NCAR_ROOT_CUDNN} and ${PREFIX} here please?

@johnnynunez

johnnynunez commented Sep 11, 2024

Hi @benkirk I'm going to update JAX docs with the link to XLA instructions.

From your command, I see that you provided environment variables:

 --bazel_options=--repo_env=LOCAL_CUDA_PATH="${CUDA_HOME}" \
       --bazel_options=--repo_env=LOCAL_CUDNN_PATH="${NCAR_ROOT_CUDNN}" \
       --bazel_options=--repo_env=LOCAL_NCCL_PATH="${PREFIX}"

Would you provide values of ${CUDA_HOME}, ${NCAR_ROOT_CUDNN} and ${PREFIX} here please?

The problem is here:
openxla/xla#16877

This is how I avoid a lot of problems:
dusty-nv/jetson-containers#626

@johnnynunez

Also, this is necessary: https://github.com/NVIDIA/JAX-Toolbox/blob/main/.github/container/install-cudnn.sh and this: https://github.com/NVIDIA/JAX-Toolbox/blob/main/.github/container/build-jax.sh

ln -s /usr/local/cuda/lib64 /usr/local/cuda/lib

I've updated the script so that it does not download the files.

#!/bin/bash

set -e

CUDNN_MAJOR_VERSION=9
CUDA_MAJOR_VERSION=12.2
prefix=/opt/nvidia/cudnn
arch=$(uname -m)-linux-gnu
cuda_base_path="/usr/local/cuda-${CUDA_MAJOR_VERSION}"

# Check whether the specified CUDA path exists
if [[ -d "${cuda_base_path}" ]]; then
  cuda_lib_path="${cuda_base_path}/lib64"
  output_path="/usr/local/cuda-${CUDA_MAJOR_VERSION}/lib"
else
  cuda_lib_path="/usr/local/cuda/lib64"
  output_path="/usr/local/cuda/lib64"
fi

# Create a symbolic link for CUDA
sudo ln -s "${cuda_lib_path}" "${output_path}"

# Process the cuDNN files
for cudnn_file in $(dpkg -L libcudnn${CUDNN_MAJOR_VERSION} libcudnn${CUDNN_MAJOR_VERSION}-dev | sort -u); do
  if [[ -f "${cudnn_file}" || -h "${cudnn_file}" ]]; then
    nosysprefix="${cudnn_file#"/usr/"}"
    noarchinclude="${nosysprefix/#"include/${arch}"/include}"
    noverheader="${noarchinclude/%"_v${CUDNN_MAJOR_VERSION}.h"/.h}"
    noarchlib="${noverheader/#"lib/${arch}"/lib}"
    
    # Use cuda_base_path if present, otherwise /usr/local/cuda/lib64
    if [[ -d "${cuda_base_path}" ]]; then
      link_name="${cuda_base_path}/${noarchlib}"
    else
      link_name="/usr/local/cuda/lib64/${noarchlib}"
    fi
    
    link_dir=$(dirname "${link_name}")
    mkdir -p "${link_dir}"
    ln -s "${cudnn_file}" "${link_name}"
  fi
done
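The lib64 → lib symlink trick at the core of the script above can be exercised in isolation. This minimal sketch uses a throwaway mock tree instead of a real CUDA installation, so all paths and the library filename are hypothetical:

```shell
#!/bin/bash
# Minimal, self-contained demo of exposing a lib64 directory under the
# name "lib", which some build rules expect. Uses a temporary mock tree;
# no real CUDA install is touched.
set -e

root=$(mktemp -d)
mkdir -p "${root}/cuda/lib64"
touch "${root}/cuda/lib64/libcudart.so.12"   # stand-in for a real library

# Expose lib64 under the alternate name "lib":
ln -s "${root}/cuda/lib64" "${root}/cuda/lib"

# The same file is now reachable via both names:
ls -l "${root}/cuda/lib/libcudart.so.12"

rm -rf "${root}"
```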

@benkirk
Author

benkirk commented Sep 11, 2024

Thank you both. In my case:

 --bazel_options=--repo_env=LOCAL_CUDA_PATH="/glade/u/apps/common/23.08/spack/opt/spack/cuda/12.2.1" \
 --bazel_options=--repo_env=LOCAL_CUDNN_PATH="/glade/u/apps/common/23.08/spack/opt/spack/cudnn/9.2.0.82-12" \
 --bazel_options=--repo_env=LOCAL_NCCL_PATH="<my_conda_build_prefix>"

I'll attempt providing the version strings on the command line as well and follow XLA instructions.

Building from source without a container definitely wasn't my first choice, but we do need a site-provided NCCL on this machine: it has a proprietary vendor network (Slingshot 11) that needs some care and feeding.

@johnnynunez

johnnynunez commented Sep 11, 2024

Thank you both, in my case

 --bazel_options=--repo_env=LOCAL_CUDA_PATH="/glade/u/apps/common/23.08/spack/opt/spack/cuda/12.2.1" \
 --bazel_options=--repo_env=LOCAL_CUDNN_PATH="/glade/u/apps/common/23.08/spack/opt/spack/cudnn/9.2.0.82-12" \
 --bazel_options=--repo_env=LOCAL_NCCL_PATH="<my_conda_build_prefix>"

I'll attempt providing the version strings on the command line as well and follow XLA instructions.

Building from source without a container definitely wasn't my first choice, but we do have need for a site-provided NCCL on this machine, it has a proprietary vendor network - Slingshot 11 - that needs some care & feeding.

Yeah, but that doesn't work, because as I mentioned before, CUDA needs lib, not lib64, and the cuDNN files need to be renamed while maintaining a certain directory structure. It's very tricky. In the 0.4.31 release it was easier with cuda_path etc., but now JAX uses XLA's hermetic CUDA, which handles everything automatically...

@hawkinsp
Collaborator

@benkirk You don't need to build JAX from source to use a custom NCCL. We'll use whichever libnccl.so we find in your LD_LIBRARY_PATH.
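For completeness, the `LD_LIBRARY_PATH` approach might look like the following; the site NCCL prefix shown is a placeholder, and this assumes the pip-installed `jax[cuda12]` wheels:

```shell
# Hedged sketch: point a pip-installed jax[cuda12] at a site-provided NCCL
# instead of the bundled one. "/opt/site/nccl" is a placeholder prefix;
# the directory containing libnccl.so must come first on the search path.
export LD_LIBRARY_PATH="/opt/site/nccl/lib:${LD_LIBRARY_PATH}"
python -c "import jax; print(jax.devices())"
```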

@benkirk
Author

benkirk commented Sep 11, 2024

Thanks @hawkinsp, I've got my NCCL injected properly with jax[cuda12]==0.4.31 from pip. I had a few issues trying jax[cuda12_local]==0.4.31; I'll revisit that as an alternative parallel path.

@ybaturina
Contributor

Yeah, but that doesn't work, because as I mentioned before, CUDA needs lib, not lib64, and the cuDNN files need to be renamed while maintaining a certain directory structure. It's very tricky. In the 0.4.31 release it was easier with cuda_path etc., but now JAX uses XLA's hermetic CUDA, which handles everything automatically...

hi @johnnynunez, I understand your concerns, I tried to address them in the comment here.
