Debugging JAX not working within a Docker container on HPC cluster #23195

rainx0r · 2024-08-22T14:35:45Z

rainx0r
Aug 22, 2024

Hi!

I'm a PhD student very happily using JAX for my smaller ML experiments locally or on limited private compute and recently I got access to an HPC cluster that I want to use to run longer running or larger scale experiments on. This cluster relies on its users to make docker containers including any dependencies. I made a Dockerfile which essentially just installs jax[cuda12] inside the python:3.12.5-slim image. This means that CUDA inside the container is installed through pip from JAX's extras.

The resulting image works fine locally with docker run --rm --gpus all <my_image_tag> <my_script>, it sees the GPUs and runs fine, but it absolutely does not work when deployed on the HPC cluster, which is very curious. JAX does not see GPU assigned to the container from the cluster's job runner and defaults to the CPU.

The thing is, I am not sure how to proceed debugging this issue because there are actually no logs produced from JAX. Locally in the past when I've gotten the JAX install wrong I've seen things like No GPU/TPU found. Defaulting to CPU usually following some errors that are informative about the part that's going wrong (cuDNN not found, cuda not found, etc). Inside the docker container, I can't get such logs to produce. I've tried TF_CPP_MAX_VLOG_LEVEL=2 TF_CPP_MIN_LOG_LEVEL=0 from another discussion and those flags do get some XLA logs to produce but they're not very informative, mostly logs about libraries successfully loading dynamically. I've also tried setting the Python logging level to debug with logging.basicConfig(filename='example.log', encoding='utf-8', level=logging.DEBUG) and that does get some jaxlib logs to produce:

DEBUG:jax._src.path:etils.epath found. Using etils.epath for file I/O.
DEBUG:jax._src.xla_bridge:Discovered path based JAX plugin: jax_plugins.xla_cuda12
DEBUG:jax._src.xla_bridge:Discovered entry-point based JAX plugin: jax_plugins.xla_cuda12
DEBUG:jax._src.xla_bridge:Loading plugin module jax_plugins.xla_cuda12
DEBUG:jax._src.xla_bridge:registering PJRT plugin cuda from /usr/local/lib/python3.12/site-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
DEBUG:jax._src.xla_bridge:Initializing backend 'cpu'
DEBUG:jax._src.xla_bridge:Backend 'cpu' initialized
DEBUG:jax._src.xla_bridge:Initializing backend 'rocm'
INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
DEBUG:jax._src.xla_bridge:Initializing backend 'tpu'
INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory

For reference these are the logs I see with `TF_CPP_MAX_VLOG_LEVEL=2` and `TF_CPP_MIN_LOG_LEVEL=0`

>>> import jax
2024-08-22 15:27:47.639138: I external/tsl/tsl/platform/cloud/gcs_file_system.cc:855] GCS cache max size = 0 ; block size = 67108864 ; max staleness = 0
2024-08-22 15:27:47.639228: I external/tsl/tsl/platform/cloud/ram_file_block_cache.h:64] GCS file block cache is disabled
2024-08-22 15:27:47.639236: I external/tsl/tsl/platform/cloud/gcs_file_system.cc:895] GCS DNS cache is disabled, because GCS_RESOLVE_REFRESH_SECS = 0 (or is not set)
2024-08-22 15:27:47.639242: I external/tsl/tsl/platform/cloud/gcs_file_system.cc:925] GCS additional header DISABLED. No environment variable set.
2024-08-22 15:27:47.639249: I external/tsl/tsl/platform/cloud/gcs_file_system.cc:306] GCS RetryConfig: init_delay_time_us = 1000000 ; max_delay_time_us = 32000000 ; max_retries = 10
2024-08-22 15:27:47.639255: I external/tsl/tsl/platform/cloud/gcs_file_system.cc:306] GCS RetryConfig: init_delay_time_us = 1000000 ; max_delay_time_us = 32000000 ; max_retries = 10
2024-08-22 15:27:47.672163: I external/tsl/tsl/platform/default/dso_loader.cc:73] Successfully opened dynamic library libcudart.so.12
2024-08-22 15:27:47.673114: I external/tsl/tsl/platform/default/dso_loader.cc:73] Successfully opened dynamic library libcudart.so.12
>>> jax.devices()
2024-08-22 15:27:51.006815: I external/tsl/tsl/platform/default/dso_loader.cc:73] Successfully opened dynamic library libcudart.so.12
2024-08-22 15:27:51.008413: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'cu_threefry2x32_ffi'; platform=CUDA (canonical=cuda), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008435: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'cu_lu_pivots_to_permutation'; platform=CUDA (canonical=cuda), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008621: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'blas_strsm_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008635: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'blas_dtrsm_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008643: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'blas_ctrsm_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008651: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'blas_ztrsm_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008660: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_sgetrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008669: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_dgetrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008676: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_cgetrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008684: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_zgetrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008691: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_sgeqrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008698: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_dgeqrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008707: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_cgeqrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008715: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_zgeqrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008722: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_sorgqr_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008730: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_dorgqr_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008738: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_cungqr_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008749: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_zungqr_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008758: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_spotrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008765: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_dpotrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008773: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_cpotrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008781: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_zpotrf_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008788: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_sgesdd_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008796: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_dgesdd_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008804: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_cgesdd_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.008813: I external/xla/xla/ffi/ffi_api.cc:189] Register XLA FFI handler for 'lapack_zgesdd_ffi'; platform=Host (canonical=host), stages=[execute], command_buffer_compatible=false
2024-08-22 15:27:51.009524: I external/xla/xla/parse_flags_from_env.cc:204] For env var XLA_FLAGS found arguments:
2024-08-22 15:27:51.009540: I external/xla/xla/parse_flags_from_env.cc:206]   argv[0] = <argv[0]>
[CpuDevice(id=0)]

I am really not sure what else to do to debug why JAX isn't seeing the GPU from within the container. I can SSH onto the container itself on the cluster and run an interactive python shell as well as any other command (e.g. nvidia-smi which does show the GPU and its driver, which btw is 560 which has CUDA 12.6 support).

Do you have any suggestions for things I should try to figure out what's going on? I'm thinking parts of the system's CUDA must be somehow interfering with the container's pip CUDA and for some reason JAX isn't loading CUDA / cuDNN from site-packages, but I've no idea how to find out what JAX is actually trying to load or what exactly is failing.

yashk2810 · 2024-08-23T03:46:03Z

yashk2810
Aug 23, 2024
Collaborator

Which jax version do you have? Try the latest jax and jaxlib 0.4.31?

This was a regression introduced in 0.4.30 but fixed in 0.4.31

2 replies

yashk2810 Aug 23, 2024
Collaborator

If even that doesn't work, try nightly?

rainx0r Aug 23, 2024
Author

I'm on 0.4.31.

In the end the nightly did fix my issue in this particular case but I'm still curious as to what flags there are for logging and how I can better capture JAX logs manually (given how by default they don't seem to appear within the docker container). The problem ended up being what this commit fixes, so in this case JAX didn't even attempt to initialise cuda since it didn't detect any visible GPUs, so that explains the lack of logs for initialising the cuda backend I guess. But shouldn't I at least be seeing a warning that no GPU/TPU was found? Or was that removed?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Debugging JAX not working within a Docker container on HPC cluster #23195

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Debugging JAX not working within a Docker container on HPC cluster #23195

rainx0r Aug 22, 2024

Replies: 1 comment · 2 replies

yashk2810 Aug 23, 2024 Collaborator

yashk2810 Aug 23, 2024 Collaborator

rainx0r Aug 23, 2024 Author

rainx0r
Aug 22, 2024

Replies: 1 comment 2 replies

yashk2810
Aug 23, 2024
Collaborator

yashk2810 Aug 23, 2024
Collaborator

rainx0r Aug 23, 2024
Author