{ai}[gfbf/2024a] jax v0.4.35, ml_dtypes v0.5.0 w/ CUDA 12.6.0 WIP#21924
ThomasHoffmann77 wants to merge 36 commits into easybuilders:develop
… jax-0.4.35_easyblock_compat.patch, jax-0.4.35_fix-pybind11-systemlib_cupti.patch
Updated software
- comment bazel 7 problem. - temporarily switch off tests.
alt dep pybind11
Temp. mv Pybind11 from builddep to dep
mv pybind11 to builddependencies
Where is the easyconfig for Clang-18.1.8-gfbf-2024a-CUDA-12.6.0.eb?
|
| ] | ||
|
|
||
| dependencies = [ | ||
| ('Java', '11.0.20', '', SYSTEM), |
@boegel The build of Bazel 6.5.0 fails with both Java/21 and Java/21.0.2.
cuda_compute_capabilities = ["5.0", "6.0", "6.1", "7.0", "7.5", "8.0", "8.6", "9.0"]

builddependencies = [
    # ('Bazel', '7.4.1'), TODO: problems with @@local_config_python//:py3_runtime:
Hmm, it's unfortunate we can't use Bazel 7.4.1...
Do we fully understand what's going on here, is it a fundamental incompatibility?
@boegel: I am not a Bazel expert, but I think it is a fundamental incompatibility, since the latest jax 0.5.2 still uses Bazel v6.5.0: https://github.com/jax-ml/jax/blob/ce224293b1a7d9b39b5d9194d429b54f38faf6fe/.bazelversion#L1
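To make the version check above repeatable, here is a minimal sketch that reads the Bazel version pinned in a source tree's .bazelversion file and compares its major version with a candidate Bazel. The function names are hypothetical, and treating "same major version" as compatible is only an assumption for illustration, not something Bazel or jax guarantees.

```python
from pathlib import Path


def pinned_bazel_version(source_dir):
    """Return the version string from <source_dir>/.bazelversion (first non-empty line)."""
    text = Path(source_dir, ".bazelversion").read_text()
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    raise ValueError("empty .bazelversion file")


def same_major(pinned, candidate):
    """Heuristic (assumption): a candidate Bazel is acceptable if the major versions match."""
    return pinned.split(".")[0] == candidate.split(".")[0]
```

For a jax checkout pinned to 6.5.0, `same_major(pinned_bazel_version(src), "7.4.1")` would return False, which matches the observation in this thread that Bazel 7.4.1 is not usable for this jax version.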
@boegel Jax 0.6.0 is using Bazel 7.4.1. Since AF3 is not merged yet, it might be worth dropping jax 0.4.34 and using 0.6.0 instead.
I got the following error (when running the AlphaFold3 test suite): libdevice.10.bc is located at I suspect this function: https://github.com/jax-ml/jax/blob/jax-v0.4.34/jax/_src/lib/__init__.py#L130-L138
This patch got rid of the libdevice error for me (for some reason I couldn't find Thomas' repo listed to open a PR against (commit)):
@VRehnberg Is it sufficient to patch only the Python code? I have another patch that I have not uploaded yet. It modifies
Sorry, I don't understand what you're asking. The root issue is that
@VRehnberg I switched to 0.4.35 and added a patch to find libdevice.10.bc relative to $CUDA_HOME.
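As an illustration of looking up libdevice.10.bc relative to $CUDA_HOME, here is a minimal sketch. It is not the patch from this PR. The helper name is hypothetical, and the nvvm/libdevice subdirectory is assumed from the usual CUDA toolkit layout.

```python
import os
from pathlib import Path


def find_libdevice(cuda_home=None):
    """Return the path to libdevice.10.bc under cuda_home, or None if it is not there.

    Falls back to the CUDA_HOME environment variable when cuda_home is not given.
    The nvvm/libdevice subdirectory is an assumption based on the usual toolkit layout.
    """
    cuda_home = cuda_home or os.environ.get("CUDA_HOME")
    if not cuda_home:
        return None
    candidate = Path(cuda_home) / "nvvm" / "libdevice" / "libdevice.10.bc"
    return candidate if candidate.is_file() else None
```

Returning None instead of raising lets a caller fall back to other search locations when CUDA_HOME is unset or points at an incomplete installation.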
@ThomasHoffmann77 Any updates on this?
@boegel We have this jax 0.4.35 running with AF3 at EMBL. I have not yet worked further on an update to jax 0.6.0. The current PR still downloads lots of Bazel packages at build time. Some more critical review and testing would be beneficial.
To me it looks like the same problem as with TensorFlow, CUPTI and XLA: easybuilders/easybuild-easyblocks#3765
Likely. You can try the same patch that I used for TF: https://github.com/easybuilders/easybuild-easyconfigs/pull/22921/files#diff-0c447a6b5b271f5000a9d56d61038dba7d149db0435eaf83bc91ead482a47c5f
I just created a PR for jax-0.6.2 with CUDA-12.6.0:
(created using eb --new-pr)
requires:
TODO:
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: libdevice not found at ./libdevice.10.bc -> export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CUDA_HOME
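The workaround in the TODO above can also be applied from Python, as long as XLA_FLAGS is set before jax is imported. The following is a minimal sketch. The helper name is hypothetical, and replacing any existing --xla_gpu_cuda_data_dir entry instead of keeping it is a design choice made here, not something the PR prescribes.

```python
import os


def with_cuda_data_dir(xla_flags, cuda_home):
    """Return xla_flags with --xla_gpu_cuda_data_dir pointing at cuda_home.

    Any existing --xla_gpu_cuda_data_dir entry is dropped; other flags are kept in order.
    """
    prefix = "--xla_gpu_cuda_data_dir="
    kept = [flag for flag in xla_flags.split() if not flag.startswith(prefix)]
    kept.append(prefix + cuda_home)
    return " ".join(kept)


cuda_home = os.environ.get("CUDA_HOME")
if cuda_home:
    # Must happen before the first "import jax" so that XLA sees the flag.
    os.environ["XLA_FLAGS"] = with_cuda_data_dir(os.environ.get("XLA_FLAGS", ""), cuda_home)
```

This only helps scripts that set the variable themselves. A fix in the installed module, or a patch like the one discussed in this thread, avoids requiring every user to do so.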