{ai}[gfbf/2024a] jax v0.6.2 w/ CUDA 12.6.0 #24141
boegel merged 3 commits into easybuilders:develop from
Updated software

@boegelbot please test @ jsc-zen3-a100

@pavelToman: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 3376569345 processed
Message to humans: this is just bookkeeping information for me,

Test report by @boegelbot

Test report by @Flamefire

Test report by @pavelToman
One test error: missing

Hm, it detects
@boegel had the same problem; what was the solution for "missing
EDIT:
I think so

Test report by @pavelToman

Test report by @pavelToman

Both failures on accelgor and litleo come from 4 tests: It seems to be a problem with NCCL/CUDA on these clusters; I will investigate what is going on there.
EDIT: NCCL was built with cuda-compute-capabilities=8.6, so I am going to rebuild NCCL with the right cuda-compute-capabilities for each cluster (8.0 for accelgor and 9.0 for litleo).

Test report by @pavelToman
Also on litleo, all tests passed, but the installation failed just after

Test report by @pavelToman

Test report by @Flamefire

Test report by @Flamefire

@Flamefire Can you try again after fixing the missing

Test report by @Flamefire
Same error as before with the latest test. Asking upstream: jax-ml/jax#32799
Should I make a patch with this change in jax/_src/clusters/slurm_cluster.py?
I think that's best, yes, as others might run into the same issue.

@boegelbot please test @ jsc-zen3-a100

@pavelToman: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 3484937147 processed
Message to humans: this is just bookkeeping information for me,

Test report by @boegelbot

Test report by @boegel

Going in, thanks @pavelToman! |
(created using eb --new-pr)
resolves vscentrum/vsc-software-stack#477
requires: