{tools}[gfbf/2023a] jax v0.4.25 w/ CUDA 12.1.1#20119
lexming merged 25 commits into easybuilders:develop
|
Test report by @ThomasHoffmann77 |
|
Test report by @ThomasHoffmann77 |
|
Test report by @ThomasHoffmann77 |
|
Test report by @branfosj Same three failures as #19841 (comment) |
|
Test report by @branfosj Same three failures as #19841 (comment) |
|
Test report by @ThomasHoffmann77 |
|
I don't have a build node set up to upload test reports, but I did see this test error: |
|
I see you're all building with |
|
|
Test report by @ThomasHoffmann77 |
|
Test report by @ThomasHoffmann77 |
|
Test report by @Flamefire |
|
Test report by @Flamefire |
|
In both cases the failure is: Due to XLA comes with even more dependencies ( |
easybuild/easyconfigs/j/jax/jax-0.4.25-foss-2023a-CUDA-12.1.1.eb
Co-authored-by: Alexander Grund <Flamefire@users.noreply.github.com>
|
Test report by @Flamefire This is caused by a crash. It isn't really clear why it fails or in which test, as when I run the crashing test file manually it works. Attaching GDB shows |
easybuild/easyconfigs/j/jax/jax-0.4.25-foss-2023a-CUDA-12.1.1.eb
easybuild/easyconfigs/j/jax/jax-0.4.25-foss-2023a-CUDA-12.1.1.eb
easybuild/easyconfigs/m/ml_dtypes/ml_dtypes-0.3.2-foss-2023a.eb
easybuild/easyconfigs/j/jax/jax-0.4.25-foss-2023a-CUDA-12.1.1.eb
|
Test report by @casparvl |
|
Nope, using a single compute capability for the H100 (9.0) also fails in the same way. |
0.4.29 has been released in the meantime. It might be worth trying this version. |
|
First attempt at using 0.4.29 with this toolchain failed: |
|
Test report by @VRehnberg |
|
Test report by @VRehnberg |
|
Test report by @VRehnberg |
|
To get this to run on an H100 one needs a newer CUDA: 0.4.29 with foss/2023a and CUDA/12.5.0 passes all but a single test, and that test is itself broken. |
|
🎉 |
easybuild/easyconfigs/j/jax/jax-0.4.25-gfbf-2023a-CUDA-12.1.1.eb
fix local_extract_cmd according to @akesandgren's suggestion
|
Edit: Recent change in global pip.conf made build fail. Unrelated to this PR. |
Which one exactly and what was the error? Might be worth addressing in framework |
https://gist.github.com/VRehnberg/5be54199260e8a478002d18dd986725c |
|
Test report by @VRehnberg |
|
Test report by @VRehnberg |
Ah I remember that. There is a fix in the easyblocks: easybuilders/easybuild-easyblocks#3374 |
Thanks, I wasn't using that easyblock yet; I'm still on 4.9.2 easyblocks.
So the tests are not VRAM hungry. I never saw more than 1 GB used, and the T4 only has 16 GB in total, so that's a strict limit even if our monitoring missed a short spike. They do use about 27 GB of regular RAM though, in case that could be an issue. |
|
Test report by @lexming |
|
Merging, thanks a lot for keeping up with all the issues @ThomasHoffmann77 ! |
(created using eb --new-pr)
requires:
edit: requires bug fix in framework for "cp %s %(builddir)s/archives" to work as extract command:
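
For readers unfamiliar with why such a command needs care, here is a minimal sketch of a two-stage substitution. The command string mixes a named template (%(builddir)s) with a bare positional %s, so the two cannot be filled by one plain Python % operation. This is an illustration only, not EasyBuild's actual implementation; the function name resolve_extract_cmd and the example paths are hypothetical.

```python
def resolve_extract_cmd(cmd_template, builddir, source_path):
    """Fill an extract command that mixes %(builddir)s and a bare %s.

    Hypothetical helper for illustration; not part of EasyBuild.
    Assumes builddir itself contains no '%' characters.
    """
    # Stage 1: resolve only the named template, leaving the bare %s in place.
    stage1 = cmd_template.replace("%(builddir)s", builddir)
    # Stage 2: fill the remaining positional %s with the source file path.
    return stage1 % (source_path,)


if __name__ == "__main__":
    # Example paths are made up for the demonstration.
    print(resolve_extract_cmd(
        "cp %s %(builddir)s/archives",
        "/tmp/build",
        "/sources/example.tar.gz",
    ))
```

Doing the named substitution first means the positional %s is the only format specifier left when the source path is applied.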