{ai}[foss/2024a] PyTorch v2.9.1 w/ CUDA 12.6.0 #24365
Flamefire wants to merge 19 commits into easybuilders:develop
|
Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use |
Force-pushed from 9849937 to 15b85aa
|
Test report by @Flamefire |
|
Test report by @boegel |
|
I'm also seeing a crash with the |
|
That's why the checksum changed: there was a breaking change in Triton 3.5 and I updated the test script accordingly. It is in #24793.
|
I had to use tlparse 0.4.0 (also in a separate PR, #24882) as the older one isn't compatible with PyTorch's output; see pytorch/pytorch@92c2dae.
Not sure if this causes conflicts in EB. The alternative is to drop this dependency, as it is optional.
Force-pushed from 2b8bc42 to 64a4d67
|
Rebased to remove EasyConfigs present in develop from this branch. Also added 2 more patches to avoid remaining failures. |
|
Test report by @Flamefire |
Force-pushed from 01180ef to 381c028
|
Test report by @Flamefire |
…es: PyTorch-2.9.0_fix-nccl-test-env.patch, PyTorch-2.9.0_readd-support-for-nvidia-cutlass-python-package.patch
Force-pushed from 9e1d5c8 to 867d1a2
|
@Flamefire Can we sync this with |
|
@boegelbot please test @ jsc-zen3-a100 |
|
|
runtest = (
    ' TORCH_DISABLE_ADDR2LINE=1'
    ' TORCHINDUCTOR_CUTLASS_DIR=%(start_dir)s/third_party/cutlass'
@Flamefire Just wondering: a nvidia-cutlass dependency is not a (good) option for PyTorch 2.9.1?
I tried but the structure is different. PyTorch expects a literal git checkout -.-
At least there is nothing to be built, just a plain copy
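For readers unfamiliar with the easyconfig syntax in the `runtest` excerpt above, here is a minimal sketch of how those lines combine. Adjacent string literals inside parentheses are joined by Python into one string, and `%(start_dir)s` is a template that EasyBuild fills in with the source directory. The variable names and the path `/tmp/build/pytorch` are hypothetical, and plain `%` formatting is used only to mimic the substitution; the real `runtest` value continues beyond the lines shown in the diff.

```python
# Sketch only: mirrors the two lines visible in the diff excerpt.
# The actual runtest string in the easyconfig has more content after these.
runtest_prefix = (
    ' TORCH_DISABLE_ADDR2LINE=1'
    ' TORCHINDUCTOR_CUTLASS_DIR=%(start_dir)s/third_party/cutlass'
)

# Hypothetical build location, chosen for illustration.
example_start_dir = '/tmp/build/pytorch'

# EasyBuild resolves %(start_dir)s itself; %-formatting here imitates that step.
resolved = runtest_prefix % {'start_dir': example_start_dir}
print(resolved)
```

The point of the sketch is that `TORCHINDUCTOR_CUTLASS_DIR` ends up pointing at the `third_party/cutlass` copy inside the PyTorch source tree, which matches the comment that PyTorch expects the layout of a git checkout.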
|
@Flamefire I guess we'll need an increase for |
|
@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)...
Details
- notification for comment with ID 3780659015 processed
Message to humans: this is just bookkeeping information for me,
|
Test report by @boegel |
|
Test report by @boegel |
Something's off there... The latest (below) has 2 missing suites:
|
|
Test report by @boegel |
|
Test report by @boegelbot |
|
Very similar result from our A100 and H100 systems. For H100: So Similar issue on |
Yes, see also my previous comment. Can you attach the log? I assume it is a common issue, and fixing it will likely resolve all/most of the failing tests.
|
Test report by @Flamefire |
|
Test report by @Flamefire |
|
Test report by @Flamefire |
Force-pushed from 9b793dd to a02dba5
|
Test report by @Flamefire |
(created using eb --new-pr)
Requires: