{ai}[foss/2024a] PyTorch v2.9.1 w/ CUDA 12.6.0 #24365

Open
Flamefire wants to merge 19 commits into easybuilders:develop from
Flamefire:20251024183337_new_pr_PyTorch290

Conversation

@Flamefire
Contributor

@Flamefire Flamefire commented Oct 24, 2025

@Flamefire Flamefire marked this pull request as draft October 24, 2025 16:33
@github-actions

github-actions bot commented Oct 24, 2025

Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use --review-pr (and --review-pr-filter / --review-pr-max) locally.

@Thyre Thyre added the 2024a issues & PRs related to 2024a common toolchains label Oct 25, 2025
@Flamefire Flamefire marked this pull request as ready for review December 9, 2025 11:31
@Flamefire Flamefire changed the title {ai}[foss/2024a] PyTorch v2.9.0 w/ CUDA 12.6.0 {ai}[foss/2024a] PyTorch v2.9.1 w/ CUDA 12.6.0 Dec 9, 2025
@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from 9849937 to 15b85aa Compare December 9, 2025 11:45
@github-actions github-actions bot added the new label Dec 9, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (total: 27 hours 27 mins 43 secs) (1 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/50cdd13305fd9a33c6140c223aeab6cd for a full test report.

@boegel
Member

boegel commented Dec 13, 2025

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 4 out of 6 (total: 8 mins 45 secs) (6 easyconfigs in total)
node3907.accelgor.os - Linux RHEL 9.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/e06f98f956452bcb8a132f816e55927b for a full test report.

@boegel
Member

boegel commented Dec 13, 2025

I'm also seeing a crash with the triton_test.py script:

== FAILED: Installation ended unsuccessfully: Sanity check failed: sanity check command TRITON_HOME=$TMPDIR/eb-triton_home python
/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/test/triton_test.py 8.0 failed with exit code 1 (output: Traceback (most recent call last):
  File "/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/test/triton_test.py", line 13, in <module>
    src = triton.compiler.ASTSource(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/lib/python3.12/site-packages/triton/compiler/compiler.py", line 67, in __init__
    for k in self.signature.keys():
             ^^^^^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'keys'

@Flamefire
Contributor Author

That's why the checksum changed: Triton 3.5 introduced a breaking change, and I updated the test script accordingly. The updated script is in #24793.
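
The traceback above shows `.keys()` being called on the signature, so the value passed in has to be a mapping rather than a string. A minimal sketch of that failure mode follows. `FakeASTSource` is a hypothetical stand-in that only mirrors the loop from the traceback; it is not Triton's API, and the key and type names in the dict are illustrative assumptions.

```python
class FakeASTSource:
    """Hypothetical stand-in that only mirrors the loop seen in the traceback."""

    def __init__(self, signature):
        self.signature = signature
        # Same access pattern as the failing line in compiler.py: needs a mapping.
        self.arg_names = [k for k in self.signature.keys()]


def try_build(signature):
    """Return the collected keys, or the error message if the signature is rejected."""
    try:
        return FakeASTSource(signature).arg_names
    except AttributeError as err:
        return str(err)


# A string signature fails with the same message as in the sanity check output.
print(try_build("*fp32,*fp32,i32"))
# A dict signature passes this check (names and types are assumptions).
print(try_build({"x_ptr": "*fp32", "y_ptr": "*fp32", "n": "i32"}))
```

This only illustrates why a string no longer works; the real test script change is the one referenced in the comment above.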

@Flamefire
Contributor Author

I had to use tlparse 0.4.0 (added in a separate PR, #24882) because the older version isn't compatible with the output PyTorch now produces, see pytorch/pytorch@92c2dae

The lowest tlparse version that works is 0.3.42.

Not sure if this causes conflicts in EB. The alternative is to drop this dependency, as it is optional.
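
The version constraint mentioned here (0.3.42 as the lowest working tlparse release) can be expressed as a simple tuple comparison. This is only a sketch: `parse_version` and `is_new_enough` are hypothetical helper names, and 0.3.41 is used purely as an illustrative "too old" value.

```python
def parse_version(version):
    """Turn a dotted version string such as '0.3.42' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# Lowest tlparse version reported to work in this thread.
MIN_TLPARSE = parse_version("0.3.42")


def is_new_enough(version):
    """True if the given tlparse version is at least the reported minimum."""
    return parse_version(version) >= MIN_TLPARSE


for candidate in ("0.3.41", "0.3.42", "0.4.0"):
    print(candidate, is_new_enough(candidate))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexicographically, where "0.3.5" would sort after "0.3.42".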

@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from 2b8bc42 to 64a4d67 Compare December 16, 2025 17:04
@Flamefire
Contributor Author

Rebased to drop the easyconfigs from this branch that are already present in develop.

Also added 2 more patches to avoid the remaining failures.

@github-actions github-actions bot removed the new label Dec 16, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 5 (total: 4 mins 48 secs) (5 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/6f2e5c7a020ec72dff9d2f5c8220fba5 for a full test report.

@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch 2 times, most recently from 01180ef to 381c028 Compare December 17, 2025 09:09
@boegel boegel added this to the release after 5.2.0 milestone Dec 18, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (total: 24 hours 42 mins 21 secs) (5 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/a6bd885643124f5fac4864060e0e18cd for a full test report.

…es: PyTorch-2.9.0_fix-nccl-test-env.patch, PyTorch-2.9.0_readd-support-for-nvidia-cutlass-python-package.patch
@boegel
Member

boegel commented Jan 20, 2026

@Flamefire Can we sync this with develop after merge of #23923?

@boegel
Member

boegel commented Jan 21, 2026

@boegelbot please test @ jsc-zen3-a100
CORE_CNT=16


runtest = (
' TORCH_DISABLE_ADDR2LINE=1'
' TORCHINDUCTOR_CUTLASS_DIR=%(start_dir)s/third_party/cutlass'
Member

@Flamefire Just wondering: is an nvidia-cutlass dependency not a (good) option for PyTorch 2.9.1?

Contributor Author

I tried, but the directory structure of the package is different. PyTorch expects a literal git checkout -.-

At least there is nothing to build; it is just a plain copy.
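
To make the effect of the `%(start_dir)s` placeholder in the snippet under review concrete, here is a small sketch that expands it with ordinary Python %-formatting. It assumes the template is resolved like a plain format string with a dict; EasyBuild's own template handling involves more than this, and the build directory used below is a hypothetical path.

```python
# Environment settings quoted from the easyconfig snippet under review.
env_template = (
    ' TORCH_DISABLE_ADDR2LINE=1'
    ' TORCHINDUCTOR_CUTLASS_DIR=%(start_dir)s/third_party/cutlass'
)


def expand(template, start_dir):
    """Substitute the %(start_dir)s placeholder using Python %-formatting."""
    return template % {'start_dir': start_dir}


# Hypothetical build directory, for illustration only.
print(expand(env_template, '/tmp/build/pytorch-v2.9.1'))
```

The expanded variable points at the CUTLASS copy inside the PyTorch source tree, which matches the remark that PyTorch expects the layout of a git checkout.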

@boegel
Member

boegel commented Jan 21, 2026

@Flamefire I guess we'll need an increase for max_failed_tests here too, like we did in #23923 (w.r.t. test results on V100 system)?
I can look into submitting a test report from a V100 system to see how "bad" it is...

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24365 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24365 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9464

Test results coming soon (I hope)...

Details

- notification for comment with ID 3780659015 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Jan 23, 2026

Test report by @boegel
FAILED
Build succeeded for 5 out of 6 (total: 34 hours 38 mins 31 secs) (3 easyconfigs in total)
node4302.litleo.os - Linux RHEL 9.6, x86_64, AMD EPYC 9454P 48-Core Processor (zen4), 1 x NVIDIA NVIDIA H100 NVL, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/aeb6cd33ac5a5e07c7b5039b6906a789 for a full test report.

@boegel
Member

boegel commented Jan 23, 2026

Test report by @boegel
FAILED
Build succeeded for 5 out of 6 (total: 37 hours 16 mins 29 secs) (3 easyconfigs in total)
node3902.accelgor.os - Linux RHEL 9.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 590.48.01, Python 3.9.21
See https://gist.github.com/boegel/64f44a20710dd30f7c235e8a420f3b09 for a full test report.

@Flamefire
Contributor Author

Flamefire commented Jan 23, 2026

export/test_export_opinfo (431 failed, 201 passed, 40 skipped, 0 errors)

Something's off there...

The latest (below) has 2 missing suites:

Missing: inductor/test_flex_attention, inductor/test_flex_decoding

@boegel
Member

boegel commented Jan 23, 2026

Test report by @boegel
FAILED
Build succeeded for 5 out of 6 (total: 45 hours 50 mins 52 secs) (3 easyconfigs in total)
node3302.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/7326c174113e98762f39ea9c1da74aa9 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 4 out of 5 (total: 49 hours 21 mins 18 secs) (3 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 590.44.01, Python 3.9.23
See https://gist.github.com/boegelbot/18f3b5048f76bfd992289b7dfd84b363 for a full test report.

@boegel
Member

boegel commented Jan 24, 2026

Very similar result from our A100 and H100 systems. For H100:

WARNING: 438 test failures, 0 test errors (out of 269217):
        distributed/elastic/test_control_plane (1 failed, 8 passed, 0 skipped, 0 errors)
        distributed/optim/test_zero_redundancy_optimizer (2 failed, 10 passed, 30 skipped, 0 errors)
        distributed/test_store (1 failed, 40 passed, 12 skipped, 0 errors)
        dynamo/test_error_messages (1 failed, 40 passed, 0 skipped, 0 errors)
        dynamo/test_package (1 failed, 31 passed, 17 skipped, 0 errors)
        export/test_export_opinfo (431 failed, 201 passed, 40 skipped, 0 errors)
        inductor/test_fxir_backend (1 failed, 37 passed, 0 skipped, 0 errors)

So export/test_export_opinfo is the bad guy here...
We already have a patch for this test.

Similar issue on jsc-zen3:

== 2026-01-23 20:31:52,673 build_log.py:440 WARNING 436 test failures, 0 test errors (out of 269029):
	distributed/optim/test_zero_redundancy_optimizer (2 failed, 10 passed, 30 skipped, 0 errors)
	export/test_export_opinfo (431 failed, 201 passed, 40 skipped, 0 errors)
	functorch/test_ops (2 failed, 7495 passed, 2725 skipped, 0 errors)
	inductor/test_fxir_backend (1 failed, 37 passed, 0 skipped, 0 errors)
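
A quick way to confirm which suite dominates is to parse the per-suite lines and add up the failures. The sketch below uses the H100 summary lines quoted above; the regular expression and the helper `failures_per_suite` are assumptions about the line format, not part of EasyBuild.

```python
import re

# Per-suite lines copied from the H100 summary above.
SUMMARY = """\
distributed/elastic/test_control_plane (1 failed, 8 passed, 0 skipped, 0 errors)
distributed/optim/test_zero_redundancy_optimizer (2 failed, 10 passed, 30 skipped, 0 errors)
distributed/test_store (1 failed, 40 passed, 12 skipped, 0 errors)
dynamo/test_error_messages (1 failed, 40 passed, 0 skipped, 0 errors)
dynamo/test_package (1 failed, 31 passed, 17 skipped, 0 errors)
export/test_export_opinfo (431 failed, 201 passed, 40 skipped, 0 errors)
inductor/test_fxir_backend (1 failed, 37 passed, 0 skipped, 0 errors)
"""

# Assumed line format: "<suite> (<n> failed, <n> passed, <n> skipped, <n> errors)".
LINE_RE = re.compile(
    r"^\s*(?P<suite>\S+) \((?P<failed>\d+) failed, (?P<passed>\d+) passed, "
    r"(?P<skipped>\d+) skipped, (?P<errors>\d+) errors\)$"
)


def failures_per_suite(text):
    """Map each suite name to its number of failed tests."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match["suite"]] = int(match["failed"])
    return counts


counts = failures_per_suite(SUMMARY)
total = sum(counts.values())
worst = max(counts, key=counts.get)
print(total, worst, counts[worst])  # 438 export/test_export_opinfo 431
```

The per-suite failures add up to the 438 reported in the warning line, and 431 of them come from export/test_export_opinfo alone.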

@Flamefire
Contributor Author

> So export/test_export_opinfo is the bad guy here...

Yes, see also my previous comment. Can you attach the log? I assume there is a common cause, so that a single fix will resolve all or most of the failing tests.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 2 out of 3 (total: 9 hours 34 mins 53 secs) (3 easyconfigs in total)
n1270.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21
See https://gist.github.com/Flamefire/f845f5c9c51b5e3b91bed7ff4df881cf for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (total: 10 hours 24 mins 14 secs) (3 easyconfigs in total)
n1195.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21
See https://gist.github.com/Flamefire/7037cde389d6fd22527c67d0f2fad5d7 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (total: 24 hours 15 mins 20 secs) (3 easyconfigs in total)
c96 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/9032201f62e9b79550a110fe7b561779 for a full test report.

@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from 9b793dd to a02dba5 Compare February 12, 2026 15:37
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (total: 10 hours 7 mins 8 secs) (3 easyconfigs in total)
n1366.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21
See https://gist.github.com/Flamefire/2e975fd309b691ebb4b1976275e7e2c5 for a full test report.

Labels

2024a issues & PRs related to 2024a common toolchains update

4 participants