{ai}[foss/2024a] PyTorch v2.9.1#25240

Merged
boegel merged 9 commits into easybuilders:develop from Flamefire:20260209112808_new_pr_PyTorch291
Feb 16, 2026

@Flamefire
Contributor

(created using eb --new-pr)

…2.9.1_avoid-multiprocess-tests-hanging-forever.patch, PyTorch-2.9.1_fix-hypothesis-deadline.patch, PyTorch-2.9.1_fix-iteration-in-fligh-reporter.patch, PyTorch-2.9.1_fix-test_dist2-decorators.patch, PyTorch-2.9.1_ignore-warning-incompatible-pointer-types.patch, PyTorch-2.9.1_skip-RingFlexAttentionTest.patch, PyTorch-2.9.1_skip-tests-requiring-SM90.patch
@github-actions github-actions bot added the 2024a (issues & PRs related to 2024a common toolchains) and update labels on Feb 9, 2026
@github-actions

github-actions bot commented Feb 9, 2026

Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use --review-pr (and --review-pr-filter / --review-pr-max) locally.

@verdurin
Member

verdurin commented Feb 9, 2026

Test report by @verdurin
FAILED
Build succeeded for 9 out of 75 (total: 1 hour 23 mins 10 secs) (1 easyconfigs in total)
centos-stream-9 - Linux CentOS Stream 9, x86_64, Intel Core Processor (Skylake, IBRS), Python 3.12.12
See https://gist.github.com/verdurin/5af165d366ecbdc7c2e46da782d9d3dd for a full test report.

@Flamefire
Contributor Author

@verdurin You ran out of space in /dev/shm

@verdurin
Member

verdurin commented Feb 9, 2026

> @verdurin You ran out of space in /dev/shm

Yes, retrying with a different buildpath.

@verdurin
Member

verdurin commented Feb 9, 2026

Test report by @verdurin
FAILED
Build succeeded for 60 out of 66 (total: 4 hours 25 mins 39 secs) (1 easyconfigs in total)
centos-stream-9 - Linux CentOS Stream 9, x86_64, Intel Core Processor (Skylake, IBRS), Python 3.12.12
See https://gist.github.com/verdurin/680019814a54689716ddca00a1953aff for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (total: 9 hours 5 mins 34 secs) (1 easyconfigs in total)
n1042.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21
See https://gist.github.com/Flamefire/5c3cec45065c57a3e52b6011b8695f33 for a full test report.

@pavelToman
Collaborator

pavelToman commented Feb 10, 2026

Test report by @pavelToman
FAILED
Build succeeded for 0 out of 1 (total: 1 secs) (1 easyconfigs in total)
node4202.shinx.os - Linux RHEL 9.6, x86_64, AMD EPYC 9654 96-Core Processor, Python 3.9.21
See https://gist.github.com/pavelToman/e0301fb91394f06f7d8261be83c9a4b4 for a full test report.

ERROR EasyBuild encountered an error: Couldn't find file PyTorch-check-cutlass.py anywhere

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (total: 32 hours 20 mins 39 secs) (1 easyconfigs in total)
i7149 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7702 64-Core Processor (zen2), Python 3.9.21
See https://gist.github.com/Flamefire/7e6a16034f1a8178e9547d8f9d9072d3 for a full test report.

@boegel
Member

boegel commented Feb 12, 2026

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=25240 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_25240 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9649

Test results coming soon (I hope)...

Details

- notification for comment with ID 3893382597 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (total: 8 hours 53 mins 27 secs) (1 easyconfigs in total)
n1362.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21
See https://gist.github.com/Flamefire/1df38205748b1740b55c8b661b0d05ac for a full test report.

@boegel
Member

boegel commented Feb 13, 2026

Test report by @boegel
FAILED
Build succeeded for 1 out of 2 (total: 12 hours 39 mins 57 secs) (1 easyconfigs in total)
node4202.shinx.os - Linux RHEL 9.6, x86_64, AMD EPYC 9654 96-Core Processor (zen4), Python 3.9.21
See https://gist.github.com/boegel/447df4610d81c5719e3ca2b8c4285827 for a full test report.

@Flamefire
Contributor Author

@boegel That's unfortunate. Any hint in the full log about those 2:

Could not count failed tests for the following test suites/files:
test_optim (Undetected or did not run properly)
test_sparse_csr (Undetected or did not run properly)

The rest look OK-ish: dynamo/test_error_messages is known, but I let it fail (something in PyBind11 where "PyCapsule" is now another string in a trace). distributed/elastic/test_control_plane is also known; I had it fixed for 2.6 but hadn't added the patch for the test here, now done.

@boegel
Member

boegel commented Feb 13, 2026

> @boegel That's unfortunate. Any hint in the full log about those 2:
>
> Could not count failed tests for the following test suites/files:
> test_optim (Undetected or did not run properly)
> test_sparse_csr (Undetected or did not run properly)

The following tests failed consistently: ['test/test_optim.py::TestOptimRenewedCPU::test_fused_large_tensor_Adagrad_cpu_float16', 'test/test_optim.py::TestOptimRenewedCPU::test_fused_large_tensor_AdamW_cpu_float16', 'test/test_optim.py::TestOptimRenewedCPU::test_fused_large_tensor_Adam_cpu_float16', 'test/test_optim.py::TestOptimRenewedCPU::test_fused_large_tensor_SGD_cpu_float16']
test_optim.py::TestOptimRenewedCPU::test_fused_large_tensor_Adagrad_cpu_float16 Got exit code -9 (SIGKILL)
test_optim.py::TestOptimRenewedCPU::test_fused_large_tensor_Adam_cpu_float16 Got exit code -9 (SIGKILL)
The following tests failed consistently: ['test/test_sparse_csr.py::TestSparseCompressedCPU::test_invalid_input_csr_large_cpu']
test_sparse_csr.py::TestSparseCompressedCPU::test_invalid_input_csr_large_cpu Got exit code -9 (SIGKILL)

This was with 16 cores, 30GB of RAM available, which may be too tight?

edit: I've kickstarted another test with more RAM available...

@Flamefire
Contributor Author

SIGKILL does indeed sound like it was OOM killed, or Slurm memory limits exceeded.

test_fused_large_tensor is decorated with @largeTensorTest("64GB"), which checks psutil.virtual_memory().available >= 64 * 1024**3.
However, this does not take cgroup limits into account, so in a job that doesn't request all of the node's memory the check gives the wrong answer.

test_invalid_input_csr_large_cpu has @largeTensorTest("30GB", "cpu"), so similar
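
To illustrate the mismatch described above, here is a minimal sketch of a cgroup-aware memory check. It assumes cgroup v2 mounted at /sys/fs/cgroup. The function names (unified_cgroup_path, cgroup_memory_limit, enough_memory) are hypothetical and not part of PyTorch or psutil, and memory already in use inside the cgroup is ignored for brevity.

```python
from pathlib import Path
from typing import Optional


def parse_limit(text: str) -> Optional[int]:
    """Parse a cgroup v2 `memory.max` value; "max" means no limit."""
    value = text.strip()
    return None if value == "max" else int(value)


def unified_cgroup_path(proc_cgroup_text: str) -> Optional[str]:
    """Return the cgroup v2 path from the contents of /proc/self/cgroup."""
    for line in proc_cgroup_text.splitlines():
        # cgroup v2 entries have the form "0::/some/path"
        if line.startswith("0::"):
            return line[3:]
    return None


def cgroup_memory_limit(cgroup_path: str,
                        root: Path = Path("/sys/fs/cgroup")) -> Optional[int]:
    """Smallest `memory.max` found between `cgroup_path` and `root`.

    A limit set on any ancestor also applies, so every level is checked.
    """
    limits = []
    current = root / cgroup_path.lstrip("/")
    while True:
        try:
            limit = parse_limit((current / "memory.max").read_text())
        except (OSError, ValueError):
            limit = None  # file missing (e.g. root cgroup) or unreadable
        if limit is not None:
            limits.append(limit)
        if current == root:
            break
        current = current.parent
    return min(limits) if limits else None


def enough_memory(required: int, host_available: int,
                  limit: Optional[int]) -> bool:
    """Compare `required` bytes against the host figure AND the cgroup limit."""
    effective = host_available if limit is None else min(host_available, limit)
    return effective >= required
```

On a real node, host_available would come from psutil.virtual_memory().available and the cgroup path from unified_cgroup_path(Path("/proc/self/cgroup").read_text()). With a 30 GiB job limit on a large node, enough_memory(64 * 1024**3, ...) returns False, which is the answer a "64GB" check would need in order to skip the test instead of letting it get OOM-killed.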

@boegel
Member

boegel commented Feb 13, 2026

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (total: 12 hours 28 mins 50 secs) (1 easyconfigs in total)
node4248.shinx.os - Linux RHEL 9.6, x86_64, AMD EPYC 9654 96-Core Processor (zen4), Python 3.9.21
See https://gist.github.com/boegel/9cff4bad37695d8295a8183634b7f5d6 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (total: 38 hours 19 mins 7 secs) (1 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.23
See https://gist.github.com/boegelbot/40de126ab3a4ae4bd98df86f72383ec3 for a full test report.

@boegel boegel added this to the next release (5.2.1) milestone Feb 16, 2026
Member

@boegel boegel left a comment

lgtm

@boegel
Member

boegel commented Feb 16, 2026

Going in, thanks @Flamefire!

@boegel boegel merged commit 5948725 into easybuilders:develop Feb 16, 2026
8 checks passed
@Flamefire Flamefire deleted the 20260209112808_new_pr_PyTorch291 branch February 16, 2026 13:05
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (total: 63 hours 52 mins 51 secs) (1 easyconfigs in total)
i8012 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/e03864093f3cf3679507459ad950d557 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (total: 36 hours 33 mins 33 secs) (1 easyconfigs in total)
c26 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/859f63b1d89886c0a258fbe49a2d6961 for a full test report.

