
{tools}[GCCcore/14.3.0] PyTorch v2.9.1, parameterized v0.9.0, pytest-subtests v0.15.0, ... w/ CUDA 12.9.1 #24926

Open

Flamefire wants to merge 23 commits into easybuilders:develop from Flamefire:20251218180340_new_pr_parameterized090

Conversation

@Flamefire (Contributor) commented Dec 18, 2025

(created using eb --new-pr)

Includes:

It makes sense to merge #24365 first, as any changes there need to be reflected here. But this allows testing both in parallel.

…tests-0.15.0-GCCcore-14.3.0.eb, PyTorch-2.9.1-foss-2025b-CUDA-12.9.1.eb, unittest-xml-reporting-3.2.0-GCCcore-14.3.0.eb and patches: PyTorch-1.12.1_add-hypothesis-suppression.patch, PyTorch-1.7.0_disable-dev-shm-test.patch, PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch, PyTorch-2.1.0_remove-test-requiring-online-access.patch, PyTorch-2.6.0_show-test-duration.patch, PyTorch-2.6.0_skip-test_segfault.patch, PyTorch-2.7.0_avoid_caffe2_test_cpp_jit.patch, PyTorch-2.7.1_avoid-caffe2-sandcastle-test-lib.patch, PyTorch-2.7.1_skip-test_data_parallel_rnn.patch, PyTorch-2.7.1_skip-test_gds_fails_in_ci.patch, PyTorch-2.7.1_skip-test_mixed_mm_exhaustive_dtypes.patch, PyTorch-2.7.1_skip-tests-requiring-SM90.patch, PyTorch-2.7.1_suport-64bit-BARs.patch, PyTorch-2.7.1_tolerance-test_partial_flat_weights.patch, PyTorch-2.9.0_disable-test_nan_assert.patch, PyTorch-2.9.0_enable-symbolizer-in-test_workspace_allocation_error.patch, PyTorch-2.9.0_fix-attention-squeeze.patch, PyTorch-2.9.0_fix-FP16-CPU-tests-in-test_torchinductor_opinfo.patch, PyTorch-2.9.0_fix-nccl-test-env.patch, PyTorch-2.9.0_fix-test_exclude_padding.patch, PyTorch-2.9.0_fix-test_version_error.patch, PyTorch-2.9.0_honor-XDG_CACHE_HOME.patch, PyTorch-2.9.0_increase-tolerance-in-test_transformers.patch, PyTorch-2.9.0_remove-faulty-close.patch, PyTorch-2.9.0_revert-pybind11-3-change.patch, PyTorch-2.9.0_skip-test_benchmark_on_non_zero_device.patch, PyTorch-2.9.0_skip-test_convolution1-on-H100.patch, PyTorch-2.9.0_skip-test_inductor_all_gather_into_tensor_coalesced.patch, PyTorch-2.9.0_skip-test_original_aten_preserved_pad_mm.patch, PyTorch-2.9.0_skip-test_override-without-CUDA.patch, PyTorch-2.9.0_skip-test_unbacked_reduction.patch, PyTorch-2.9.0_skip-tests-requiring-CUDA-12.8.patch, PyTorch-2.9.0_skip-unexpected-success-in-test_fake_export.patch, PyTorch-2.9.1_skip-RingFlexAttentionTest.patch
@github-actions bot added the 2025b (issues & PRs related to 2025b common toolchains) and update labels Dec 18, 2025
@github-actions

Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use --review-pr (and --review-pr-filter / --review-pr-max) locally.

@Thyre

This comment was marked as outdated.

@Thyre

This comment was marked as resolved.

@Flamefire (Contributor Author)

Test report by @Thyre
FAILED
Build succeeded for 3 out of 4 (total: 55 secs) (4 easyconfigs in total)
jrc0900.jureca - Linux Rocky Linux 9.6, AArch64, ARM UNKNOWN (neoverse_v2), 1 x NVIDIA NVIDIA GH200 480GB, 580.95.05, Python 3.9.21
See https://gist.github.com/Thyre/576f0dbeceb975733d860d97f16ca3fc for a full test report.

== 2025-12-19 10:19:27,773 build_log.py:233 ERROR EasyBuild encountered an error: Nothing found to replace 'if IS_CI:\n\s+# Add the option to generate XML test report.*' in test/run_test.py (at easybuild/tools/filetools.py:1861 in apply_regex_substitutions)

Are you using the latest easyblock? It looks like your easyblock installation is missing this commit from easybuilders/easybuild-easyblocks#3803.
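
For reference, a minimal sketch of the kind of call that produces this error (this is not the actual PyTorch easyblock code; the regex is copied from the log above and the replacement is a placeholder):

# Minimal sketch, assuming the easyblock edits test/run_test.py via EasyBuild's
# apply_regex_substitutions; the real easyblock may pass additional options
# (e.g. to match across lines).
from easybuild.tools.filetools import apply_regex_substitutions

regex_subs = [
    # In PyTorch 2.9.x this block no longer exists in test/run_test.py, so the
    # pattern matches nothing and EasyBuild raises the error shown above.
    (r"if IS_CI:\n\s+# Add the option to generate XML test report.*", ''),
]
apply_regex_substitutions('test/run_test.py', regex_subs)

The commit from easybuilders/easybuild-easyblocks#3803 presumably adjusts this substitution for the newer run_test.py layout.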

@Thyre

This comment was marked as outdated.

@Flamefire (Contributor Author)

2025b uses GCC 14, which introduces new warnings; see pytorch/pytorch#166873

Patch added. It seems to only affect ARM.

@Thyre

This comment was marked as outdated.

@Flamefire (Contributor Author)

Oh, it is a C file. Updated the patch to also add it to the C flags.

@Thyre

This comment was marked as outdated.

@Flamefire (Contributor Author)

Looks like I need to set those values earlier. Can you try again?

@Thyre (Collaborator) commented Dec 19, 2025

The actual failure was an internal GCC compiler error:

In file included from /dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/build/aten/src/ATen/native/cpu/Unfold2d.cpp.SVE256.cpp:1:
/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/aten/src/ATen/native/cpu/Unfold2d.cpp: In function ‘void at::native::{anonymous}::unfolded2d_acc_kernel(c10::ScalarType, void*, void*, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, bool)’:
/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/aten/src/ATen/native/cpu/Unfold2d.cpp:225:1: error: unrecognizable insn:
  225 | }
      | ^
(insn 1375 1374 1376 99 (set (reg:VNx16BI 3253)
        (unspec:VNx16BI [
                (reg:VNx16BI 3250)
                (reg:VNx8BI 3252)
                (const_vector:VNx4BI [
                        (const_int 0 [0]) repeated x8
                    ])
            ] UNSPEC_TRN1_CONV)) "/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/torch/headeronly/util/bit_cast.h":40:14 -1
     (nil))
during RTL pass: vregs
/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/aten/src/ATen/native/cpu/Unfold2d.cpp:225:1: internal compiler error: in extract_insn, at recog.cc:2812
0x7d30df _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
	../../gcc/rtl-error.cc:108
0x7d3113 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
	../../gcc/rtl-error.cc:116
0xec1d17 extract_insn(rtx_insn*)
	../../gcc/recog.cc:2812
0xc2a28b instantiate_virtual_regs_in_insn
	../../gcc/function.cc:1612
0xc2a28b instantiate_virtual_regs
	../../gcc/function.cc:1995
0xc2a28b execute
	../../gcc/function.cc:2042
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

Test report by @Thyre
FAILED
Build succeeded for 3 out of 4 (total: 17 mins 7 secs) (4 easyconfigs in total)
jrc0900.jureca - Linux Rocky Linux 9.6, AArch64, ARM UNKNOWN (neoverse_v2), 1 x NVIDIA NVIDIA GH200 480GB, 580.95.05, Python 3.9.21
See https://gist.github.com/Thyre/bdc1ee06d4f8b430f52f9c220b66e11f for a full test report.

@Thyre (Collaborator) commented Dec 19, 2025

@Flamefire (Contributor Author) commented Dec 19, 2025

The failure may be caused by this GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121027

There was a PR that should have worked around this, but apparently the fix doesn't take effect? See also:

* https://github.com/pytorch/pytorch/blob/f026b098e4319413db7d3fc1dbcb39dda69fcf0c/aten/src/ATen/native/cpu/Unfold2d.cpp#L172

* [Build error: unrecognizable insn with using gcc-14 on aarch64 pytorch/pytorch#157842](https://github.com/pytorch/pytorch/issues/157842)

That fix is not included in this (or any) release yet. I'll add it to the patch list.

Maybe we need to patch GCCcore/14.3.0 with this change? https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121027#c9

That would be an option, though I'm not sure it is worth it: that easyconfig has shipped since EasyBuild 5.1.0, although we have done such changes in the past.
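
If we went that route, it would just mean carrying the backported GCC fix as an extra patch in the GCCcore easyconfig, roughly like this (the patch file name is hypothetical):

# Hypothetical addition to the existing patches list in GCCcore-14.3.0.eb; the patch
# file name is made up and would carry the backported change from GCC PR 121027, comment 9.
patches = [
    'GCCcore-14.3.0_fix-aarch64-sve-unrecognizable-insn.patch',
]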

@Flamefire changed the title from "{tools}[GCCcore/14.3.0] parameterized v0.9.0, pytest-subtests v0.15.0, PyTorch v2.9.1, ... w/ CUDA 12.9.1" to "{tools}[GCCcore/14.3.0] PyTorch v2.9.1, parameterized v0.9.0, pytest-subtests v0.15.0, ... w/ CUDA 12.9.1" Dec 19, 2025
@boegel added this to the next release (5.2.1?) milestone Dec 31, 2025
@github-actions bot added the 2024a (issues & PRs related to 2024a common toolchains) label Jan 15, 2026
@github-actions bot removed the 2024a (issues & PRs related to 2024a common toolchains) label Jan 15, 2026
@Thyre (Collaborator) commented Jan 19, 2026

We might be able to work around the ICE on aarch64 by just fixing a bit of broken code in PyTorch.
This macro code always evaluates to 0, because neither <version> nor <bit> is included before checking the feature macro:

https://github.com/pytorch/pytorch/blob/d38164a545b4a4e4e0cf73ce67173f70574890b6/torch/headeronly/util/bit_cast.h#L7C1-L14C74

This is the part that causes the ICE, so if we just use the GCC implementation, which is available in GCC 14, we might get further. I haven't run the full build after including the header (using the GCC from #25090), so I cannot fully confirm whether that's sufficient.

Edit: Unfortunately not, as this would require C++20. PyTorch uses C++17 😕

@Thyre (Collaborator) commented Feb 10, 2026

@Flamefire, can you add pytorch/pytorch@8fd5093 to this PR? Hopefully we get a bit further on aarch64 then...

@Flamefire (Contributor Author)

Done. Added to all 3 PRs

@Flamefire (Contributor Author)

Test report by @Flamefire
FAILED
Build succeeded for 3 out of 4 (total: 26 hours 31 mins 47 secs) (4 easyconfigs in total)
c49 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/ecd32fab1a260714517c94e22d14d6af for a full test report.

@Thyre (Collaborator) commented Feb 11, 2026

Test report by @Thyre
FAILED
Build succeeded for 3 out of 4 (total: 21 hours 56 mins 37 secs) (4 easyconfigs in total)
jrc0901.jureca - Linux Rocky Linux 9.7, AArch64, ARM UNKNOWN (neoverse_v2), 1 x NVIDIA NVIDIA GH200 480GB, 590.48.01, Python 3.9.25
See https://gist.github.com/Thyre/74b4698312b58c90063c9a54a892c21e for a full test report.

@Thyre (Collaborator) commented Feb 11, 2026

A significant number of float32 tests failed on aarch64 / Neoverse V2. Something like this PR might be related: pytorch/pytorch#169937

Still not merged though 😕

@Flamefire (Contributor Author)

@Thyre

export/test_export_opinfo (431 failed, 201 passed, 40 skipped, 0 errors)
inductor/test_cpu_repro (4 failed, 211 passed, 526 skipped, 0 errors)
inductor/test_cpu_select_algorithm (58 failed, 31 passed, 1621 skipped, 0 errors)
inductor/test_cutlass_backend (8 failed, 142 passed, 2 skipped, 0 errors)
inductor/test_flex_attention (258 failed, 12 passed, 306 skipped, 0 errors)

I've seen issues with test_flex_attention including segfaults.
And @boegel has seen most of the failures in one report in export/test_export_opinfo too.

A significant number of float32 tests failed on aarch64 / Neoverse V2. Something like this PR might be related: pytorch/pytorch#169937

Can you show a couple of such failures, possibly grouped if they look the same?

@Thyre (Collaborator) commented Feb 11, 2026

I caught this from the snippet in the output log here: https://gist.github.com/Thyre/25e6d77bf7117f14c118bf0dd1a3e70f#file-pytorch-2-9-1-foss-2025b-cuda-12-9-1_partial-log-L449

I'll need to check if I still have the full log available. In the worst case, I'll have to re-run the build 🙈

@Flamefire (Contributor Author) commented Feb 11, 2026

I see, yes, the float32 issue sticks out. However, I don't think that PyTorch PR is related: it deals with invalid conversions such as static_cast<uint8_t>(float(-2)), which are undefined behavior and basically a usage error.

I do hope those failures have a single, common cause, though. If we find it, that would fix >700 tests at once :-)

Maybe cross-check with the PyPI package:

  • Load the same Python module
  • Create a virtual env
  • pip install torch==2.9.1 expecttest numpy
  • Extract pytorch 2.9.1 archive and cd to test folder
  • python run_test.py --pipe-logs --verbose --continue-through-error -i inductor/test_torchinductor_opinfo inductor/test_flex_attention inductor/test_cpu_cpp_wrapper

@Thyre (Collaborator) commented Feb 12, 2026

/usr/bin/bash: line 1: /tmp/eb-qpd30m2u/files_pr24926/p/PyTorch/PyTorch-check-cutlass.py: Permission denied

Argh, just wanted to have a test installation (skipping the tests) to better inspect the failing tests 😕


Test report by @Thyre
FAILED
Build succeeded for 3 out of 4 (total: 1 hour 10 mins 21 secs) (4 easyconfigs in total)
jrc0900.jureca - Linux Rocky Linux 9.7, AArch64, ARM UNKNOWN (neoverse_v2), 1 x NVIDIA NVIDIA GH200 480GB, 590.48.01, Python 3.9.25
See https://gist.github.com/Thyre/b0650190581947115ed40fa4d92f73f7 for a full test report.

@Thyre

This comment was marked as off-topic.

@Thyre (Collaborator) commented Feb 12, 2026

The float32 failures on GH200 seem to occur because of missing platform support:

  File "/p/project1/cswmanage/reuter1/EasyBuild/jedi/apps/software/PyTorch/2.9.1-foss-2025b-CUDA-12.9.1/lib/python3.13/site-packages/torch/_inductor/kernel/flex/flex_cpu.py", line 76, in lower_cpu
    raise NotImplementedError(
        "torch.compile on current platform is not supported for CPU."
    )
torch._inductor.exc.InductorError: LoweringException: NotImplementedError: torch.compile on current platform is not supported for CPU.

Great that they do not skip those tests ...
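
A rough stand-alone reproducer for that lowering failure, outside the test suite (only a sketch, assuming that compiling flex attention for CPU tensors goes through the same lower_cpu path):

# Minimal sketch: compile flex attention for CPU tensors. On platforms where inductor's
# CPU lowering for flex attention is unsupported (as in the log above), this raises
# LoweringException/NotImplementedError; on supported x86 CPUs it should compile.
import torch
from torch.nn.attention.flex_attention import flex_attention

q = torch.randn(1, 1, 128, 64)  # (batch, heads, seq_len, head_dim), float32 on CPU
k = torch.randn(1, 1, 128, 64)
v = torch.randn(1, 1, 128, 64)

compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v)
print(out.shape)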

@Thyre (Collaborator) commented Feb 12, 2026

For CUTLASS, I see some tests failing with e.g.:

FAILED [0.7841s] inductor/test_cutlass_backend.py::TestCutlassBackend::test_compilation_time_use_aoti_False - torch._inductor.exc.InductorError: NoValidChoicesError:

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"


To execute this test, run the following from the base repo dir:
    python test/inductor/test_cutlass_backend.py TestCutlassBackend.test_compilation_time_use_aoti_False

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

@Thyre (Collaborator) commented Feb 12, 2026

test_cpu_select_algorithm gives us actual failures with incorrect results, especially with bfloat16, all in the form of:

FAILED [0.5882s] inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_int8_woq_mm_batch_size_17_mid_dim_1_in_features_1024_out_features_64_cpu_bfloat16 - AssertionError: Scalars are not equal!

Expected 1 but got 0.
Absolute difference: 1
Relative difference: 1.0

To execute this test, run the following from the base repo dir:
    python test/inductor/test_cpu_select_algorithm.py TestSelectAlgorithmCPU.test_int8_woq_mm_batch_size_17_mid_dim_1_in_features_1024_out_features_64_cpu_bfloat16

@Flamefire (Contributor Author)

/usr/bin/bash: line 1: /tmp/eb-qpd30m2u/files_pr24926/p/PyTorch/PyTorch-check-cutlass.py: Permission denied

Argh, just wanted to have a test installation (skipping the tests) to better inspect the failing tests 😕

I guess we want to make test cases executable: easybuilders/easybuild-framework#5118

test_cpu_select_algorithm gives us actual failures with incorrect results, especially with bfloat16, all in the form of:

That might be checking counters for code paths that are not taken on non-AVX2 hardware, e.g. at self.assertEqual(counters["inductor"]["cpp_templated_kernel_counter"], 1).
Can you attach/send the log for me to have a closer look? I remember they had some skip-markers for that in some places.
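
To illustrate that hypothesis, the suspected check looks roughly like this (a paraphrase, not the exact code from test_cpu_select_algorithm.py):

# Paraphrased sketch of the suspected assertion: the test counts how often inductor
# selected the C++ templated GEMM kernel. On hardware where that template is never
# chosen (e.g. non-AVX2 / aarch64), the counter stays at 0 and the check fails with
# "Expected 1 but got 0", matching the log above.
from torch._dynamo.utils import counters

def assert_templated_kernel_used(test_case):
    test_case.assertEqual(counters["inductor"]["cpp_templated_kernel_counter"], 1)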

@Thyre (Collaborator) commented Feb 12, 2026

Can you attach/send the log for me to have a closer look? I remember they had some skip-markers for that in some places

I unfortunately stopped the tests before the run fully finished, but I can redo them. Hopefully I manage to do that before my vacation.

@Flamefire (Contributor Author)

I've seen issues with test_flex_attention including segfaults.
And @boegel has seen most of the failures in one report in export/test_export_opinfo too.

test_export_opinfo fails when there is exactly 1 GPU.
Both tests should be fixed with the latest patches.

@Flamefire (Contributor Author)

Test report by @Flamefire
FAILED
Build succeeded for 3 out of 4 (total: 10 hours 23 mins 14 secs) (4 easyconfigs in total)
n1026.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21
See https://gist.github.com/Flamefire/fa1ca5d23ad4e6c2566fcd58abc3fefc for a full test report.
