{tools}[GCCcore/14.3.0] PyTorch v2.9.1, parameterized v0.9.0, pytest-subtests v0.15.0, ... w/ CUDA 12.9.1 #24926
…tests-0.15.0-GCCcore-14.3.0.eb, PyTorch-2.9.1-foss-2025b-CUDA-12.9.1.eb, unittest-xml-reporting-3.2.0-GCCcore-14.3.0.eb and patches: PyTorch-1.12.1_add-hypothesis-suppression.patch, PyTorch-1.7.0_disable-dev-shm-test.patch, PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch, PyTorch-2.1.0_remove-test-requiring-online-access.patch, PyTorch-2.6.0_show-test-duration.patch, PyTorch-2.6.0_skip-test_segfault.patch, PyTorch-2.7.0_avoid_caffe2_test_cpp_jit.patch, PyTorch-2.7.1_avoid-caffe2-sandcastle-test-lib.patch, PyTorch-2.7.1_skip-test_data_parallel_rnn.patch, PyTorch-2.7.1_skip-test_gds_fails_in_ci.patch, PyTorch-2.7.1_skip-test_mixed_mm_exhaustive_dtypes.patch, PyTorch-2.7.1_skip-tests-requiring-SM90.patch, PyTorch-2.7.1_suport-64bit-BARs.patch, PyTorch-2.7.1_tolerance-test_partial_flat_weights.patch, PyTorch-2.9.0_disable-test_nan_assert.patch, PyTorch-2.9.0_enable-symbolizer-in-test_workspace_allocation_error.patch, PyTorch-2.9.0_fix-attention-squeeze.patch, PyTorch-2.9.0_fix-FP16-CPU-tests-in-test_torchinductor_opinfo.patch, PyTorch-2.9.0_fix-nccl-test-env.patch, PyTorch-2.9.0_fix-test_exclude_padding.patch, PyTorch-2.9.0_fix-test_version_error.patch, PyTorch-2.9.0_honor-XDG_CACHE_HOME.patch, PyTorch-2.9.0_increase-tolerance-in-test_transformers.patch, PyTorch-2.9.0_remove-faulty-close.patch, PyTorch-2.9.0_revert-pybind11-3-change.patch, PyTorch-2.9.0_skip-test_benchmark_on_non_zero_device.patch, PyTorch-2.9.0_skip-test_convolution1-on-H100.patch, PyTorch-2.9.0_skip-test_inductor_all_gather_into_tensor_coalesced.patch, PyTorch-2.9.0_skip-test_original_aten_preserved_pad_mm.patch, PyTorch-2.9.0_skip-test_override-without-CUDA.patch, PyTorch-2.9.0_skip-test_unbacked_reduction.patch, PyTorch-2.9.0_skip-tests-requiring-CUDA-12.8.patch, PyTorch-2.9.0_skip-unexpected-success-in-test_fake_export.patch, PyTorch-2.9.1_skip-RingFlexAttentionTest.patch
Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use |
Are you using the latest easyblock? It is missing this commit from easybuilders/easybuild-easyblocks#3803
2025b uses GCC 14, which has new warnings. See pytorch/pytorch#166873. Patch added. Seems to only affect ARM.
Oh, it is a C file. Updated the patch to also add it to the C flags.
Looks like I need to set those values earlier. Can you try again?
Actual failure was an internal GCC compiler error: Test report by @Thyre
Failure may be caused by this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121027 There was a PR which should have worked around this, but seemingly the fix doesn't work?
Maybe we need to patch |
That is not included in this (or any) release yet. I'll add it to the patch list
Would be an option, not sure if it is worth it: This EC has been included since EB 5.1.0, although we did that in the past
We might be able to work around the ICE on aarch64 by just fixing a bit of broken code in PyTorch. This is the part that causes the ICE, so if we just use the GCC implementation, which is available in GCC 14, we might get further. I haven't let the full build run after including the header (using the GCC from #25090), so I cannot fully confirm whether that's sufficient. Edit: Unfortunately not, as this would require C++20. PyTorch uses C++17 😕
@Flamefire, can you add pytorch/pytorch@8fd5093 to this PR? Hopefully we get a bit further on aarch64 then...
Done. Added to all 3 PRs
Test report by @Flamefire
Test report by @Thyre
Looks like a significant number of Still not merged though 😕 |
I've seen issues with
Can you show a couple such failures, possibly grouped if they look the same?
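A minimal sketch of how such a grouping could be done, in case it helps. It assumes the log contains pytest-style short summary lines of the form `FAILED <test id> - <message>`; the actual EasyBuild/PyTorch test log may be formatted differently. The names `FAILED_LINE`, `normalize` and `group_failures` are hypothetical helpers, not part of EasyBuild or PyTorch.

```python
import re
from collections import defaultdict

# Assumption: failures appear as pytest short-summary lines,
# e.g. "FAILED test_foo.py::TestX::test_y - AssertionError: ..."
FAILED_LINE = re.compile(r"^FAILED\s+(?P<test>\S+)\s+-\s+(?P<msg>.*)$")


def normalize(msg):
    """Mask values that vary between otherwise identical messages."""
    # Hex addresses first, so their digits are not treated as plain numbers
    msg = re.sub(r"0x[0-9a-fA-F]+", "<hex>", msg)
    # Standalone numbers only; dtype names such as float32 are left untouched
    msg = re.sub(r"\b\d+(?:\.\d+)?(?:e[+-]?\d+)?\b", "<num>", msg)
    return msg.strip()


def group_failures(log_text):
    """Map each normalized failure message to the list of tests showing it."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            groups[normalize(match.group("msg"))].append(match.group("test"))
    return dict(groups)


def summarize(groups):
    """Return (count, message) pairs, most frequent failure first."""
    return sorted(((len(tests), msg) for msg, tests in groups.items()), reverse=True)
```

Running `summarize(group_failures(open(path).read()))` on a log file (with `path` pointing at it) would then list the most common failure messages first, which makes a shared root cause easier to spot.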
I caught this from the snippet in the output log here: https://gist.github.com/Thyre/25e6d77bf7117f14c118bf0dd1a3e70f#file-pytorch-2-9-1-foss-2025b-cuda-12-9-1_partial-log-L449
I'll need to check if I still have the full log available. In the worst case, I have to re-run the build 🙈
I see, yes the float32 issue sticks out. However I don't think that PyTorch MR is related: That deals with invalid conversions such as However I hope those failures have a single, common cause. If we find it, that would fix >700 tests at once :-) Maybe cross-check with the PyPI package:
Argh, just wanted to have a test installation (skipping the tests) to better inspect the failing tests 😕 Test report by @Thyre
The Great that they do not skip those tests ... |
For CUTLASS, I see some tests failing with e.g.: |
I guess we want to make test cases executable: easybuilders/easybuild-framework#5118
That might be checking counters of usages that don't happen on non-AVX2. I guess e.g. at |
I unfortunately stopped the tests before the run fully finished, but I can redo them. Hopefully I manage to do that before my vacation.
Test report by @Flamefire
(created using eb --new-pr)
Includes:
It makes sense to merge #24365 first, as any changes there need to be reflected here. But this allows testing both in parallel.