{ai}[foss/2023a] PyTorch v2.1.2 w/ CUDA 12.1.1 #19666

Merged
lexming merged 15 commits into easybuilders:develop from jfgrimm:20240122121636_new_pr_PyTorch212
Feb 21, 2024

Conversation

@jfgrimm
Member

@jfgrimm jfgrimm commented Jan 22, 2024

(created using eb --new-pr)

I haven't actually run the tests yet to see how many fail.

Contributor

@lexming lexming left a comment


I'm also working on this and made two additional patches to work around failing tests on our GPU nodes. Please check jfgrimm#3


Two more patches for PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb
@jfgrimm
Member Author

jfgrimm commented Jan 24, 2024

@lexming thanks!

@jfgrimm
Member Author

jfgrimm commented Jan 24, 2024

Test report by @jfgrimm
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gpu22.viking2.yor.alces.network - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 2 x NVIDIA NVIDIA H100 PCIe, 535.86.10, Python 3.6.8
See https://gist.github.com/jfgrimm/4d87e654aa4fb984e0026ec049fa6ace for a full test report.

Test failure ignored: 742 test failures, 82 test errors (out of 202417)

@casparvl
Contributor

casparvl commented Jan 27, 2024

Test report by @casparvl
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/casparvl/5ddd19d75644b6b53980b1fcd27a6d2f for a full test report.

From stdout:

WARNING: Test failure ignored: 15 test failures, 0 test errors (out of 211173)

From the EasyBuild log:

test_jit 1/1 failed!
distributed/_tensor/test_dtensor_ops 1/1 failed!
distributed/checkpoint/test_fsdp_optim_state 1/1 failed!
distributed/fsdp/test_fsdp_tp_integration 1/1 failed!
distributed/fsdp/test_shard_utils 1/1 failed!
distributed/tensor/parallel/test_tp_random_state 1/1 failed!
test_cpp_extensions_aot_ninja 1/1 failed!
test_cpp_extensions_aot_no_ninja 1/1 failed!
test_fake_tensor 1/1 failed!
test_jit_legacy 1/1 failed!
test_jit_profiling 1/1 failed!
test_nn 1/1 failed!

@lexming
Contributor

lexming commented Jan 29, 2024

Test report by @lexming
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node406.hydra.os - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7282 16-Core Processor @ 2.80GHz, 1 x NVIDIA NVIDIA A100-PCIe-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/lexming/c89fa474ef534acc3cd170bcff5c7aca for a full test report.

== 2024-01-27 00:21:20,756 pytorch.py:467 WARNING 4 test failures, 0 test errors (out of 211207):
distributed/_tensor/test_dtensor_ops 1/1 (3 failed, 190 passed, 43 skipped, 409 xfailed, 6 rerun)
test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)
  • The failed test_nn case is TestNN::test_Conv1d_pad_same_cuda_tf32, due to an overly tight error tolerance. Seems minor.
  • The failed test_dtensor_ops case is TestDTensorOpsCPU::test_dtensor_op_db_nn_functional_pad_circular_cpu_float32, which shows much uglier errors. See https://gist.github.com/lexming/252f06cb5a2642082ba634dbe841f7f3 . I could not find any related upstream issue though. The only positive is that it is a single test of the dtensor suite, so my guess is that this test is just bad and there is no fundamental flaw in the code.

@lexming
Contributor

lexming commented Jan 29, 2024

Test report by @lexming
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node252.hydra.os - Linux Rocky Linux 8.8, x86_64, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 1 x NVIDIA NVIDIA P100-PCIe, 545.23.08, Python 3.6.8
See https://gist.github.com/lexming/576add53a0b1c736da9137bd25c6a966 for a full test report.

== 2024-01-25 01:57:32,190 pytorch.py:467 WARNING 4 test failures, 0 test errors (out of 210938):
functorch/test_eager_transforms 1/1 (1 failed, 343 passed, 3 skipped, 1 xfailed, 2 rerun)
distributed/_tensor/test_dtensor_ops 1/1 (3 failed, 190 passed, 43 skipped, 409 xfailed, 6 rerun)
  • The failed test_eager_transforms case is TestCompileTransformsCPU::test_compile_vmap_hessian_cpu, with an unexpected success. It seems to be an implementation error that fails to trigger some deprecation warnings. A very minor issue.
  • The failed test_dtensor_ops case is TestDTensorOpsCPU::test_dtensor_op_db_nn_functional_pad_circular_cpu_float32, the same story as with our A100 on AMD zen2. See #19666 (comment)

lexming
lexming previously approved these changes Jan 29, 2024
Contributor

@lexming lexming left a comment


On my side with just 4 test failures, this looks pretty good to me already.

@boegel
Member

boegel commented Jan 29, 2024

@Flamefire Any input on this?

@jfgrimm
Member Author

jfgrimm commented Jan 29, 2024

I haven't had a chance yet to look into the 742 test failures / 82 test errors with 2x H100 on zen3.

@jfgrimm
Member Author

jfgrimm commented Jan 29, 2024

Test report by @jfgrimm
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gpu13.viking2.yor.alces.network - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7643 48-Core Processor, 3 x NVIDIA NVIDIA A40, 535.86.10, Python 3.6.8
See https://gist.github.com/jfgrimm/29713f8c8160dff849eb5301a5f26fd3 for a full test report.

740 test failures, 82 test errors (out of 202410)

@boegel boegel added this to the 4.x milestone Jan 31, 2024
@jfgrimm
Member Author

jfgrimm commented Jan 31, 2024

Test report by @jfgrimm
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gpu21.viking2.yor.alces.network - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA H100 PCIe, 535.86.10, Python 3.6.8
See https://gist.github.com/jfgrimm/814d5b4c5870b81f8757e7f5398bcba8 for a full test report.

27 test failures, 78 test errors (out of 210834)
From the log:

Failed tests (suites/files):
  test_optim 1/1 (2 failed, 182 passed, 2 skipped, 4 rerun)
  distributed/_tensor/test_dtensor_ops 1/1 (3 failed, 190 passed, 43 skipped, 409 xfailed, 6 rerun)
  distributed/fsdp/test_fsdp_flatten_params 1/1 (7 failed, 4 passed, 15 rerun)
  distributed/fsdp/test_fsdp_input 1/1 (1 failed, 1 passed, 4 rerun)
  distributed/fsdp/test_fsdp_mixed_precision 1/1 (2 failed, 2 passed, 61 skipped, 6 rerun)
  distributed/fsdp/test_fsdp_unshard_params 1/1 (1 failed, 1 passed, 12 skipped, 2 rerun)
  distributed/optim/test_zero_redundancy_optimizer 1/1 (3 failed, 9 passed, 30 skipped, 9 rerun)
  distributed/pipeline/sync/skip/test_gpipe 1/1 (13 errors, 26 rerun)
  distributed/pipeline/sync/skip/test_leak 1/1 (8 errors, 16 rerun)
  distributed/pipeline/sync/test_bugs 1/1 (1 skipped, 3 errors, 6 rerun)
  distributed/pipeline/sync/test_inplace 1/1 (2 xfailed, 1 error, 2 rerun)
  distributed/pipeline/sync/test_pipe 1/1 (1 passed, 3 skipped, 52 errors, 104 rerun)
  distributed/pipeline/sync/test_transparency 1/1 (1 error, 2 rerun)
  test_fake_tensor 1/1 (2 failed, 87 passed, 1 xfailed, 4 rerun)
  test_nn 1/1 (1 failed, 2789 passed, 137 skipped, 3 xfailed, 2 rerun)
  distributed/rpc/cuda/test_tensorpipe_agent 1/1 (1 unit test(s) failed)
  distributed/rpc/test_faulty_agent 1/1 (1 unit test(s) failed)
  distributed/rpc/test_share_memory 1/1 (1 unit test(s) failed)
  distributed/test_c10d_nccl 1/1 (1 unit test(s) failed)
  distributed/test_store 1/1 (1 unit test(s) failed)
  distributed/fsdp/test_wrap 1/1
  distributed/test_dynamo_distributed 1/1
  distributed/test_inductor_collectives 1/1

@boegel
Member

boegel commented Feb 1, 2024

Test report by @boegel
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
node3309.joltik.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 535.154.05, Python 3.6.8
See https://gist.github.com/boegel/3ce3ca8e311f6ced76a39f1c59785126 for a full test report.

edit:

5 test failures, 0 test errors (out of 210904)

Failed tests (suites/files):
distributed/_tensor/test_dtensor_ops 1/1 (3 failed, 190 passed, 43 skipped, 409 xfailed, 6 rerun)
test_fake_tensor 1/1 (2 failed, 86 passed, 1 skipped, 1 xfailed, 4 rerun)
+ test_cpp_extensions_aot_ninja 1/1
+ test_cpp_extensions_aot_no_ninja 1/1

@jfgrimm jfgrimm marked this pull request as ready for review February 1, 2024 11:32
@boegel
Member

boegel commented Feb 1, 2024

So, test results across a variety of systems summarized:

  • on AMD Milan (EPYC 7413) with 2x NVIDIA H100 (viking @ York): 742 test failures, 82 test errors
  • on Intel Ice Lake with 4x A100 (40GB) (snellius @ SURF): 15 test failures, 0 test errors
  • on AMD Rome with 1x NVIDIA A100 (hydra @ VUB): 4 test failures, 0 test errors
  • on Intel Sandy Bridge with 1x NVIDIA P100 (hydra @ VUB): 4 test failures, 0 test errors
  • on AMD Milan (EPYC 7643) with 3x NVIDIA A40 (viking @ York): 740 test failures, 82 test errors
  • on AMD Milan (EPYC 7413) with 1x NVIDIA H100 (viking @ York): 27 test failures, 78 test errors
  • on Intel Cascade Lake with 1 V100 (joltik @ HPC-UGent): 5 test failures

So even in the worst case, 99.6% of all tests pass...

@jfgrimm I'm strongly in favor of setting max_failed_tests = 20 in this easyconfig including a reference to this comment, and then merging it. Dealing with the failing tests can be done in subsequent PRs, if it's worth the trouble...
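As a concrete illustration, the suggested change would be a one-line addition to the easyconfig. `max_failed_tests` is the real EasyBuild parameter; the fragment below is only a sketch, with the rest of the file omitted:

```python
# Illustrative excerpt of PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb, not the
# full easyconfig; only the proposed added line is shown.

# Tolerate a small number of failing tests; see the per-system summary of
# observed failure counts in the discussion of this PR.
max_failed_tests = 20
```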

@jfgrimm
Member Author

jfgrimm commented Feb 1, 2024

ohhhh damn I forgot to add cuDNN, magma and NCCL... 🤦

@jfgrimm
Member Author

jfgrimm commented Feb 21, 2024

@lexming done, I'll open an issue to revisit the failed tests

#19946

Contributor

@lexming lexming left a comment


LGTM (yet again)

@casparvl I'll merge this now; if any other changes are needed to fix your build issues, they can be done in a follow-up PR.

@lexming
Contributor

lexming commented Feb 21, 2024

Merging, thanks @jfgrimm and @Flamefire and all the testers!

@lexming lexming merged commit 2d588e8 into easybuilders:develop Feb 21, 2024
@lexming lexming modified the milestones: 4.x, release after 4.9.0 Feb 21, 2024
@casparvl
Contributor

Still no idea why my builds are failing... The only thing I keep seeing consistently is

g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

but no clue why. If anyone has ideas, I'd love to hear :D Were there fixes to the toolchain at some point? Maybe I should try to recompile that...

Anyway, good to get this merged, no point in making it wait for something that is clearly specific to my machine :)

@lexming
Contributor

lexming commented Feb 27, 2024

@boegelbot please test @ jsc-zen3
EB_ARGS="--ignore-test-failure"

@boegelbot
Collaborator

@lexming: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=19666 EB_ARGS="--ignore-test-failure" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_19666 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3681

Test results coming soon (I hope)...

Details

- notification for comment with ID 1966533488 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/17acbb2d3429ba625892079f4850409d for a full test report.

@emdrago

emdrago commented Feb 27, 2024

Test report by @emdrago
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
gput052 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 x NVIDIA NVIDIA GeForce RTX 2080 Ti, 535.54.03, Python 3.9.13
See https://gist.github.com/emdrago/3729452d9411286fe8d15bd627eae272 for a full test report.

@lexming
Contributor

lexming commented Feb 28, 2024

@boegelbot please test @ generoso
EB_ARGS="--ignore-test-failure"

@boegelbot
Collaborator

@lexming: Request for testing this PR well received on login1

PR test command 'EB_PR=19666 EB_ARGS="--ignore-test-failure" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_19666 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12984

Test results coming soon (I hope)...

Details

- notification for comment with ID 1968459391 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/f040f52bccc71a5fa2891ccd853806da for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
i8013 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.8.13
See https://gist.github.com/Flamefire/84b9710d5f445175d4b92c7d6d7f4f69 for a full test report.

@schiotz
Contributor

schiotz commented Mar 12, 2024

We tried building this here at DTU, but it failed with 9 failing tests. Should it not pass if fewer than 50 tests fail? That part does not seem to work, and I can see that the test builds above were all done with --ignore-test-failure.

Is it a not-yet-implemented feature (in EB 5.0, perhaps) that up to 50 tests may fail, or is there some bug such that the max_failed_tests = 50 line in the easyconfig has no effect?

@Flamefire
Contributor

We tried building this here at DTU, but it failed with 9 failing tests. Should it not pass if fewer than 50 tests fail? That part does not seem to work, and I can see that the test builds above were all done with --ignore-test-failure.

Is it a not-yet-implemented feature (in EB 5.0, perhaps) that up to 50 tests may fail, or is there some bug such that the max_failed_tests = 50 line in the easyconfig has no effect?

If it doesn't say anything about allowed failures in the output, then this is a different issue. E.g. in my test report above I have:

ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:126 in __init__): 7 test failures, 0 test errors (out of 211121):
Failed tests (suites/files):
distributed/checkpoint/test_fsdp_optim_state 1/1 (2 failed, 4 rerun)
distributed/fsdp/test_fsdp_tp_integration 1/1 (1 failed, 4 passed, 2 rerun)
distributed/fsdp/test_shard_utils 1/1 (2 failed, 1 passed, 4 rerun)
distributed/tensor/parallel/test_tp_random_state 1/1 (1 failed, 2 rerun)
test_nn 1/1 (1 failed, 2804 passed, 122 skipped, 3 xfailed, 2 rerun)
+ test_torch 1/1 (at easybuild/easyblocks/p/pytorch.py:507 in test_step)

The + prefix on test_torch is a hint that this test was not counted toward the failures. In fact this test fails entirely and crashes on my machine, which is why there is no test failure count information. To be safe, the run is considered a failure without comparing against the allowed failures. We should be more explicit in the log messages here.
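The distinction can be sketched in Python. This is illustrative only, not the actual easyblock code (the function name and regex are invented): suites that report a per-test breakdown are summed and compared against max_failed_tests, while suites listed without counts (the + entries) fail the run unconditionally:

```python
import re

# Matches the "N failed" part of a per-suite breakdown such as
# "test_nn 1/1 (1 failed, 2804 passed, ...)"; crashed suites have no breakdown.
FAILED_RE = re.compile(r"\((\d+) failed")

def evaluate_test_report(suite_lines, max_failed_tests):
    """Return (ok, message) for a list of failed-suite summary lines."""
    counted = 0
    crashed = []
    for line in suite_lines:
        match = FAILED_RE.search(line)
        if match:
            counted += int(match.group(1))
        else:
            # No per-test counts: the suite crashed outright, so we cannot
            # compare against max_failed_tests and must treat the run as failed.
            crashed.append(line.split()[0])
    if crashed:
        return False, "suites crashed without counts: " + ", ".join(crashed)
    if counted > max_failed_tests:
        return False, f"{counted} test failures > max_failed_tests ({max_failed_tests})"
    return True, f"{counted} test failures tolerated"

report = [
    "test_nn 1/1 (1 failed, 2804 passed, 122 skipped, 3 xfailed, 2 rerun)",
    "test_torch 1/1",  # crashed entirely: no failure counts available
]
ok, msg = evaluate_test_report(report, max_failed_tests=50)
```

With the crashed `test_torch` entry present, `ok` is `False` even though the counted failures are far below the limit, which matches the behaviour described above.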

@schiotz
Contributor

schiotz commented Mar 12, 2024

Thank you, @Flamefire

We are having

== 2024-03-11 23:32:29,846 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:126 in __init__): 9 test failures, 0 test errors (out of 211087):
Failed tests (suites/files):
inductor/test_compiled_autograd 1/1 (1 failed, 130 passed, 114 skipped, 2 rerun)
distributed/checkpoint/test_fsdp_optim_state 1/1 (2 failed, 4 rerun)
distributed/fsdp/test_fsdp_tp_integration 1/1 (1 failed, 4 passed, 2 rerun)
distributed/fsdp/test_shard_utils 1/1 (2 failed, 1 passed, 4 rerun)
distributed/tensor/parallel/test_tp_random_state 1/1 (1 failed, 2 rerun)
test_nn 1/1 (2 failed, 2803 passed, 122 skipped, 3 xfailed, 4 rerun)
+ test_cpp_extensions_aot_ninja 1/1
+ test_cpp_extensions_aot_no_ninja 1/1 (at easybuild/easyblocks/p/pytorch.py:451 in test_step)
== 2024-03-11 23:32:29,847 build_log.py:267 INFO ... (took 9 hours 44 mins 59 secs)

Now we are trying to install with --ignore-test-failure; I don't know whether that will also fail, or whether doing so is a bad idea.

@Flamefire
Contributor

Yes, so test_cpp_extensions_* is failing/crashing. This sounds like something is wrong in your setup (e.g. a broken/incompatible Ninja module). Check the log for details (search for e.g. "test_cpp_extensions_aot_ninja" from the bottom up) and then (try to) decide whether --ignore-test-failure is a bad idea ;-) It should let the installation pass, however.
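As a side note, the "search from the bottom up" advice can be scripted. Here is a small illustrative helper (hypothetical, not part of EasyBuild) that returns the last mention of a test name in a log together with some surrounding context:

```python
def last_mention(log_text, needle, context=2):
    """Return the last block of lines containing `needle`, or None if absent."""
    lines = log_text.splitlines()
    # Scan from the bottom up: the final mention of a failing test is usually
    # closest to the actual error output in an EasyBuild log.
    for i in range(len(lines) - 1, -1, -1):
        if needle in lines[i]:
            lo = max(0, i - context)
            return "\n".join(lines[lo:i + context + 1])
    return None

log = "line A\nrunning test_cpp_extensions_aot_ninja\nninja: build stopped\nline D"
snippet = last_mention(log, "test_cpp_extensions_aot_ninja", context=1)
```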

@schiotz
Contributor

schiotz commented Mar 13, 2024

@Flamefire

Thanks for your help. It is really strange; it looks like nvcc fails to compile some C++ template code. Could there be some recent update to something that we have missed, e.g. could we have installed Ninja too early?

Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /home/modules/software/CUDA/12.1.1/bin/nvcc  -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/TH -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/THC -I/home/modules/software/CUDA/12.1.1/include -I/dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/self_compiler_include_dirs_test -I/home/modules/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c -c /dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/torch_library.cu -o /dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/build/temp.linux-x86_64-cpython-311/torch_library.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1017"' -DTORCH_EXTENSION_NAME=torch_library -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_80,code=sm_80 -ccbin gcc -std=c++17
FAILED: /dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/build/temp.linux-x86_64-cpython-311/torch_library.o
/home/modules/software/CUDA/12.1.1/bin/nvcc  -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/TH -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/THC -I/home/modules/software/CUDA/12.1.1/include -I/dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/self_compiler_include_dirs_test -I/home/modules/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c -c /dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/torch_library.cu -o /dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/build/temp.linux-x86_64-cpython-311/torch_library.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1017"' -DTORCH_EXTENSION_NAME=torch_library -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_80,code=sm_80 -ccbin gcc -std=c++17
/home/modules/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/detail/../cast.h: In function typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&):
/home/modules/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/detail/../cast.h:45:120: error: expected template-name before < token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                        ^
/home/modules/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/detail/../cast.h:45:120: error: expected identifier before < token
/home/modules/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/detail/../cast.h:45:123: error: expected primary-expression before > token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                           ^
/home/modules/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/detail/../cast.h:45:126: error: expected primary-expression before ) token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                              ^
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/home/modules/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

@Flamefire
Contributor

Flamefire commented Mar 13, 2024

Yep, that rings a bell for me, and it is actually a good reason why I'm so eager to investigate test failures instead of dismissing them just because many/most are bad code. This is an issue with the pybind11 version we use (pybind11/2.11.1-GCCcore-12.3.0); I added a patch to fix it.
Try to rebuild your pybind11 from #19047 (included in EB 4.8.2+).

The relevant patch is https://github.com/easybuilders/easybuild-easyconfigs/blob/f9d0188424dd253a0718af1c61ba002f66072b46/easybuild/easyconfigs/p/pybind11/pybind11-2.10.3_fix-nvcc-compat.patch

I suppose you could also apply that to your installed pybind11 module manually

@Flamefire
Contributor

We tried building this here at DTU, but it failed with 9 failing tests. Should it not pass if fewer than 50 tests fail? That part does not seem to work, and I can see that the test builds above were all done with --ignore-test-failure.

Is it a not-yet-implemented feature (in EB 5.0, perhaps) that up to 50 tests may fail, or is there some bug such that the max_failed_tests = 50 line in the easyconfig has no effect?

I opened easybuilders/easybuild-easyblocks#3255 to improve the message in this case

@casparvl
Contributor

casparvl commented Mar 14, 2024

Just to let you both know: I hit the same issue as @schiotz, in that I thought max_failed_tests was being ignored. I also got the output:

Failed tests (suites/files):
test_quantization 1/1 (4 failed, 1043 passed, 57 skipped, 8 rerun)
test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)
+ test_cpp_extensions_aot_ninja 1/1
+ test_cpp_extensions_aot_no_ninja 1/1

So I'll try to rebuild pybind11 and see if that fixes things for me. Thanks for the hint, @Flamefire!

I think improving the error message is indeed a good idea. Your explanation here

The + prefix on test_torch is a hint that this test was not counted toward the failures. In fact this test fails entirely and crashes on my machine, which is why there is no test failure count information. To be safe, the run is considered a failure without comparing against the allowed failures. We should be more explicit in the log messages here.

is very clear, but without it the + symbol doesn't mean much, making it non-obvious why the installation is not proceeding. Thus, https://github.com/easybuilders/easybuild-easyblocks/pull/3255/files sounds very useful to me :)

@schiotz
Contributor

schiotz commented Mar 14, 2024

@casparvl @Flamefire

After rebuilding pybind11, PyTorch compiled without issues (finished during the night).

Thank you very much @Flamefire, we had built pybind11 a few weeks before your patch to it.

@emdrago

emdrago commented Apr 24, 2024

Test report by @emdrago
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gput052 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 x NVIDIA NVIDIA GeForce RTX 2080 Ti, 535.54.03, Python 3.9.18
See https://gist.github.com/emdrago/e864743fc5bc6ac8f04f3453afe8a0e6 for a full test report.
