{ai}[foss/2023a] PyTorch v2.1.2 w/ CUDA 12.1.1 #19666

Merged
lexming merged 15 commits into easybuilders:develop from jfgrimm:20240122121636_new_pr_PyTorch212
Feb 21, 2024

Conversation

@jfgrimm
Member

@jfgrimm jfgrimm commented Jan 22, 2024

(created using eb --new-pr)

I haven't actually run the tests yet to see how many fail.

Contributor

@lexming lexming left a comment


I'm also working on this and made two additional patches to work around failing tests on our GPU nodes. Please check jfgrimm#3


Two more patches for PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb
@jfgrimm
Member Author

jfgrimm commented Jan 24, 2024

@lexming thanks!

@jfgrimm
Member Author

jfgrimm commented Jan 24, 2024

Test report by @jfgrimm
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gpu22.viking2.yor.alces.network - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 2 x NVIDIA NVIDIA H100 PCIe, 535.86.10, Python 3.6.8
See https://gist.github.com/jfgrimm/4d87e654aa4fb984e0026ec049fa6ace for a full test report.

Test failure ignored: 742 test failures, 82 test errors (out of 202417)

@casparvl
Contributor

casparvl commented Jan 27, 2024

Test report by @casparvl
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/casparvl/5ddd19d75644b6b53980b1fcd27a6d2f for a full test report.

From stdout:

WARNING: Test failure ignored: 15 test failures, 0 test errors (out of 211173)

From the EasyBuild log:

test_jit 1/1 failed!
distributed/_tensor/test_dtensor_ops 1/1 failed!
distributed/checkpoint/test_fsdp_optim_state 1/1 failed!
distributed/fsdp/test_fsdp_tp_integration 1/1 failed!
distributed/fsdp/test_shard_utils 1/1 failed!
distributed/tensor/parallel/test_tp_random_state 1/1 failed!
test_cpp_extensions_aot_ninja 1/1 failed!
test_cpp_extensions_aot_no_ninja 1/1 failed!
test_fake_tensor 1/1 failed!
test_jit_legacy 1/1 failed!
test_jit_profiling 1/1 failed!
test_nn 1/1 failed!

@lexming
Contributor

lexming commented Jan 29, 2024

Test report by @lexming
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node406.hydra.os - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7282 16-Core Processor @ 2.80GHz, 1 x NVIDIA NVIDIA A100-PCIe-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/lexming/c89fa474ef534acc3cd170bcff5c7aca for a full test report.

== 2024-01-27 00:21:20,756 pytorch.py:467 WARNING 4 test failures, 0 test errors (out of 211207):
distributed/_tensor/test_dtensor_ops 1/1 (3 failed, 190 passed, 43 skipped, 409 xfailed, 6 rerun)
test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)
  • The failed test_nn case is TestNN::test_Conv1d_pad_same_cuda_tf32, due to an overly tight error tolerance. Seems minor.
  • The failed test_dtensor_ops case is TestDTensorOpsCPU::test_dtensor_op_db_nn_functional_pad_circular_cpu_float32, which shows much uglier errors. See https://gist.github.com/lexming/252f06cb5a2642082ba634dbe841f7f3 . I could not find any related upstream issue though. The only positive is that it is a single test of the dtensor suite, so my guess is that this test is just bad and there is no fundamental flaw in the code.

@lexming
Contributor

lexming commented Jan 29, 2024

Test report by @lexming
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node252.hydra.os - Linux Rocky Linux 8.8, x86_64, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 1 x NVIDIA NVIDIA P100-PCIe, 545.23.08, Python 3.6.8
See https://gist.github.com/lexming/576add53a0b1c736da9137bd25c6a966 for a full test report.

== 2024-01-25 01:57:32,190 pytorch.py:467 WARNING 4 test failures, 0 test errors (out of 210938):
functorch/test_eager_transforms 1/1 (1 failed, 343 passed, 3 skipped, 1 xfailed, 2 rerun)
distributed/_tensor/test_dtensor_ops 1/1 (3 failed, 190 passed, 43 skipped, 409 xfailed, 6 rerun)
  • The failed test_eager_transforms case is TestCompileTransformsCPU::test_compile_vmap_hessian_cpu, with an unexpected success. It seems to be an implementation error that fails to trigger some deprecation warnings. A very minor issue.
  • The failed test_dtensor_ops case is TestDTensorOpsCPU::test_dtensor_op_db_nn_functional_pad_circular_cpu_float32, the same story as with our A100 on AMD zen2. See #19666 (comment)

lexming
lexming previously approved these changes Jan 29, 2024
Contributor

@lexming lexming left a comment


On my side with just 4 test failures, this looks pretty good to me already.

@boegel
Member

boegel commented Jan 29, 2024

@Flamefire Any input on this?

@jfgrimm
Member Author

jfgrimm commented Jan 29, 2024

I haven't had a chance yet to look into the 742 test failures / 82 test errors with 2x H100 on zen3.

@jfgrimm
Member Author

jfgrimm commented Jan 29, 2024

Test report by @jfgrimm
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gpu13.viking2.yor.alces.network - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7643 48-Core Processor, 3 x NVIDIA NVIDIA A40, 535.86.10, Python 3.6.8
See https://gist.github.com/jfgrimm/29713f8c8160dff849eb5301a5f26fd3 for a full test report.

740 test failures, 82 test errors (out of 202410)

@boegel boegel added this to the 4.x milestone Jan 31, 2024
@jfgrimm
Member Author

jfgrimm commented Jan 31, 2024

Test report by @jfgrimm
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gpu21.viking2.yor.alces.network - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA H100 PCIe, 535.86.10, Python 3.6.8
See https://gist.github.com/jfgrimm/814d5b4c5870b81f8757e7f5398bcba8 for a full test report.

27 test failures, 78 test errors (out of 210834)
From the log:

Failed tests (suites/files):
  test_optim 1/1 (2 failed, 182 passed, 2 skipped, 4 rerun)
  distributed/_tensor/test_dtensor_ops 1/1 (3 failed, 190 passed, 43 skipped, 409 xfailed, 6 rerun)
  distributed/fsdp/test_fsdp_flatten_params 1/1 (7 failed, 4 passed, 15 rerun)
  distributed/fsdp/test_fsdp_input 1/1 (1 failed, 1 passed, 4 rerun)
  distributed/fsdp/test_fsdp_mixed_precision 1/1 (2 failed, 2 passed, 61 skipped, 6 rerun)
  distributed/fsdp/test_fsdp_unshard_params 1/1 (1 failed, 1 passed, 12 skipped, 2 rerun)
  distributed/optim/test_zero_redundancy_optimizer 1/1 (3 failed, 9 passed, 30 skipped, 9 rerun)
  distributed/pipeline/sync/skip/test_gpipe 1/1 (13 errors, 26 rerun)
  distributed/pipeline/sync/skip/test_leak 1/1 (8 errors, 16 rerun)
  distributed/pipeline/sync/test_bugs 1/1 (1 skipped, 3 errors, 6 rerun)
  distributed/pipeline/sync/test_inplace 1/1 (2 xfailed, 1 error, 2 rerun)
  distributed/pipeline/sync/test_pipe 1/1 (1 passed, 3 skipped, 52 errors, 104 rerun)
  distributed/pipeline/sync/test_transparency 1/1 (1 error, 2 rerun)
  test_fake_tensor 1/1 (2 failed, 87 passed, 1 xfailed, 4 rerun)
  test_nn 1/1 (1 failed, 2789 passed, 137 skipped, 3 xfailed, 2 rerun)
  distributed/rpc/cuda/test_tensorpipe_agent 1/1 (1 unit test(s) failed)
  distributed/rpc/test_faulty_agent 1/1 (1 unit test(s) failed)
  distributed/rpc/test_share_memory 1/1 (1 unit test(s) failed)
  distributed/test_c10d_nccl 1/1 (1 unit test(s) failed)
  distributed/test_store 1/1 (1 unit test(s) failed)
  distributed/fsdp/test_wrap 1/1
  distributed/test_dynamo_distributed 1/1
  distributed/test_inductor_collectives 1/1

@boegel
Member

boegel commented Feb 1, 2024

Test report by @boegel
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
node3309.joltik.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 535.154.05, Python 3.6.8
See https://gist.github.com/boegel/3ce3ca8e311f6ced76a39f1c59785126 for a full test report.

edit:

5 test failures, 0 test errors (out of 210904)

Failed tests (suites/files):
distributed/_tensor/test_dtensor_ops 1/1 (3 failed, 190 passed, 43 skipped, 409 xfailed, 6 rerun)
test_fake_tensor 1/1 (2 failed, 86 passed, 1 skipped, 1 xfailed, 4 rerun)
+ test_cpp_extensions_aot_ninja 1/1
+ test_cpp_extensions_aot_no_ninja 1/1

@jfgrimm jfgrimm marked this pull request as ready for review February 1, 2024 11:32
@boegel
Member

boegel commented Feb 1, 2024

So, test results across a variety of systems summarized:

  • on AMD Milan (EPYC 7413) with 2x NVIDIA H100 (viking @ York): 742 test failures, 82 test errors
  • on Intel Ice Lake with 4x A100 (40GB) (snellius @ SURF): 15 test failures, 0 test errors
  • on AMD Rome with 1x NVIDIA A100 (hydra @ VUB): 4 test failures, 0 test errors
  • on Intel Sandy Bridge with 1x NVIDIA P100 (hydra @ VUB): 4 test failures, 0 test errors
  • on AMD Milan (EPYC 7643) with 3x NVIDIA A40 (viking @ York): 740 test failures, 82 test errors
  • on AMD Milan (EPYC 7413) with 1x NVIDIA H100 (viking @ York): 27 test failures, 78 test errors
  • on Intel Cascade Lake with 1 V100 (joltik @ HPC-UGent): 5 test failures

So even in the worst case, 99.6% of all tests pass...

@jfgrimm I'm strongly in favor of setting max_failed_tests = 20 in this easyconfig including a reference to this comment, and then merging it. Dealing with the failing tests can be done in subsequent PRs, if it's worth the trouble...
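As a concrete illustration, the suggested change would be a one-line addition to the easyconfig. `max_failed_tests` is the real EasyBuild parameter; the fragment below is only a sketch, with the rest of the file omitted:

```python
# Illustrative excerpt of PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb, not the
# full easyconfig; only the proposed added line is shown.

# Tolerate a small number of failing tests; see the per-system summary of
# observed failure counts in the discussion of this PR.
max_failed_tests = 20
```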

@jfgrimm
Member Author

jfgrimm commented Feb 1, 2024

ohhhh damn I forgot to add cuDNN, magma and NCCL... 🤦

@jfgrimm
Member Author

jfgrimm commented Feb 21, 2024

@lexming done, I'll open an issue to revisit the failed tests

#19946

Contributor

@lexming lexming left a comment


LGTM (yet again)

@casparvl I'll merge this now; if any other changes are needed to fix your build issues, they can be done in a follow-up PR.

@lexming
Contributor

lexming commented Feb 21, 2024

Merging, thanks @jfgrimm and @Flamefire and all the testers!

@lexming lexming merged commit 2d588e8 into easybuilders:develop Feb 21, 2024
@lexming lexming modified the milestones: 4.x, release after 4.9.0 Feb 21, 2024
@casparvl
Contributor

Still no idea why my builds are failing... The only thing I keep seeing consistently is

g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

but no clue why. If anyone has ideas, I'd love to hear :D Were there fixes to the toolchain at some point? Maybe I should try to recompile that...

Anyway, good to get this merged, no point in making it wait for something that is clearly specific to my machine :)

@lexming
Contributor

lexming commented Feb 27, 2024

@boegelbot please test @ jsc-zen3
EB_ARGS="--ignore-test-failure"

@boegelbot
Collaborator

@lexming: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=19666 EB_ARGS="--ignore-test-failure" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_19666 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3681

Test results coming soon (I hope)...

Details

- notification for comment with ID 1966533488 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/17acbb2d3429ba625892079f4850409d for a full test report.

@emdrago

emdrago commented Feb 27, 2024

Test report by @emdrago
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
gput052 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 x NVIDIA NVIDIA GeForce RTX 2080 Ti, 535.54.03, Python 3.9.13
See https://gist.github.com/emdrago/3729452d9411286fe8d15bd627eae272 for a full test report.

@lexming
Contributor

lexming commented Feb 28, 2024

@boegelbot please test @ generoso
EB_ARGS="--ignore-test-failure"

@boegelbot
Collaborator

@lexming: Request for testing this PR well received on login1

PR test command 'EB_PR=19666 EB_ARGS="--ignore-test-failure" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_19666 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12984

Test results coming soon (I hope)...

Details

- notification for comment with ID 1968459391 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/f040f52bccc71a5fa2891ccd853806da for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
i8013 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.8.13
See https://gist.github.com/Flamefire/84b9710d5f445175d4b92c7d6d7f4f69 for a full test report.

@schiotz
Contributor

schiotz commented Mar 12, 2024

We tried building this here at DTU, but it failed with 9 failing tests. Should it not pass if fewer than 50 tests fail? That part does not seem to work, and I can see that the test builds above were all done with --ignore-test-failure.

Is it a not-yet-implemented feature (in EB 5.0, perhaps) that up to 50 tests may fail, or is there some bug such that the max_failed_tests = 50 line in the easyconfig has no effect?

@Flamefire
Contributor

We tried building this here at DTU, but it failed with 9 failing tests. Should it not pass if fewer than 50 tests fail? That part does not seem to work, and I can see that the test builds above were all done with --ignore-test-failure.

Is it a not-yet-implemented feature (in EB 5.0, perhaps) that up to 50 tests may fail, or is there some bug such that the max_failed_tests = 50 line in the easyconfig has no effect?

If it doesn't say anything about allowed failures in the output, then this is a different issue. E.g. in my test report above I have:

ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:126 in __init__): 7 test failures, 0 test errors (out of 211121):
Failed tests (suites/files):
distributed/checkpoint/test_fsdp_optim_state 1/1 (2 failed, 4 rerun)
distributed/fsdp/test_fsdp_tp_integration 1/1 (1 failed, 4 passed, 2 rerun)
distributed/fsdp/test_shard_utils 1/1 (2 failed, 1 passed, 4 rerun)
distributed/tensor/parallel/test_tp_random_state 1/1 (1 failed, 2 rerun)
test_nn 1/1 (1 failed, 2804 passed, 122 skipped, 3 xfailed, 2 rerun)
+ test_torch 1/1 (at easybuild/easyblocks/p/pytorch.py:507 in test_step)

The + prefix on test_torch is a hint that this test was not counted toward the failures. In fact this test fails entirely and crashes on my machine, which is why there is no test failure count information. To be safe, the run is considered a failure without comparing against the allowed failures. We should be more explicit in the log messages here.
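The distinction can be sketched in Python. This is illustrative only, not the actual easyblock code (the function name and regex are invented): suites that report a per-test breakdown are summed and compared against max_failed_tests, while suites listed without counts (the + entries) fail the run unconditionally:

```python
import re

# Matches the "N failed" part of a per-suite breakdown such as
# "test_nn 1/1 (1 failed, 2804 passed, ...)"; crashed suites have no breakdown.
FAILED_RE = re.compile(r"\((\d+) failed")

def evaluate_test_report(suite_lines, max_failed_tests):
    """Return (ok, message) for a list of failed-suite summary lines."""
    counted = 0
    crashed = []
    for line in suite_lines:
        match = FAILED_RE.search(line)
        if match:
            counted += int(match.group(1))
        else:
            # No per-test counts: the suite crashed outright, so we cannot
            # compare against max_failed_tests and must treat the run as failed.
            crashed.append(line.split()[0])
    if crashed:
        return False, "suites crashed without counts: " + ", ".join(crashed)
    if counted > max_failed_tests:
        return False, f"{counted} test failures > max_failed_tests ({max_failed_tests})"
    return True, f"{counted} test failures tolerated"

report = [
    "test_nn 1/1 (1 failed, 2804 passed, 122 skipped, 3 xfailed, 2 rerun)",
    "test_torch 1/1",  # crashed entirely: no failure counts available
]
ok, msg = evaluate_test_report(report, max_failed_tests=50)
```

With the crashed `test_torch` entry present, `ok` is `False` even though the counted failures are far below the limit, which matches the behaviour described above.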

@schiotz
Contributor

schiotz commented Mar 12, 2024

Thank you, @Flamefire

We are having

== 2024-03-11 23:32:29,846 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:126 in __init__): 9 test failures, 0 test errors (out of 211087):
Failed tests (suites/files):
inductor/test_compiled_autograd 1/1 (1 failed, 130 passed, 114 skipped, 2 rerun)
distributed/checkpoint/test_fsdp_optim_state 1/1 (2 failed, 4 rerun)
distributed/fsdp/test_fsdp_tp_integration 1/1 (1 failed, 4 passed, 2 rerun)
distributed/fsdp/test_shard_utils 1/1 (2 failed, 1 passed, 4 rerun)
distributed/tensor/parallel/test_tp_random_state 1/1 (1 failed, 2 rerun)
test_nn 1/1 (2 failed, 2803 passed, 122 skipped, 3 xfailed, 4 rerun)
+ test_cpp_extensions_aot_ninja 1/1
+ test_cpp_extensions_aot_no_ninja 1/1 (at easybuild/easyblocks/p/pytorch.py:451 in test_step)
== 2024-03-11 23:32:29,847 build_log.py:267 INFO ... (took 9 hours 44 mins 59 secs)

Now we are trying to install with --ignore-test-failure; I don't know whether that will also fail, or whether doing so is a bad idea.

@Flamefire
Contributor

Yes, so test_cpp_extensions_* is failing/crashing. This sounds like something is wrong in your setup (e.g. a broken/incompatible Ninja module). Check the log for details (search for e.g. "test_cpp_extensions_aot_ninja" from the bottom up) and then (try to) decide whether --ignore-test-failure is a bad idea ;-) It should let the installation pass, however.
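As a side note, the "search from the bottom up" advice can be scripted. Here is a small illustrative helper (hypothetical, not part of EasyBuild) that returns the last mention of a test name in a log together with some surrounding context:

```python
def last_mention(log_text, needle, context=2):
    """Return the last block of lines containing `needle`, or None if absent."""
    lines = log_text.splitlines()
    # Scan from the bottom up: the final mention of a failing test is usually
    # closest to the actual error output in an EasyBuild log.
    for i in range(len(lines) - 1, -1, -1):
        if needle in lines[i]:
            lo = max(0, i - context)
            return "\n".join(lines[lo:i + context + 1])
    return None

log = "line A\nrunning test_cpp_extensions_aot_ninja\nninja: build stopped\nline D"
snippet = last_mention(log, "test_cpp_extensions_aot_ninja", context=1)
```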

@schiotz
Contributor

schiotz commented Mar 13, 2024

@Flamefire

Thanks for your help. It is really strange; it looks like nvcc fails to compile some C++ template code. Could there be some recent update to something that we have missed, e.g. could we have installed Ninja too early?

Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /home/modules/software/CUDA/12.1.1/bin/nvcc  -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/TH -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/THC -I/home/modules/software/CUDA/12.1.1/include -I/dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/self_compiler_include_dirs_test -I/home/modules/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c -c /dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/torch_library.cu -o /dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/build/temp.linux-x86_64-cpython-311/torch_library.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1017"' -DTORCH_EXTENSION_NAME=torch_library -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_80,code=sm_80 -ccbin gcc -std=c++17
FAILED: /dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/build/temp.linux-x86_64-cpython-311/torch_library.o
/home/modules/software/CUDA/12.1.1/bin/nvcc  -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/TH -I/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/include/THC -I/home/modules/software/CUDA/12.1.1/include -I/dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/self_compiler_include_dirs_test -I/home/modules/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c -c /dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/torch_library.cu -o /dev/shm/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/cpp_extensions/build/temp.linux-x86_64-cpython-311/torch_library.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1017"' -DTORCH_EXTENSION_NAME=torch_library -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_80,code=sm_80 -ccbin gcc -std=c++17
/home/modules/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/detail/../cast.h: In function typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&):
/home/modules/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/detail/../cast.h:45:120: error: expected template-name before < token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                        ^
/home/modules/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/detail/../cast.h:45:120: error: expected identifier before < token
/home/modules/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/detail/../cast.h:45:123: error: expected primary-expression before > token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                           ^
/home/modules/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/detail/../cast.h:45:126: error: expected primary-expression before ) token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                              ^
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/tmp/eb-n8ln4kf7/tmpijxfu2iq/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/home/modules/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

@Flamefire
Contributor

Flamefire commented Mar 13, 2024

Yep, that rings a bell for me, and it is actually a good reason why I'm so eager to investigate test failures instead of dismissing them just because many/most are bad code. This is an issue with the pybind11 version we use (pybind11/2.11.1-GCCcore-12.3.0); I added a patch to fix it.
Try to rebuild your pybind11 from #19047 (included in EB 4.8.2+).

The relevant patch is https://github.com/easybuilders/easybuild-easyconfigs/blob/f9d0188424dd253a0718af1c61ba002f66072b46/easybuild/easyconfigs/p/pybind11/pybind11-2.10.3_fix-nvcc-compat.patch

I suppose you could also apply that to your installed pybind11 module manually

@Flamefire
Contributor

We tried building this here at DTU, but it failed with 9 failing tests. Should it not pass if fewer than 50 tests fail? That part does not seem to work, and I can see that the test builds above were all done with --ignore-test-failure.

Is it a not-yet-implemented feature (in EB 5.0, perhaps) that up to 50 tests may fail, or is there some bug such that the max_failed_tests = 50 line in the easyconfig has no effect?

I opened easybuilders/easybuild-easyblocks#3255 to improve the message in this case

@casparvl
Contributor

casparvl commented Mar 14, 2024

Just to let you both know: I hit the same issue as @schiotz, in that I thought max_failed_tests was being ignored. I also got the output:

Failed tests (suites/files):
test_quantization 1/1 (4 failed, 1043 passed, 57 skipped, 8 rerun)
test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)
+ test_cpp_extensions_aot_ninja 1/1
+ test_cpp_extensions_aot_no_ninja 1/1

So I'll try to rebuild pybind11 and see if that fixes things for me. Thanks for the hint, @Flamefire!

I think improving the error message is indeed a good idea. Your explanation here

The + prefix on test_torch is a hint that this test was not counted toward the failures. In fact this test fails entirely and crashes on my machine, which is why there is no test failure count information. To be safe, the run is considered a failure without comparing against the allowed failures. We should be more explicit in the log messages here.

is very clear, but without it the + symbol doesn't mean much, making it non-obvious why the installation is not proceeding. Thus, https://github.com/easybuilders/easybuild-easyblocks/pull/3255/files sounds very useful to me :)

@schiotz
Contributor

schiotz commented Mar 14, 2024

@casparvl @Flamefire

After rebuilding pybind11, PyTorch compiled without issues (finished during the night).

Thank you very much @Flamefire, we had built pybind11 a few weeks before your patch to it.

@emdrago

emdrago commented Apr 24, 2024

Test report by @emdrago
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gput052 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 x NVIDIA NVIDIA GeForce RTX 2080 Ti, 535.54.03, Python 3.9.18
See https://gist.github.com/emdrago/e864743fc5bc6ac8f04f3453afe8a0e6 for a full test report.
