{ai}[foss/2021b] PyTorch v1.12.1 w/ Python 3.9.6 w/ CUDA 11.5.2 & dependencies #17272

Merged
casparvl merged 5 commits into easybuilders:develop from Flamefire:20230207150040_new_pr_PyTorch1121 on Mar 29, 2023

@Flamefire
Contributor

@Flamefire Flamefire commented Feb 7, 2023

(created using eb --new-pr)

Update of #17154 using CUDA 11.5 due to an incompatibility.

As this works while the other does not, this PR
Closes #17154

@Flamefire Flamefire changed the title from "{lib}[GCCcore/11.2.0,foss/2021b,system/system] PyTorch v1.12.1, NCCL v2.10.3, UCX-CUDA v1.11.2, ... w/ Python 3.9.6" to "{ai}[foss/2021b] PyTorch v1.12.1 w/ Python 3.9.6 w/ CUDA 11.5.2 & dependencies" on Feb 7, 2023
@boegelbot

This comment was marked as outdated.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
taurusa12 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/e3cabd5668e5c2ecf4b35a8e4a80112c for a full test report.

…10.3-GCCcore-11.2.0-CUDA-11.5.2.eb, UCX-CUDA-1.11.2-GCCcore-11.2.0-CUDA-11.5.2.eb, magma-2.6.2-foss-2021b-CUDA-11.5.2.eb, cuDNN-8.4.1.50-CUDA-11.5.2.eb, CUDA-11.5.2.eb
@Flamefire Flamefire force-pushed the 20230207150040_new_pr_PyTorch1121 branch from d875191 to 5cbe0e9 on February 10, 2023 10:30
@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 5 out of 6 (6 easyconfigs in total)
bear-pg0103u14a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 2 x NVIDIA NVIDIA A30, 520.61.05, Python 3.6.8
See https://gist.github.com/7c637f039ff49f68749755c511119bd5 for a full test report.

@Flamefire
Contributor Author

@branfosj I added a skip for that failing test, similar to the one in PyTorch 1.12 for CUDA 11.7.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 5 out of 6 (6 easyconfigs in total)
taurusml5 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/c81d75ad4be0502437217823c4609154 for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (6 easyconfigs in total)
bear-pg0103u14a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 2 x NVIDIA NVIDIA A30, 520.61.05, Python 3.6.8
See https://gist.github.com/139b292b01bb3b2a4eb61cccca13003b for a full test report.

@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 81 out of 82 (6 easyconfigs in total)
bear-pg0212u17a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), 1 x NVIDIA Tesla P100-PCIE-16GB, 520.61.05, Python 3.6.8
See https://gist.github.com/595fb693836512854f8aa6ec3b42a8a3 for a full test report.

@casparvl
Contributor

Test report by @casparvl
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
gcn5.local.snellius.surf.nl - Linux Rocky Linux 8.7, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 1 x NVIDIA NVIDIA A100-SXM4-40GB, 515.86.01, Python 3.6.8
See https://gist.github.com/f337586642709534ed2c3b087594a63b for a full test report.

@casparvl
Contributor

It would be nice to get this merged... @branfosj @Flamefire I see two remaining test failures in your test reports. Any chance you can investigate? Do we know if these are just flaky tests, or do they point to 'real' issues? If they are only flaky tests, maybe we can just add skips for them. Or we just merge it 'as is': an easyconfig that works for everything except one test failure is still better than not having any easyconfig at all. And we're taking way too long to get this PyTorch stuff merged... :(

@Flamefire
Contributor Author

> It would be nice to get this merged... @branfosj @Flamefire I see two remaining test failures in your test reports. Any chance you can investigate? Do we know if these are just flaky test results, or do they point to 'real' issues?

Sorry, not right now or in the next ~8-10 weeks.

Feel free to merge this, but please create an issue to investigate it and assign it to me so I can look at it when I'm back.

@casparvl
Contributor

Discussed on Slack with @branfosj. Proposed action:
Disable distributed/fsdp/test_fsdp_core and test_native_mha, and create an issue for each.
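
For illustration, disabling these two tests could look roughly like the excerpt below. This is only a sketch: it assumes the PyTorch easyblock's `excluded_tests` parameter is used, written as a dict keyed by architecture name where the empty string applies to all architectures. The actual change in this PR may differ, for example by skipping the tests through a patch instead.

```python
# Hypothetical excerpt from a PyTorch easyconfig (easyconfigs use Python syntax).
# Assumption: `excluded_tests` is a dict keyed by architecture name,
# where the empty-string key applies to every architecture.
excluded_tests = {
    '': [
        # Failed in some of the test reports for this PR;
        # to be investigated in separate follow-up issues.
        'distributed/fsdp/test_fsdp_core',
        'test_native_mha',
    ],
}
```

Entries are test names relative to PyTorch's `test` directory, without the `.py` suffix, matching how the two tests are named in the comment above.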

@casparvl
Contributor

@boegelbot please test @ jsc-zen2

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=17272 EB_ARGS= /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_17272 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 2416

Test results coming soon (I hope)...

Details

- notification for comment with ID 1487070214 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/7265f3ed72dcee40c4ec0bf79944f6e6 for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
bear-pg0103u11a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 1 x NVIDIA NVIDIA A100-PCIE-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/def32bc7b2a6d3b439c6bfd722ab066a for a full test report.

@casparvl
Contributor

Test report by @casparvl
SUCCESS
Build succeeded for 19 out of 19 (6 easyconfigs in total)
gcn50.local.snellius.surf.nl - Linux Rocky Linux 8.7, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 1 x NVIDIA NVIDIA A100-SXM4-40GB, 515.86.01, Python 3.6.8
See https://gist.github.com/casparvl/92ffdef5e61765630592c1b2c059cee0 for a full test report.

@branfosj branfosj dismissed casparvl’s stale review March 29, 2023 07:45

changes made

@branfosj branfosj added this to the next release (4.7.2?) milestone Mar 29, 2023
@casparvl
Contributor

Going in, thanks @Flamefire!

@casparvl casparvl merged commit 0fdeaeb into easybuilders:develop Mar 29, 2023
@boegel
Member

boegel commented Mar 30, 2023

Test report by @boegel
FAILED
Build succeeded for 5 out of 6 (6 easyconfigs in total)
node3900.accelgor.os - Linux RHEL 8.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 525.85.12, Python 3.6.8
See https://gist.github.com/boegel/202f911923ab0613c6d759e9cee84acb for a full test report.

@Flamefire Flamefire deleted the 20230207150040_new_pr_PyTorch1121 branch March 30, 2023 13:33