add patches to fix PyTorch 1.10.0 build on POWER #15904

Merged
akesandgren merged 3 commits into easybuilders:develop from Flamefire:20220725091500_new_pr_PyTorch1100
Nov 30, 2022

@Flamefire
Contributor

(created using eb --new-pr)

@Flamefire Flamefire force-pushed the 20220725091500_new_pr_PyTorch1100 branch from e0e59b3 to bd1a197 Compare July 26, 2022 14:49
@boegel boegel changed the title from "Fix PyTorch 1.10.0 build on PPC" to "add patches to fix PyTorch 1.10.0 build on POWER" Aug 3, 2022
@boegel boegel added the bug fix label Aug 6, 2022
@boegel boegel added this to the next release (4.6.1?) milestone Aug 6, 2022
@akesandgren
Contributor

@Flamefire can you post a new test report if the tests pass now?

@Flamefire Flamefire marked this pull request as draft September 5, 2022 14:28
@Flamefire Flamefire force-pushed the 20220725091500_new_pr_PyTorch1100 branch 2 times, most recently from a670419 to d6979a7 Compare September 9, 2022 15:13
@Flamefire Flamefire marked this pull request as ready for review September 22, 2022 09:09
@Flamefire Flamefire force-pushed the 20220725091500_new_pr_PyTorch1100 branch from 5642e96 to cfca576 Compare September 22, 2022 12:23
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
taurusml20 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/9f51285f15cf0d06973182a1377592de for a full test report.

@Flamefire
Contributor Author

Flamefire commented Sep 23, 2022

OK, this now works on PPC, but it (still) fails on A100s in distributed/test_c10d_nccl, although we use the same NCCL version as PyTorch (2.10.3).

Edit: Fixed by cfca576

@Flamefire
Contributor Author

This can be merged now. Notably, it now also uses the release archive that the PyTorch maintainers added at my request in pytorch/pytorch#63022, so we can use proper downloads and checksums!
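
For context: a pinned checksum is only useful when the downloaded file is byte-for-byte stable, which a fixed release archive provides. Below is a minimal Python sketch of such a check. The helper names `sha256_of` and `verify_checksum` are hypothetical, chosen for illustration; this is not EasyBuild's actual implementation.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read() until it returns b"" (EOF)
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Return True if the file's SHA-256 digest matches the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A caller would compare a downloaded source archive against the checksum recorded next to it and refuse to build on a mismatch.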

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
taurusi8019 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/5ecdbba2d7fce15f583969b67275d30b for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
taurusa9 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/2f8a69d9a8e9b9c005acdd5710b61ecd for a full test report.

@boegel
Member

boegel commented Oct 10, 2022

Test report by @boegel
FAILED
Build succeeded for 0 out of 3 (3 easyconfigs in total)
node3305.joltik.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 510.85.02, Python 3.6.8
See https://gist.github.com/3a40774891b135e6444a504f6e898798 for a full test report.

@Flamefire
Contributor Author

@boegel Failing tests:

  • distributed/test_c10d_gloo failed!
  • test_autograd failed!

But I don't see why. Can you attach the full log? At least the latter test (test_autograd) should be fixed by PyTorch-1.10.0_fix-kineto-crash.patch.

@Flamefire Flamefire force-pushed the 20220725091500_new_pr_PyTorch1100 branch from 25d3709 to a2a2bab Compare October 17, 2022 09:02
@Flamefire Flamefire force-pushed the 20220725091500_new_pr_PyTorch1100 branch from a2a2bab to c1f6e50 Compare October 17, 2022 09:10
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
taurusml5 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/64d35056711ca55b40b89dfd938b9e14 for a full test report.

@boegel boegel removed this from the next release (4.6.2?) milestone Oct 18, 2022
@boegel boegel added this to the release after 4.6.2 milestone Oct 18, 2022
@Flamefire
Contributor Author

Can we include this, please? As shown above, it now works on POWER, where it didn't before. Most of the patches are also already included and tested in the 1.11 PR #16339.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
taurusi8033 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/1249b7071d87a8ca62545bd8223a6a52 for a full test report.

@Flamefire
Contributor Author

@branfosj @akesandgren Ping on this.

Contributor

@akesandgren akesandgren left a comment

LGTM

@akesandgren
Contributor

Going in, thanks @Flamefire!

@akesandgren akesandgren merged commit f7fe785 into easybuilders:develop Nov 30, 2022
@Flamefire Flamefire deleted the 20220725091500_new_pr_PyTorch1100 branch November 30, 2022 12:40