bugfix: correct node rank #19437
I think almost everyone who trains on multiple machines with multiple GPUs each will run into this problem. Should a quick fix be released?
Hey @cauyxy, great catch! Would you mind adding a test?
As with the previous one, it might take me some time to do this myself. Can you help me out? 😊
I have added a test for distributed training when gpu_per_node > node_num.
Thanks @cauyxy. This is perfect! The test fails on master and the fix is correct. Thanks for your contribution!
(cherry picked from commit 7b867c7)
What does this PR do?
Corrected the node rank calculation: the rank is now divided by the number of GPUs per machine instead of by num_nodes.
Fixes #19436
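To illustrate the difference described above, here is a minimal sketch. It is not the code changed in this PR. The function names and the 2-node, 4-GPU layout are hypothetical, and it assumes global ranks are assigned contiguously within each node.

```python
def node_rank_buggy(global_rank: int, num_nodes: int) -> int:
    # Hypothetical helper: divides by the number of nodes (the old behaviour).
    return global_rank // num_nodes


def node_rank_fixed(global_rank: int, gpus_per_node: int) -> int:
    # Hypothetical helper: divides by the number of GPUs per machine.
    return global_rank // gpus_per_node


# Assumed layout: 2 machines with 4 GPUs each, ranks 0-3 on node 0, 4-7 on node 1.
num_nodes, gpus_per_node = 2, 4
world_size = num_nodes * gpus_per_node

buggy = [node_rank_buggy(r, num_nodes) for r in range(world_size)]
fixed = [node_rank_fixed(r, gpus_per_node) for r in range(world_size)]

print("buggy:", buggy)  # node ranks exceed num_nodes - 1
print("fixed:", fixed)  # every rank maps to node 0 or node 1
```

The two formulas agree when the number of nodes equals the number of GPUs per node, which is why a test with gpu_per_node > node_num is needed to expose the bug.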
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--19437.org.readthedocs.build/en/19437/