[FIX] Average Pbar Metrics #4534
Conversation
Hello @tchaton! Thanks for updating this PR.
Comment last updated at 2020-11-12 14:03:57 UTC
Codecov Report

@@           Coverage Diff           @@
##           master   #4534   +/-   ##
=======================================
  Coverage      93%     93%
=======================================
  Files         116     116
  Lines        8879    8881     +2
=======================================
+ Hits         8261    8263     +2
  Misses        618     618
….com/PyTorchLightning/pytorch-lightning into bug_fix/4395_sum_of_loss_across_gpus
Looks good! As mentioned, I don't think it's a high priority to worry about a duplicate reduce call on a one-item tensor, and for explicitness it's better to do it as part of this function.
LGTM
* wip
* update
* normalize loss
* update test
* resolve bug
* update test and add TODO
* make sure it can be sync
* add TODO
* update sol

(cherry picked from commit 4a01fd0)
This fix seems to be broken in the latest master (commit 0e19d16): the metrics are summed across GPUs rather than averaged. It works correctly in 1.3.8.
What does this PR do?
Problem: the loss shown in the progress bar appears to be the sum of the loss across all GPUs in Lightning 1.0.3.

Cause: the sync_dist reduction was performed in place, so the value was reduced across GPUs and then reused, which changed the subsequent behaviour.

Solution: clone the value before reducing when sync_dist is enabled (a minimal sketch of the idea follows below).

TODO: find a way to perform the sync_dist reduction only once.

Closes #4395
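For context, here is a minimal sketch of the cloning idea, assuming a plain torch.distributed setup; the helper name sync_value is made up for illustration and is not the actual Lightning API.

```python
import torch
import torch.distributed as dist


def sync_value(value: torch.Tensor, sync_dist: bool = False) -> torch.Tensor:
    """Average a logged scalar across processes without mutating the input."""
    if not sync_dist or not dist.is_available() or not dist.is_initialized():
        return value
    # all_reduce operates in place; reducing the original tensor would also
    # change the value that later consumers (e.g. the progress bar) read.
    reduced = value.clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    reduced /= dist.get_world_size()  # average instead of summing across GPUs
    return reduced
```

Cloning keeps the caller's tensor untouched, so the progress bar aggregation sees the original per-process value rather than an already-reduced one.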
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃