Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DP Logging Aggregation #4138

Merged
merged 9 commits into from
Dec 4, 2020
Merged

Fix DP Logging Aggregation #4138

merged 9 commits into from
Dec 4, 2020

Conversation

justusschock
Copy link
Member

What does this PR do?

Fixes a bug in LoggerConnector, where the metric values that should be reduced are not on the same device in DP

Fixes #4073

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks
Copy link

pep8speaks commented Oct 14, 2020

Hello @justusschock! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-12-04 17:30:46 UTC

@mergify mergify bot requested a review from a team October 14, 2020 08:56
@justusschock justusschock self-assigned this Oct 14, 2020
@justusschock justusschock added the bug Something isn't working label Oct 14, 2020
@codecov
Copy link

codecov bot commented Oct 14, 2020

Codecov Report

Merging #4138 (9ce9ea0) into master (ed5bda3) will increase coverage by 0%.
The diff coverage is 90%.

@@          Coverage Diff           @@
##           master   #4138   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         124     124           
  Lines        9294    9299    +5     
======================================
+ Hits         8608    8615    +7     
+ Misses        686     684    -2     

@justusschock
Copy link
Member Author

@Borda any clue on the failing Mac tests?

@awaelchli
Copy link
Contributor

@justusschock It comes from master branch

@Borda
Copy link
Member

Borda commented Oct 14, 2020

@Borda any clue on the failing Mac tests?

It is related to latest packages...

@justusschock It comes from master branch

Yes, we can check if there was a package upgrade in last days...

@edenlightning edenlightning added this to the 1.0.3 milestone Oct 19, 2020
@edenlightning
Copy link
Contributor

@justusschock fix failing tests?

@justusschock
Copy link
Member Author

@edenlightning they should not be related to this pr. But I will check again

@SeanNaren
Copy link
Contributor

Can we get a test for this in as well?

@awaelchli
Copy link
Contributor

@justusschock is this fix finished, do you need help?

@justusschock
Copy link
Member Author

It is not yet finished. I was planning to get this done today or tomorrow (my old fix sin't valid anymore, we talked about this offline). If you have some time you can for sure give it a shot :)

Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great ! Great catch !

tchaton
tchaton previously requested changes Nov 4, 2020
Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance you could add a test ?

@edenlightning edenlightning removed this from the 1.0.x milestone Nov 10, 2020
@Borda Borda added logging Related to the `LoggerConnector` and `log()` priority: 0 High priority task labels Dec 3, 2020
@Borda Borda requested a review from tchaton December 3, 2020 20:37
@awaelchli
Copy link
Contributor

can I just say how frustrated I am how long it took me to realize that BoringModel uses batch size 1 so when running in DP mode it does not actually do anything on gpu 1 and so for a long time I was just not able to make the test fail on master 😩

now the reported bug RuntimeError: All input tensors must be on the same device. Received cuda:2 and cuda:3 is fixed and the test that I added fails on master with this error message.

@rohitgr7
Copy link
Contributor

rohitgr7 commented Dec 3, 2020

@adityak2920 is this the issue you were facing?? Does it solve your use-case?

@adityak2920
Copy link

@adityak2920 is this the issue you were facing?? Does it solve your use-case?

Yes, I was facing this issue and will solve my use case.

CHANGELOG.md Outdated Show resolved Hide resolved
@Borda Borda added the ready PRs ready to be merged label Dec 4, 2020
@Borda
Copy link
Member

Borda commented Dec 4, 2020

@awaelchli @SeanNaren mind review?

@awaelchli
Copy link
Contributor

it requires also @tchaton review because he had requested changes.

@Borda Borda dismissed tchaton’s stale review December 4, 2020 17:02

pls re-review

justusschock and others added 7 commits December 4, 2020 18:04
move result


stupid result object


revert to master


undo import


add "to" method to result


generalize to


try a test


try a test


Revert "try a test"

This reverts commit 22e3c10.

Revert "try a test"

This reverts commit 4d2d8fb.

new test


max epochs


super epoch end 


log in test


hanging test


undo test


initial test that fails on master


step end


pass


step end


step end


epoch end


print


step


check dev


clean up test


sanity check


wtf is go ing on


frustration


debugging test


test


test


test


test


test


test


test


test


unused import
@Borda Borda force-pushed the fix_dp_logging_aggregation branch from bfe5395 to ec1f526 Compare December 4, 2020 17:16
Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

@Borda Borda merged commit f23f5e5 into master Dec 4, 2020
@Borda Borda deleted the fix_dp_logging_aggregation branch December 4, 2020 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working logging Related to the `LoggerConnector` and `log()` priority: 0 High priority task ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data Parallel bug (return outputs not being moved to same device)
10 participants