RFC: reduce allgather costs #217
Codecov Report
@@ Coverage Diff @@
## master #217 +/- ##
==========================================
+ Coverage 96.72% 96.73% +0.01%
==========================================
Files 92 184 +92
Lines 2935 5888 +2953
==========================================
+ Hits 2839 5696 +2857
- Misses 96 192 +96
Actually implemented a more general solution for dist_reduce_fx="cat" and switched the metrics over to it. Interestingly, we didn't have any built-in metrics using the "cat" reduction; all of them were doing it manually.
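A rough sketch of what concatenating a "cat" state before synchronization saves, for anyone skimming the thread. This is not the code in the diff: plain Python lists stand in for tensors, and `make_counting_all_gather`, `sync_per_batch` and `sync_after_cat` are hypothetical names made up for illustration. The fake gather just pretends every rank holds the same chunk, so the sketch runs without a process group.

```python
from typing import Callable, Dict, List, Tuple

Chunk = List[float]  # stand-in for a 1-D tensor of predictions


def make_counting_all_gather(
    world_size: int,
) -> Tuple[Callable[[Chunk], List[Chunk]], Dict[str, int]]:
    """Hypothetical stand-in for a real all-gather collective.

    Returns the fake collective plus a counter of how often it was called.
    """
    calls = {"n": 0}

    def fake_all_gather(chunk: Chunk) -> List[Chunk]:
        # Assumption: every rank contributes an identical chunk.
        calls["n"] += 1
        return [list(chunk) for _ in range(world_size)]

    return fake_all_gather, calls


def sync_per_batch(state: List[Chunk], all_gather) -> Chunk:
    """One collective per accumulated batch."""
    gathered = [all_gather(chunk) for chunk in state]
    return [x for per_rank in gathered for chunk in per_rank for x in chunk]


def sync_after_cat(state: List[Chunk], all_gather) -> Chunk:
    """Concatenate locally first, then issue a single collective."""
    merged = [x for chunk in state for x in chunk]  # the local "cat"
    return [x for chunk in all_gather(merged) for x in chunk]


# Three validation batches accumulated in a list state.
state = [[0.1, 0.2], [0.3], [0.4, 0.5]]
gather_a, calls_a = make_counting_all_gather(world_size=2)
gather_b, calls_b = make_counting_all_gather(world_size=2)
out_a = sync_per_batch(state, gather_a)
out_b = sync_after_cat(state, gather_b)
print(calls_a["n"], calls_b["n"])
```

Both paths gather the same values, though in a different element order; the per-batch path issues one collective per accumulated batch, while the pre-concatenated path issues one in total.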
LGTM, feel free to add change to more metrics!
Co-authored-by: Nicki Skafte <[email protected]>
Oh, nice, I missed those!
Before submitting
What does this PR do?
We have a number of metrics that follow a pattern of accumulating (preds, targets) in the state as a list. This leads to a lot of individual AllGather communications at the very end, because we end up running them one validation batch at a time. In this diff, I'm suggesting we do a cat on those states before we do synchronization.
I'm showing it only for AUROC (sending the PR to get CI to run the tests, but they do pass locally). If people agree that this is a good idea, we can discuss how to do this: we might have a flag on the base Metric that enables this in the vanilla_sync_dist function (to avoid having to copy/paste this snippet around to PR, AveragePrecision and possibly other metrics as well).
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃