Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: reduce allgather costs #217

Merged
merged 15 commits into from
May 3, 2021
Merged

Conversation

maximsch2
Copy link
Contributor

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

We have a number of metrics that are following a pattern of accumulating (preds, targets) in the state as a list. This leads to a lot of individual AllGather communications at the very end as we end up running them one validation batch at a time. In this diff, I'm suggesting we do a cat on those states before we do synchronization.

I'm showing it only for AUROC (sending PR to get CI to run tests, but they do pass locally). If people agree that this is a good idea, we can discuss how to do this - might be we can have a flag on the base Metric that will enable this in vanilla _sync_dist function (to avoid having to copy/paste this snippet around to PR, AveragePrecision and possible other metrics as well)

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@codecov
Copy link

codecov bot commented Apr 29, 2021

Codecov Report

Merging #217 (d13702d) into master (df6e4ba) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #217      +/-   ##
==========================================
+ Coverage   96.72%   96.73%   +0.01%     
==========================================
  Files          92      184      +92     
  Lines        2935     5888    +2953     
==========================================
+ Hits         2839     5696    +2857     
- Misses         96      192      +96     
Flag Coverage Δ
Linux 79.82% <81.57%> (+0.06%) ⬆️
Windows 79.82% <81.57%> (+0.06%) ⬆️
cpu 96.73% <100.00%> (+0.01%) ⬆️
gpu 96.73% <ø> (?)
macOS 96.73% <100.00%> (+0.01%) ⬆️
pytest 96.73% <100.00%> (+0.01%) ⬆️
python3.6 95.66% <100.00%> (+0.01%) ⬆️
python3.8 96.70% <100.00%> (+0.01%) ⬆️
python3.9 ?
torch1.3.1 95.66% <100.00%> (+0.01%) ⬆️
torch1.4.0 95.78% <100.00%> (+0.01%) ⬆️
torch1.8.1 96.60% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
torchmetrics/classification/auc.py 100.00% <100.00%> (ø)
torchmetrics/classification/auroc.py 92.68% <100.00%> (+0.18%) ⬆️
torchmetrics/classification/average_precision.py 100.00% <100.00%> (ø)
...chmetrics/classification/precision_recall_curve.py 100.00% <100.00%> (ø)
torchmetrics/metric.py 95.50% <100.00%> (+0.04%) ⬆️
torchmetrics/regression/pearson.py 100.00% <100.00%> (ø)
torchmetrics/regression/spearman.py 100.00% <100.00%> (ø)
torchmetrics/regression/ssim.py 100.00% <100.00%> (ø)
...unctional/classification/precision_recall_curve.py 91.30% <0.00%> (ø)
...etrics/functional/regression/explained_variance.py 100.00% <0.00%> (ø)
... and 90 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df6e4ba...d13702d. Read the comment docs.

@maximsch2
Copy link
Contributor Author

Actually implemented a more general solution for dist_reduce_fx="cat" and switched metrics to using it. Interestingly, we didn't have any built-in metrics using "cat" reduction - all of them were doing it manually.

Copy link
Member

@SkafteNicki SkafteNicki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, feel free to add change to more metrics!

torchmetrics/metric.py Outdated Show resolved Hide resolved
torchmetrics/metric.py Outdated Show resolved Hide resolved
@SkafteNicki SkafteNicki added the enhancement New feature or request label May 3, 2021
@SkafteNicki SkafteNicki added this to the v0.4 milestone May 3, 2021
@maximsch2
Copy link
Contributor Author

Oh, nice, I missed those!

@SkafteNicki SkafteNicki enabled auto-merge (squash) May 3, 2021 19:35
@SkafteNicki SkafteNicki merged commit 7a9e50a into Lightning-AI:master May 3, 2021
@maximsch2 maximsch2 deleted the reduce_comms branch May 3, 2021 21:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request ready
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants