This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
[MXNET-1111] Horovod support for MXNet #12666
Merged
11 commits
3d63950
squash commit
ab2225f
get rid of argument
64c6e1f
undo a lot of unnecessary changes
772969b
undo more changes
f84ced1
fix typo
d6b02e8
fix lint
5694aed
address comments and fix rebase mistake
67b14ce
fix typo made during rebase
e50d25f
revert cpu_pinned
bd25f65
revert changes, because works without needing to copy params to GPU. …
f6f9cf3
revert changes to comm and nccl
For my own understanding, why do we change from CPUPinned to CPU in this PR? Would it cause issues if we stuck with CPUPinned when running distributed training with Horovod?
In my benchmark, it doesn't affect performance.
You may recall we can't use `horovod.local_rank()` to determine the GPU id because `mxnet` cannot be compile-time dependent on `horovod`. But then if you set all GPUs from 0-7 to use `CPUPinned(0)`, you will get 8 processes starting the CUDA driver on GPU 0, which wastes memory, and you won't be able to get the largest batch size.
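To make the memory argument concrete, here is a rough model of it, not a measurement. It assumes one process per GPU, that each process trains on the GPU matching its local rank, and that a pinned buffer tied to device `d` makes the process start the CUDA driver on device `d`. The helper name `cuda_contexts_per_gpu` is hypothetical and uses no MXNet or Horovod API.

```python
from collections import Counter
from typing import Callable, Dict, Optional


def cuda_contexts_per_gpu(
    num_procs: int,
    pinned_device_for: Callable[[int], Optional[int]],
) -> Dict[int, int]:
    """Count how many processes touch each GPU under the model above.

    pinned_device_for(rank) returns the GPU id the pinned host buffer is
    tied to, or None when the process uses plain (unpinned) CPU memory.
    """
    counts: Counter = Counter()
    for rank in range(num_procs):
        touched = {rank}  # assumption: each process trains on GPU == local rank
        pinned = pinned_device_for(rank)
        if pinned is not None:
            touched.add(pinned)  # pinned memory also touches its own device
        counts.update(touched)
    return dict(counts)


# Every process pins against device 0: all 8 processes end up on GPU 0.
all_on_zero = cuda_contexts_per_gpu(8, lambda rank: 0)

# Plain CPU memory: each GPU is touched only by its own process.
plain_cpu = cuda_contexts_per_gpu(8, lambda rank: None)
```

Under these assumptions, pinning everything to device 0 puts eight processes on GPU 0, while plain CPU memory (or pinning per local rank) leaves one process per GPU.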
@ctcyang what did you benchmark? Did you also run kvstore('local') and kvstore('nccl') to verify the perf impact?
In my benchmark, I ran kvstore('nccl') and kvstore('device') and there is no perf impact before and after the change. I did not test kvstore('local').
I'm afraid the data copy is more likely to become the bottleneck as GPUs get faster in the next generation. What about adding a ctx option, like the one in https://mxnet.incubator.apache.org/versions/master/api/python/ndarray/ndarray.html?highlight=logis#mxnet.ndarray.zeros, which defaults to cpu_pinned(0)? Horovod users who run into memory issues can then pass either ctx=mx.cpu() or ctx=mx.cpu_pinned(local_rank()).
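A minimal sketch of what such an option could look like. `Ctx`, `cpu`, `cpu_pinned` and `staging_ctx` are hypothetical stand-ins so the snippet runs without MXNet installed. In real code the contexts would come from `mx.cpu()` and `mx.cpu_pinned(device_id)`, and the rank from Horovod's `local_rank()`.

```python
from typing import NamedTuple, Optional


class Ctx(NamedTuple):
    """Stand-in for an MXNet context (device type plus device id)."""
    device_type: str
    device_id: int = 0


def cpu() -> Ctx:
    # Stand-in for mx.cpu(): ordinary pageable host memory.
    return Ctx("cpu", 0)


def cpu_pinned(device_id: int = 0) -> Ctx:
    # Stand-in for mx.cpu_pinned(device_id): page-locked host memory.
    return Ctx("cpu_pinned", device_id)


def staging_ctx(ctx: Optional[Ctx] = None) -> Ctx:
    """Hypothetical helper: choose the host-side buffer context.

    Defaults to cpu_pinned(0) as proposed, so existing single-process
    users keep the current behaviour unless they pass ctx explicitly.
    """
    return cpu_pinned(0) if ctx is None else ctx


default_ctx = staging_ctx()                # existing behaviour
local_rank = 3                             # assumed value; Horovod would supply it
per_rank_ctx = staging_ctx(cpu_pinned(local_rank))
unpinned_ctx = staging_ctx(cpu())          # fallback when memory is tight
```

The default preserves pinned-memory copy speed for `kvstore` users, and only Horovod users who hit the GPU 0 memory problem need to opt out.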