This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
Improve the speed of the pointwise fusion graph pass #17114
Merged
Conversation
@zburning Were you able to validate that this fixes the issue you reported?
samskalicky approved these changes on Dec 19, 2019
LGTM, thanks for the quick fix!
zhreshold approved these changes on Dec 19, 2019
looks good
@ptrendx Yes! I tested it on a p3 instance; the first batch costs 11.4s with 4 GPUs. Thank you for your quick fix.
ptrendx added a commit to ptrendx/mxnet that referenced this pull request on Dec 20, 2019
* Debug the long startup time
* Optimize backward fusion
* Figure out why the fusion pass is called twice
* Cleaning
* Small optimization
ptrendx added a commit that referenced this pull request on Dec 20, 2019
* Improve the speed of the pointwise fusion graph pass (#17114)
  * Debug the long startup time
  * Optimize backward fusion
  * Figure out why the fusion pass is called twice
  * Cleaning
  * Small optimization
* [BUGFIX] Fix trainer param order (#17068)
  * fix trainer param order
  * Update trainer.py
  * Update trainer.py
  * Update trainer.py
* [reproducibility] multi_sum_sq review, AtomicAdd removal (#17002)
  * Update multi_sum_sq to avoid AtomicAdd
  * Add specific test for multi_sum_sq
  * Add a determinism test and lint issues
  * better test for checking op is deterministic
  * Follow MXNet letters case format
  * Reduce dimensions of tensors in the test

Co-authored-by: Haibin Lin <[email protected]>
Co-authored-by: MoisesHer <[email protected]>
Description
Fixes #17105
This PR fixes 2 problems (that together lead to the huge time spent doing fusion reported in #17105).
The first problem was the `get_subsets` function, a part of the pointwise fusion graph pass. It involves a step that ensures there are no cycles created in the graph after the fusion step. It does so in 2 steps:

- for every node that is incompatible with fusion, it computes a separation set of the nodes reachable from it
- it looks at those separation sets and, for every node, creates the set of nodes that it is incompatible to be in the same fusion with

Both of those steps were improved:

- in the first step, if all inputs of an incompatible node are also incompatible, then the separation set produced by that node is strictly smaller than the ones produced by those inputs, so there is no need to create it
- in the second step, if the incompatible node producing the separation set was already part of the set of nodes the current node is incompatible with, then every node in that separation set was already included there as well, so the insertion can be skipped

In the repro script from the issue, this change cut the number of insertion trials in the forward fusion pass from ~1M to ~2000.

The other problem fixed in this PR is that, due to the variable scope of
`state_ptr` in the `CachedOp::Forward` method, the state of the `CachedOp` was taken twice for any context, resulting in 2 calls into the graph optimization methods instead of 1.

In my local environment, the time spent on the fusion pass in the repro script came down from over 200s to 4.5s on a single GPU.
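The two pruning rules described above can be sketched as follows. This is a minimal, illustrative Python model, not MXNet's actual implementation (which lives in the C++ fusion pass); all function and variable names here are hypothetical:

```python
from collections import defaultdict

def separation_sets(nodes, edges, incompatible):
    """Compute a separation set (all reachable nodes) for each
    fusion-incompatible node in a DAG given as (src, dst) edge pairs."""
    children, parents = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
        parents[dst].append(src)

    sep = {}
    for n in nodes:
        if n not in incompatible:
            continue
        # Pruning rule 1: if every input of this incompatible node is
        # itself incompatible, its separation set is strictly contained
        # in its inputs' sets, so there is no need to create it.
        ins = parents[n]
        if ins and all(p in incompatible for p in ins):
            continue
        # Plain DFS: everything reachable from n forms its separation set.
        reachable, stack = set(), [n]
        while stack:
            for child in children[stack.pop()]:
                if child not in reachable:
                    reachable.add(child)
                    stack.append(child)
        sep[n] = reachable
    return sep

def add_separation(exclusion_set, producer, sep_set):
    """Merge one separation set into a node's exclusion set.

    Pruning rule 2: if the incompatible node that produced this
    separation set is already in the exclusion set, the whole set was
    merged in earlier, so the insertion can be skipped."""
    if producer in exclusion_set:
        return False  # already covered, skip the expensive insertion
    exclusion_set.add(producer)
    exclusion_set |= sep_set
    return True
```

On a diamond graph a→{b,c}→d with `{a, b}` incompatible, rule 1 drops `b`'s separation set entirely (its only input `a` is also incompatible), and rule 2 turns repeated insertions of `a`'s set into O(1) skips, which is the effect the PR reports as ~1M trials dropping to ~2000.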
@leezu @zburning @Caenorst @samskalicky FYI
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
- Fixed the `CachedOp::Forward` method, where the shared_ptr to `CachedOpState` was taken twice, resulting in launching the optimization step twice.
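The scoping bug behind this change can be illustrated with a small sketch. This is hypothetical Python standing in for the C++ of `CachedOp::Forward` (names are illustrative, and `get_state` models the state take that triggered the graph-optimization methods): taking the state in a narrow scope and again later runs the optimization twice, while hoisting a single take runs it once:

```python
class CachedOpSketch:
    """Illustrative stand-in for CachedOp: each state take counts as one
    entry into the graph-optimization methods, as in the buggy path."""
    def __init__(self):
        self.opt_calls = 0

    def get_state(self, ctx):
        self.opt_calls += 1  # stands in for the optimization/fusion work
        return {"ctx": ctx}

def forward_buggy(op, ctx):
    # The state pointer lived in an inner scope...
    state = op.get_state(ctx)   # first take -> optimization runs
    del state                    # ...and went out of scope,
    # so the state had to be taken again later in the method.
    state = op.get_state(ctx)   # second take -> optimization runs again
    return state

def forward_fixed(op, ctx):
    # Hoist the take to method scope: one take, reused throughout.
    state = op.get_state(ctx)
    return state
```

With this model, `forward_buggy` enters the optimization step twice per call and `forward_fixed` only once, mirroring the 2-calls-instead-of-1 behavior the description reports.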