This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Increase performance of BulkAppend and BulkFlush #14067

Merged

merged 2 commits into apache:master on Feb 6, 2019

Conversation

@ptrendx (Member) commented Feb 5, 2019

Description

Increase the performance of BulkAppend and BulkFlush methods used in Gluon hybridized models with static_alloc=True, static_shape=False.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Previously, bulking ops in Gluon created a new lambda on every BulkAppend call in a nested (recursive) scheme, which copied the entire captured environment each time. This PR instead introduces a vector of lambda functions in BulkStatus and appends to that vector.
  • That change improves BulkAppend, but BulkFlush still had to copy the entire vector of lambdas, including all of their captured environments. This is alleviated by passing a shared_ptr to the vector instead of passing the vector by value, which speeds up BulkFlush by ~3.5x, from ~70us to ~20us (a minimal sketch of this pattern follows after this list).
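
Below is a minimal stand-alone sketch of the pattern described above, using hypothetical names (BulkState, Append, Flush) rather than the actual MXNet BulkStatus/Engine code: each appended op is pushed into a vector of lambdas, and the flushed callable captures a shared_ptr to that vector, so only a pointer is copied when the callable is passed around.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

struct BulkState {
  // Each appended op adds one callable; no nested-lambda environment copy per append.
  std::vector<std::function<void()>> fns;
};

void Append(BulkState* state, std::function<void()> fn) {
  state->fns.push_back(std::move(fn));  // amortized O(1), no re-copy of earlier ops
}

std::function<void()> Flush(BulkState* state) {
  // Move the vector into a shared_ptr so the returned lambda carries only a
  // pointer instead of a copy of every captured environment.
  auto fns = std::make_shared<std::vector<std::function<void()>>>(
      std::move(state->fns));
  state->fns.clear();
  return [fns]() {
    for (auto& fn : *fns) fn();
  };
}

int main() {
  BulkState state;
  Append(&state, [] { /* op 1 */ });
  Append(&state, [] { /* op 2 */ });
  auto bulk_fn = Flush(&state);  // the returned callable holds only a shared_ptr to the ops
  bulk_fn();                     // runs the bulked ops in order
  return 0;
}
```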

Comments

  • These changes have the biggest impact in multi-GPU or small-model scenarios, where the speed of scheduling work matters most (in multi-GPU training all work is scheduled by a single Python thread).
  • The speed of multi-GPU launches should improve further by moving to a multi-process model where each process handles a single GPU.

@eric-haibin-lin

@vandanavk (Contributor)

@mxnet-label-bot add [pr-awaiting-review, Performance]

@marcoabreu added the Performance and pr-awaiting-review (PR is waiting for code review) labels on Feb 5, 2019
@yuxihu (Member) left a comment


LGTM. So we should be able to see a bigger performance boost when training with Horovod, where a process handles a single GPU?

@yuxihu (Member) commented Feb 5, 2019

@mxnet-label-bot update [pr-awaiting-merge, Performance]

@marcoabreu added the pr-awaiting-merge (Review and CI is complete. Ready to merge) label and removed the pr-awaiting-review (PR is waiting for code review) label on Feb 5, 2019
@ptrendx (Member, Author) commented Feb 5, 2019

@yuxihu It is actually the other way around: a single process per GPU alleviates some of those issues, because each GPU is handled independently. This helps most in single-process, multi-GPU cases (where a single Python thread needs to launch everything on all GPUs) and in small-batch-size scenarios, where there is little time to launch the work.

@yuxihu (Member) commented Feb 5, 2019

@ptrendx Thanks for the explanation. In any case, it is a good improvement.

@junrushao (Member)

Why a shared_ptr rather than unique_ptr?

@ptrendx (Member, Author) commented Feb 6, 2019

@junrushao1994 Two reasons:

  • std::move into a lambda capture is not supported before C++14 without some arcane template tricks, which I did not want to put there.
  • When I tried it (disregarding GCC warnings that I was using a C++14 feature), I got errors in some other places in the engine about using the deleted copy function of those lambdas, and I did not investigate further whether that is a bigger issue or just a missing std::move in a few places (a small stand-alone sketch of the capture issue follows below).
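
For illustration, here is a small stand-alone sketch of the capture issue discussed above (not MXNet code; compile with -std=c++14): a lambda that move-captures a unique_ptr becomes move-only and cannot be stored in std::function, while a copied shared_ptr capture remains copyable and works.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

int main() {
  auto uptr = std::make_unique<std::vector<int>>(std::vector<int>{1, 2, 3});

  // C++14 init-capture can move a unique_ptr into a lambda...
  auto task = [v = std::move(uptr)]() { return v->size(); };
  task();

  // ...but std::function requires a copyable callable, so storing that lambda
  // fails with a "use of deleted copy constructor" error, matching the errors
  // described above:
  // std::function<std::size_t()> f = std::move(task);  // does not compile

  // A shared_ptr capture is copyable, so the resulting lambda can be stored
  // wherever a std::function is needed:
  auto sptr = std::make_shared<std::vector<int>>(std::vector<int>{1, 2, 3});
  std::function<std::size_t()> ok = [sptr]() { return sptr->size(); };
  ok();
  return 0;
}
```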

@ptrendx (Member, Author) commented Feb 6, 2019

But I agree unique_ptr would be the ultimate solution there.

@junrushao (Member)

I see. Thanks!

@apeforest (Contributor) left a comment


LGTM

@apeforest merged commit 149d810 into apache:master on Feb 6, 2019
stephenrawls pushed a commit to stephenrawls/incubator-mxnet that referenced this pull request Feb 16, 2019
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
Labels: Performance, pr-awaiting-merge
7 participants