Dual stream cudnn Convolution backward() with MXNET_GPU_WORKER_NSTREAMS=2. #14006
Conversation
Instead of introducing another "knob", couldn't we make the system "smart" enough to figure out the right option on its own? I'm afraid that a high number of configuration options might scare off users, or they'll simply never hear about them. This could lead to first experiments going worse than MXNet is capable of because the default settings were not optimal.
I think we're not at the point of having the framework be smart enough to make this trade-off, which involves a potential performance increase at the expense of a larger model global memory footprint. I think users would be upset if they suddenly had to drop their batchsize due to an out-of-memory error (and lose perf) because the framework was ill-advisedly stretching for the ultimate efficiency. On the other hand, allowing experts to fine-tune performance for production uses and demonstrators is important. I think a next step toward what you're asking is a storage allocation that is allowed to return a null pointer (rather than immediately exiting). That would allow operators to take sensible fall-back approaches if memory is exhausted. This would be a different PR after some discussion. In the meantime, I still recommend adding this knob in the form of the MXNET_GPU_WORKER_NSTREAMS environment variable. I think it's natural to first introduce a facility with a manual control knob, then evolve to a point where the framework picks the best setting. The control knob could be retained quietly in the background to support testing and to prove that the automatic selection is performing correctly.
Will take a look soon. RNN can benefit from the additional stream too.
Good point, thanks
@mxnet-label-bot update [CUDA, Operator, pr-work-in-progress]
@szha I reworked the implementation in response to your suggestions. Eager for your impression. I'm tracking down a CI issue with spawning test processes on CentOS 7 (don't see it on Ubuntu 16.04). I'm hoping that will result in only a minor tweak from the current state of the PR.
I've rerun some perf analysis of this PR, which, I'll remind everyone, changes nothing by default. However, when I set MXNET_GPU_WORKER_NSTREAMS=2, I see higher performance for all batchsizes. The perf gains I measured on a run across 8 Volta GPUs of Resnet50 v1b (also with horovod and DALI in NVIDIA's MXNet container) were:
The primary application area of this PR is scale-out training across multiple nodes, where a too-large global batchsize can impact final accuracy (thus driving per-GPU batchsize down). The RN50 global memory increase was from 1.4% (bs 32) to 2.6% (bs 256). This work is no longer "in progress." Requesting final review, thanks. @szha @marcoabreu
@@ -174,6 +174,12 @@ When USE_PROFILER is enabled in Makefile or CMake, the following environments ca

## Other Environment Variables
* MXNET_GPU_WORKER_NSTREAMS |
This gives me the impression that nstreams may support more than 2 in the future. However, the C interface only returns one aux stream. Do you foresee that more than 1 aux stream will be needed to help accelerate parallel computation in the future? If so, do we need to update the interface to reflect that?
The short answer is 'yes': an operator with 3 inputs might make use of 3 streams in Backward(), so I did not want to propose an environment variable name like MXNET_GPU_WORKER_USE_DUAL_STREAM=0/1 that might soon become obsolete. On the other hand, Convolution only needs 2 streams, and I did not want to burden this enhancement with more complexity than is needed at this time. I propose that when we have a use-case for 3 or more streams, we can expand the implementation and employ that use-case in our testing of it.
At the end of every kernel execution, there is a fall-off in GPU utilization leading up to the completion of the last grid block. When two streams are being used, these utilization gaps can be filled by work from the second stream. I would guess that having 3 streams would not enhance this effect. On the other hand, let's say you had 3 small independent kernels that each would occupy a third of the GPU. You could see how having 3 streams would be a win in this case over 2 streams.
So it's good that you ask, how might we expand this to 3 or more streams? The MXNET_GPU_WORKER_NSTREAMS environment variable would remain unchanged, though the documentation would indicate that the framework supports a value greater than 2. Legacy env-var uses would be preserved, so I think this could happen as part of a minor release. At the RunContext level, a GPUAuxStream* would be replaced by a std::vector<GPUAuxStream*>. The RunContext method get_gpu_aux_stream() might then be changed to RunContext::get_gpu_aux_stream(int aux_stream_id = 0), which would not break operator code that started using the simpler aux_stream API proposed by this PR.
After the rework of this PR to make it far simpler to use within operators, I went back and re-measured the 1-GPU training speeds. The perf gains I measured on a run on a single 32 GB Volta GPU of Resnet50 v1b (also with DALI in NVIDIA's MXNet container) were:
The speedup is based on a comparison of the 2nd epoch "time cost", where the 1st epoch time is not considered because of cuDNNFind() and DALI overheads that are unique to the 1st epoch. Single GPU training is not really the target of this PR, but at least this shows there's still a 1% improvement at a typical batchsize of 128. I don't recommend enabling 2 streams by default, however, because the increased use of global memory might make some users' models too big to run. Looking for any further reviewer input.
Looks good. @szha could you check whether the code change addresses your concerns?
Yes, my concerns are addressed with a clean solution. Thanks @DickJC123 @ptrendx
…MS=2. (apache#14006)

* Dual stream conv backward(). Enable with MXNET_GPU_WORKER_NSTREAMS=2.
* Fix for MSVC compiler.
* Fix cpplint.
* Add MXNET_GPU_WORKER_NSTREAMS env var documentation.
* Improve test function and commenting.
* Add description of proper aux stream use using events.
* RAII rework to simplify usage within operators.
* Fix cpplint.
* Expand testing to cover all engines.
* Fix NaiveEngine shutdown segfault on CentOS7.
Description
This PR adds a second 'auxiliary' stream to RunContext and makes it available to operators. It also modifies the Backward() operation of the cuDNN implementation of Convolution to run the dgrad and wgrad kernels in parallel, rather than in series. For large batchsizes (e.g. 256), each of these kernels by itself occupies the entire GPU, so the training performance improvement is negligible. However, when the per-GPU batchsize is small (e.g. 32, as is desirable for fast 'scale-out' training that maintains accuracy), the performance improvement is in the neighborhood of 2-3%. More details in a follow-up comment.
By default, this PR does not affect the behavior of the framework. However, by setting the environment variable MXNET_GPU_WORKER_NSTREAMS=2, the cuDNN Convolution backward dgrad and wgrad will be run in separate streams. The resulting speed-up comes with a modest downside: the kernel workspace areas can no longer be shared, so the model's global memory footprint grows by 2-3% in the case of Resnet50. This is of no consequence for the main application area of this new feature: small per-GPU batchsize training.
The bottom line is that this can be a useful optional knob for users, particularly those attempting to duplicate published MLPerf results. This PR includes a test of cudnn vs. no_cudnn Convolution with MXNET_GPU_WORKER_NSTREAMS set to both 1 and 2.
Checklist
Essentials
Changes
Comments