Dual stream cudnn Convolution backward() with MXNET_GPU_WORKER_NSTREAMS=2. #14006
Conversation
Instead of introducing another "knob", couldn't we make the system "smart" enough to figure out the right option on its own? I'm afraid that a high number of configuration options might scare off users, or they'll simply never hear about them. This could lead to first experiments going worse than MXNet is capable of because the default settings were not optimal.
I think we're not at the point of having the framework be smart enough to make this trade-off, which involves a potential performance increase at the expense of a larger model global memory footprint. I think users would be upset if they suddenly had to drop their batchsize due to an out-of-memory error (and lose perf) because the framework was ill-advisedly stretching for the ultimate efficiency. On the other hand, allowing experts to fine-tune performance for production uses and demonstrators is important. I think a next step toward what you're asking is a storage allocation that is allowed to return a null pointer (rather than immediately exiting). That would allow operators to take sensible fall-back approaches if memory is exhausted. This would be a different PR after some discussion. In the meantime, I still recommend adding this knob in the form of the MXNET_GPU_WORKER_NSTREAMS environment variable. I think it's natural to first introduce a facility with a manual control knob, then evolve to a point where the framework picks the best setting. The control knob could be retained quietly in the background to support testing and to prove that the automatic selection is performing correctly.
Will take a look soon. RNN can benefit from the additional stream too.
Good point, thanks
@mxnet-label-bot update [CUDA, Operator, pr-work-in-progress]
@szha I reworked the implementation in response to your suggestions. Eager for your impression. I'm tracking down a CI issue with spawning test processes on CentOS 7 (don't see it on Ubuntu 16.04). I'm hoping that will result in only a minor tweak from the current state of the PR.
I've rerun some perf analysis of this PR, which, I'll remind everyone, changes nothing by default. However, when I set MXNET_GPU_WORKER_NSTREAMS=2, I see higher performance for all batchsizes. The perf gains I measured on a run across 8 Volta GPUs of Resnet50 v1b (also with horovod and DALI in NVIDIA's MXNet container) were:
The primary application area of this PR is scale-out training across multiple nodes, where a too-large global batchsize can impact final accuracy (thus driving per-GPU batchsize down). The RN50 global memory increase was from 1.4% (bs 32) to 2.6% (bs 256). This work is no longer "in progress." Requesting final review, thanks. @szha @marcoabreu
@@ -174,6 +174,12 @@ When USE_PROFILER is enabled in Makefile or CMake, the following environments ca

## Other Environment Variables
* MXNET_GPU_WORKER_NSTREAMS |
This gives me the impression that nstreams may support more than 2 in the future. However, the C interface only returns one aux stream. Do you foresee that more than 1 aux stream will be needed to help accelerate parallel computation in the future? If so, do we need to update the interface to reflect that?
The short answer is 'yes': an operator with 3 inputs might make use of 3 streams in Backward(), so I did not want to propose an environment variable name like MXNET_GPU_WORKER_USE_DUAL_STREAM=0/1 that might soon become obsolete. On the other hand, Convolution only needs 2 streams, and I did not want to burden this enhancement with more complexity than is needed at this time. I propose that when we have a use-case for 3 or more streams, we can expand the implementation and employ that use-case in our testing of it.
At the end of every kernel execution, there is a fall-off in GPU utilization leading up to the completion of the last grid block. When two streams are being used, these utilization gaps can be filled by work from the second stream. I would guess that having 3 streams would not enhance this effect. On the other hand, let's say you had 3 small independent kernels that each would occupy a third of the GPU. You could see how having 3 streams would be a win in this case over 2 streams.
So it's good that you ask, how might we expand this to 3 or more streams? The MXNET_GPU_WORKER_NSTREAMS environment variable would remain unchanged, though the documentation would indicate that the framework supports a value greater than 2. Legacy env-var uses would be preserved, so I think this could happen as part of a minor release. At the RunContext level, a GPUAuxStream* would be replaced by a std::vector<GPUAuxStream*>. The RunContext method get_gpu_aux_stream() might then be changed to RunContext::get_gpu_aux_stream(int aux_stream_id = 0), which would not break operator code that started using the simpler aux_stream API proposed by this PR.
After the rework of this PR to make it far simpler to use within operators, I went back and re-measured the 1-GPU training speeds. The perf gains I measured on a run on a single 32 GB Volta GPU of Resnet50 v1b (also with DALI in NVIDIA's MXNet container) were:
The speedup is based on a comparison of the 2nd epoch "time cost", where the 1st epoch time is not considered because of cuDNNFind() and DALI overheads that are unique to the 1st epoch. Single GPU training is not really the target of this PR, but at least this shows there's still a 1% improvement at a typical batchsize of 128. I don't recommend enabling 2 streams by default, however, because the increased use of global memory might make some users' models too big to run. Looking for any further reviewer input.
Looks good. @szha could you check whether the code change addresses your concerns?
Yes, my concerns are addressed with a clean solution. Thanks @DickJC123 @ptrendx
…MS=2. (apache#14006)

* Dual stream conv backward(). Enable with MXNET_GPU_WORKER_NSTREAMS=2.
* Fix for MSVC compiler.
* Fix cpplint.
* Add MXNET_GPU_WORKER_NSTREAMS env var documentation.
* Improve test function and commenting.
* Add description of proper aux stream use using events.
* RAII rework to simplify usage within operators.
* Fix cpplint.
* Expand testing to cover all engines.
* Fix NaiveEngine shutdown segfault on CentOS7.
Description
This PR adds a second 'auxiliary' stream to RunContext and makes it available to operators. It also modifies the Backward() operation of the cuDNN implementation of Convolution to run the dgrad and wgrad kernels in parallel, rather than in series. For large batchsizes (e.g. 256), each of these kernels by itself occupies the entire GPU, so the training performance improvement is negligible. However, when the per-GPU batchsize is small (e.g. 32, as is desirable for fast 'scale-out' training that maintains accuracy), the performance improvement is in the neighborhood of 2-3%. More details in a follow-up comment.
By default, this PR does not affect the behavior of the framework. However, by setting the environment variable MXNET_GPU_WORKER_NSTREAMS=2, the cuDNN Convolution backward dgrad and wgrad will be run in separate streams. The resulting speed-up comes with a modest downside: the kernel workspace areas can no longer be shared, so the model's global memory footprint grows by 2-3% in the case of Resnet50. This is of no consequence for the main application area of this new feature: small per-GPU batchsize training.
The bottom line is that this can be a useful optional knob for users, particularly those attempting to duplicate published MLPerf results. This PR includes a test of cudnn vs. no_cudnn Convolution with MXNET_GPU_WORKER_NSTREAMS set to both 1 and 2.
Checklist
Essentials
Changes
Comments