This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Use single-bit for mask in dropout operator #16735

Open · apeforest wants to merge 75 commits into master from perf/dropout-mask

Conversation


@apeforest (Contributor) commented on Nov 6, 2019

Description

Use a single bit per element in the dropout mask to reduce memory.
This PR fixes #15968
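
As a quick illustration of the layout (a minimal sketch with illustrative helper names, not the kernel code in this PR): each element gets one bit, so a mask over n elements needs (n + 7) / 8 bytes instead of n elements of DType.

```cpp
#include <cstdint>
#include <cstddef>

// Minimal sketch of a bit-packed dropout mask (illustrative only).
// Element i is "kept" iff bit (i % 8) of byte (i / 8) is set.
inline void set_mask_bit(uint8_t* mask, std::size_t i, bool keep) {
  const uint8_t bit = static_cast<uint8_t>(1U << (i % 8));
  if (keep) {
    mask[i / 8] |= bit;
  } else {
    mask[i / 8] &= static_cast<uint8_t>(~bit);
  }
}

inline bool get_mask_bit(const uint8_t* mask, std::size_t i) {
  return (mask[i / 8] >> (i % 8)) & 1U;
}
```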

Performance tests are run using the script below:

#!/usr/bin/python
import mxnet as mx
from mxnet import nd

from benchmark.opperf.utils.benchmark_utils import run_performance_test

mx.random.seed(17)
context = mx.cpu()
res = run_performance_test(nd.Dropout, run_backward=True, dtype='float32', ctx=context,
                           inputs=[
                               {"data" : (1024, 1024), "cudnn_off" : "False"}
                           ],
                           warmup=20, runs=100, profiler='native')
print(res)

Results:

| Build flavor     | fwd time (master) | fwd time (PR) | bwd time (master) | bwd time (PR) | memory (master) | memory (PR) |
|------------------|-------------------|---------------|-------------------|---------------|-----------------|-------------|
| CPU w/ MKL BLAS  | 0.19              | 0.74          | 0.06              | 0.17          | 6291.45         | 2162.68     |
| CPU w/o MKL BLAS | 0.73              | 0.89          | 0.07              | 0.11          | 6291.45         | 4259.83     |
| GPU w/ cuDNN     | 0.16              | 0.15          | 0.11              | 0.11          | 4194.30         | 2162.68     |
| GPU w/o cuDNN    | 4.36              | 2.64          | 0.13              | 0.11          | 4194.30         | 2162.68     |

(Times in ms, memory in KB, as reported by the MXNet operator profiler.)

Time measured in Python using the script from #13896:

| Build         | master  | this PR |
|---------------|---------|---------|
| GPU w/ cuDNN  | 25.9 ms | 25.8 ms |
| GPU w/o cuDNN | 1.34 s  | 1.35 s  |
| CPU w/ MKL    | 262 ms  | 337 ms  |
| CPU w/o MKL   | 359 ms  | 426 ms  |

@eric-haibin-lin @TaoLv @PatricZhao @ptrendx @roywei please help to review

@apeforest apeforest changed the base branch from benchmark to master November 6, 2019 06:43
@apeforest apeforest changed the title Use single-bit for mask in dropout operator [DO NOT MERGE] Use single-bit for mask in dropout operator Nov 6, 2019
@apeforest apeforest changed the title [DO NOT MERGE] Use single-bit for mask in dropout operator Use single-bit for mask in dropout operator Dec 22, 2019
src/operator/nn/dropout.cc (outdated review thread)
@eric-haibin-lin (Member) left a comment:

@TaoLv @PatricZhao can someone review the CPU changes?

src/operator/nn/dropout.cc (outdated review thread)
});
// mask_out is set per bit position
// therefore bitwise shift need to be performed here
auto maskIdx = i / 8;
Member:

maskIdx -> mask_idx

Member:

Same comment for offset, val

apeforest (author):

will do

bool maskVal = mshadow_op::threshold_eq::Map<real_t>(rand_num, pkeep);
if (maskVal) {
// set bit
mask_out[maskIdx] |= 1U << maskOffset;
Member:

Will this lead to a race condition if the same maskIdx is set by multiple threads? Should each thread handle at least 8 bits?

apeforest (author):

Good catch. I was thinking of setting the step to 8 but forgot to update it in the macro.

apeforest (author):

After checking into it more, I found that ideally this should not happen, because RandGenerator<xpu>::kMinNumRandomPerThread is 64 and therefore by design the step size inside LaunchRNG should be a multiple of 8. But then I looked into that piece of code again and found what looks like a bug in calculating the step. Please review my latest change in src/operator/random/sampler.h and let me know if it makes sense. Thanks.
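
For reference, the race goes away if each parallel work item owns a whole mask byte, i.e. 8 consecutive elements. A rough standalone sketch of that idea, with hypothetical names (in, out, mask_out, rand01); the actual kernel goes through RandGenerator/LaunchRNG instead:

```cpp
#include <cstdint>

// Sketch only: each loop iteration owns one mask byte (8 consecutive
// elements), so no two OpenMP threads ever write the same byte of mask_out.
void dropout_forward_bitmask(const float* in, float* out, uint8_t* mask_out,
                             int64_t count, float pkeep,
                             float (*rand01)(int64_t)) {
  const int64_t nbytes = (count + 7) / 8;
#pragma omp parallel for
  for (int64_t byte = 0; byte < nbytes; ++byte) {
    uint8_t bits = 0;
    for (int k = 0; k < 8; ++k) {
      const int64_t i = byte * 8 + k;
      if (i >= count) break;                 // tail of the last byte
      if (rand01(i) < pkeep) {               // keep with probability pkeep
        bits |= static_cast<uint8_t>(1U << k);
        out[i] = in[i] / pkeep;              // scale kept elements
      } else {
        out[i] = 0.0f;
      }
    }
    mask_out[byte] = bits;                   // single writer per byte
  }
}
```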

Member:

Is this for loop parallelized?

Member:

In general I do not recommend writing code this way. There is no documentation or guarantee that kMinNumRandomPerThread will always be greater than 8 in the future, nor does the dropout operator document any assumption about the value of kMinNumRandomPerThread. The code is delicate and will break if some contributor changes kMinNumRandomPerThread to a value like 4. If there is an assumption, we should add an explicit check so that it won't be broken in the future.
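
One way to make that assumption explicit is a compile-time guard next to the dropout kernel; a sketch only, using the constant discussed in this thread:

```cpp
// Sketch of an explicit guard: fail the build if the per-thread RNG step can
// no longer be grouped into whole mask bytes.
// (xpu stands for the concrete device type in the instantiated kernel.)
static_assert(RandGenerator<xpu>::kMinNumRandomPerThread % 8 == 0,
              "dropout bit-mask kernel assumes each RNG thread covers a "
              "multiple of 8 elements");
```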

apeforest (author):

Fair point. I will refactor this piece of code.

src/operator/nn/dropout-inl.h (outdated review thread)
});
// mask_out is set per bit position
// therefore bitwise shift need to be performed here
auto maskIdx = i / 8;
Member:

Will this lead to a race condition?

apeforest (author):

See comment above.

@eric-haibin-lin (Member) left a comment:

potential race condition

@TaoLv (Member) left a comment:

It will help save memory, but I'm curious about the performance impact.

src/operator/nn/dropout-inl.h (outdated review thread)
src/operator/nn/dropout-inl.h (outdated review thread)
src/operator/nn/dropout-inl.h (outdated review thread)
tests/python/unittest/test_operator.py (outdated review thread)
@apeforest apeforest force-pushed the perf/dropout-mask branch 2 times, most recently from 4457579 to 78a40d5 Compare December 26, 2019 22:54

TaoLv commented Jan 4, 2020

@apeforest Thank you for the nice work! Do you have any numbers to share?

  • memory usage of a model in which dropout workspace used to be a problem?
  • operator performance benchmark?

@eric-haibin-lin (Member):

For GPT-2, the memory usage goes from 30GB to 26GB. For BERT, it goes from 26GB to 23GB. I didn't notice much difference in training throughput.


apeforest commented Jan 10, 2020

@TaoLv Thanks for your review. I ran operator profiling using benchmark.opperf.utils.benchmark_utils.run_performance_test. The results show a speedup in the forward pass but some degradation in the backward pass.

w/ this change:

[{'Dropout': [{'avg_time_forward_Dropout': 1.3266, 'max_storage_mem_alloc_cpu/0': 4259.8398, 'avg_time_backward_Dropout': 0.2682, 'inputs': {'data': (1024, 1024), 'p': 0.5}}]}]

w/o this change:

[{'Dropout': [{'avg_time_forward_Dropout': 1.7864, 'max_storage_mem_alloc_cpu/0': 6291.4561, 'avg_time_backward_Dropout': 0.1836, 'inputs': {'data': (1024, 1024), 'p': 0.5}}]}]

@apeforest apeforest force-pushed the perf/dropout-mask branch 2 times, most recently from 38c021a to 3874110 Compare January 10, 2020 18:17

TaoLv commented Jan 12, 2020

@apeforest Thank you for testing it out. Given that memory is not always a concern, can we make the bit mask an option for dropout?

@eric-haibin-lin (Member):

@TaoLv I don't think adding an option is necessary. Can we improve the backward kernel?


TaoLv commented Jan 13, 2020

@apeforest Could you please also test the operator performance with USE_BLAS=mkl?

@pengzhao-intel (Contributor):

> I'm ok with the result. @TaoLv any concern?

> It's still 1.36x slower. I will take another look today.

> If sacrificing performance (to some extent) can help improve usability, I think we need to consider the trade-off.

> As I mentioned above, I'm not taking this as a general usability issue. I don't think we want to sacrifice the performance on CPU while memory size is not a concern there. @pengzhao-intel

The performance drop is a concern for us because we have been working on model training recently. I didn't follow up on all the discussions in this thread, so one quick question: does the slowdown come from more computation in the new algorithm or from a sub-optimal implementation?

@roywei (Member) left a comment:

The cuDNN part LGTM. One concern is that the speed reported by the profiler is quite different from what was measured on the Python side here.

Let's make sure we know the end-to-end performance impact in Python.


apeforest commented Feb 14, 2020

@roywei Using the test script in #13896:

| Build         | runtime (before)  | runtime (after)  |
|---------------|-------------------|------------------|
| CPU w/ MKL    | 262 ms ± 1.2 ms   | 337 ms ± 12.5 ms |
| CPU w/o MKL   | 359 ms ± 241 µs   | 426 ms ± 222 µs  |
| GPU w/ cuDNN  | 25.9 ms ± 202 µs  | 25.8 ms ± 183 µs |
| GPU w/o cuDNN | 1.34 s ± 5.83 ms  | 1.35 s ± 13.1 ms |

Using a Python timer to measure CPU performance with MKL:

This PR:

[{'Dropout': [{'avg_time_Dropout': 1.1714265774935484, 'p50_time_Dropout': 1.1715246364474297, 'p90_time_Dropout': 1.190436165779829, 'p99_time_Dropout': 1.2154309218749404, 'inputs': {'data': (1024, 1024)}}]}]

Master:

[{'Dropout': [{'avg_time_Dropout': 0.6394564639776945, 'p50_time_Dropout': 0.6996351294219494, 'p90_time_Dropout': 1.045508868992329, 'p99_time_Dropout': 1.59036863129586, 'inputs': {'data': (1024, 1024)}}]}]


TaoLv commented Feb 14, 2020

Does the avg_time_Dropout include backward time? @apeforest

auto mask_idx = i >> 3; // div 8;
uint8_t mask_offset = i & 7; // mod 8
bool mask_val = maskptr[mask_idx] & (1U << mask_offset);
ingradptr[i] = outgradptr[i] * mask_val * pk_1;
Member:

Let's also use blocking in the backward path:

    const int blk_size = 64;
    const int nblk = count / blk_size;

#pragma omp parallel for num_threads(nthr) schedule(static, 8)
    for (index_t b = 0; b < nblk; ++b) {
      for (index_t k = 0; k < blk_size; ++k) {
        index_t i = b * blk_size + k;
        auto mask_idx = i >> 3;  // div 8;
        uint8_t mask_offset = i & 7;  // mod 8
        bool mask_val = maskptr[mask_idx] & (1U << mask_offset);
        ingradptr[i] = outgradptr[i] * mask_val * pk_1;
      }
    }

    // tail
    if (nblk * blk_size < count) {
      for (index_t i = nblk * blk_size; i < count; ++i) {
        auto mask_idx = i >> 3;  // div 8;
        uint8_t mask_offset = i & 7;  // mod 8
        bool mask_val = maskptr[mask_idx] & (1U << mask_offset);
        ingradptr[i] = outgradptr[i] * mask_val * pk_1;
      }
    }
  }

apeforest (author):

Sure

@apeforest (author) commented on Feb 14, 2020:

After more thought, I think we don't actually need blocking in the backward pass, as there is no write to maskptr and hence no cache eviction or race condition.

Member:

We're writing to ingradptr. We also want the elements in one cache line to be handled by one OpenMP thread. With the original parallelization, one cache line is loaded but only one element in it is handled by the current thread; the next thread needs to load the same cache line to handle the next element.

apeforest (author):

However, there is no read from ingradptr, so this is not a case of false sharing, right? I tried this blocking and didn't notice any measurable performance gain.

@apeforest (author):

> Does the avg_time_Dropout include backward time? @apeforest

Yes, it includes backward time, since run_backward is set to True.


apeforest commented Feb 14, 2020

> Does the slowdown come from more computation in the new algorithm or from a sub-optimal implementation?

@PatricZhao The slowdown comes from extra computation in the new algorithm when Dropout uses the MKL implementation. MKL already computes the mask but stores each mask value as an integer. The new algorithm simply repacks this int32-based mask into a bit-based mask, which introduces extra runtime. Ideally, MKL dropout would be enhanced to store the mask using bits, but that requires modifying the VSL APIs.
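
For illustration, the extra work in the MKL path boils down to a repacking pass over the VSL output; a sketch with hypothetical buffer names (vsl_mask, bit_mask):

```cpp
#include <cstdint>
#include <cstring>

// Sketch only: VSL has already written one int per element (1 = keep,
// 0 = drop) into vsl_mask. Repacking into one bit per element costs an
// extra pass over the data, which is the overhead discussed above.
void repack_mask_to_bits(const int32_t* vsl_mask, uint8_t* bit_mask,
                         int64_t count) {
  std::memset(bit_mask, 0, static_cast<size_t>((count + 7) / 8));
  for (int64_t i = 0; i < count; ++i) {
    if (vsl_mask[i]) {
      bit_mask[i / 8] |= static_cast<uint8_t>(1U << (i % 8));
    }
  }
}
```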


TaoLv commented Feb 15, 2020

> Does the slowdown come from more computation in the new algorithm or from a sub-optimal implementation?

The new implementation increases both the memory load and the number of bit-wise operations, so a performance slowdown is expected.

@pengzhao-intel (Contributor):

> Does the slowdown come from more computation in the new algorithm or from a sub-optimal implementation?

> The new implementation increases both the memory load and the number of bit-wise operations, so a performance slowdown is expected.

What algorithm is used in TF and PyTorch?


TaoLv commented Feb 15, 2020

> What algorithm is used in TF and PyTorch?

@pengzhao-intel I don't think TF has a fused dropout operator; it's implemented with several small operators. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn_ops.py#L4456. So the backward path goes through the backward of these small operators, and there is no bit mask there.

For PyTorch, there is a fused one: https://github.com/pytorch/pytorch/blob/master/tools/autograd/templates/Functions.cpp#L634. The mask tensor is either Boolean or has a type compatible with grad, so there is no bit mask either.

> Ideally, MKL dropout would be enhanced to store the mask using bits, but that requires modifying the VSL APIs.

@apeforest, so far there is no dropout functionality in MKL or MKL-DNN; here we only use VSL to generate random values. So even if we could generate a bit mask, it would add extra computation for mask_val * 1.0 / pkeep, which could otherwise be reused from the forward path.

@eric-haibin-lin (Member) left a comment:

> The new implementation increases both the memory load and the number of bit-wise operations, so a performance slowdown is expected.

Why does it increase the memory load?

Is there any plan for MKLDNN to support fast dropout with a bit mask like cuDNN? I think reducing memory consumption is quite important. CPU does not have a memory capacity issue, but it is one for most GPUs and ASICs. I'd push for an efficient implementation from MKLDNN in the long term.


apeforest commented Feb 15, 2020

> Does the slowdown come from more computation in the new algorithm or from a sub-optimal implementation?

> The new implementation increases both the memory load and the number of bit-wise operations, so a performance slowdown is expected.

The memory load is actually reduced, even in the MKL case, right? Please refer to the test results in the PR description.


TaoLv commented Feb 16, 2020

> Why does it increase the memory load?

If there are N elements, then for the Bernoulli distribution generation in VSL we still need to allocate memory and write N*4 bytes to it. To generate the bit mask, we then need to load those N*4 bytes back and write N/8 bytes of bits.

@apeforest (author):

> Why does it increase the memory load?
>
> If there are N elements, then for the Bernoulli distribution generation in VSL we still need to allocate memory and write N*4 bytes to it. To generate the bit mask, we then need to load those N*4 bytes back and write N/8 bytes of bits.

The memory for the bit mask is not extra memory: N*sizeof(DType) was already used in the master branch: https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/dropout.cc#L124

So for the MKL dropout case, the master branch uses N*4 + N*sizeof(DType) bytes of memory vs. N*4 + N/8 bytes in this PR. This memory reduction is verified by the MXNet profiler results reported in the PR description.
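
To make the mask accounting concrete for the (1024, 1024) float32 case benchmarked above (illustrative arithmetic only; the profiler totals also include the output tensor and other workspace):

```cpp
#include <cstddef>

// Mask storage only, for N = 1024 * 1024 float32 elements:
//   master:  N * sizeof(float) = 4,194,304 bytes (~4 MiB, one float per value)
//   this PR: (N + 7) / 8       =   131,072 bytes (128 KiB, one bit per value)
// The N * 4 bytes written by VSL for the Bernoulli draw are the same in both.
constexpr std::size_t N = 1024 * 1024;
constexpr std::size_t mask_bytes_master = N * sizeof(float);
constexpr std::size_t mask_bytes_pr = (N + 7) / 8;
```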

@eric-haibin-lin (Member):

@PatricZhao @TaoLv what do you suggest as the resolution? If CPU performance is a concern, shall we add an environment variable to control the behavior? Do you agree that in the long term we want to push for a dropout API in MKLDNN with a 1-bit mask?

@apeforest (author):

Given your concern about the performance degradation in the case of MKL dropout, I have disabled this feature when MKL dropout is used. Please review the PR again and let me know if you think this is good to go. Thanks!

@TaoLv (Member) left a comment:

Thank you for turning this around, @apeforest. It looks good to me in general, but I notice that there are cases failing on dropout. I can approve once they are fixed. Thanks!


apeforest commented Feb 22, 2020

Hi @TaoLv and @PatricZhao, I reverted my last commit, "Do not use bit-mask when MKL dropout is used."

It makes the code too brittle and also involves very complicated logic to check memory allocation at runtime. Here are the main reasons:

(1) MKL dropout support is currently incomplete. It does not work if the input data type is smaller than int32, and it does not support the broadcast option (when the axes option is specified). This limitation forces a check at runtime, which is not possible in the InferShape function.

E.g., in this function I would need to check whether the dtype is greater than int32 in order to use a different shape for MKL dropout:
https://github.com/apache/incubator-mxnet/pull/16735/files#diff-74c4dc433970c5df31a5e2c4b57c8d71R127

(2) Having a different dropout engine at runtime (based on data type and ...) may cause inconsistency in the mixed-precision case. Introducing another difference in mask memory allocation complicates this even further.

I think we should focus on enhancing MKL dropout so that it (1) supports all the cases that non-MKL dropout supports and (2) supports the bit mask.

Please let me know what you think. Thanks!

Lin

@eric-haibin-lin (Member) left a comment:

There's an RFC for 1-bit dropout in MKLDNN that we can leverage: oneapi-src/oneDNN#656 (comment)

@eric-haibin-lin added the pr-awaiting-response label (PR is reviewed and waiting for contributor to respond) on Jul 28, 2020
@sxjscience (Member):

Is there anyone who can take a look at this PR?

Labels: pr-awaiting-response (PR is reviewed and waiting for contributor to respond)
Projects: none yet
Development: successfully merging this pull request may close the issue "1 bit mask for Dropout"
7 participants