
BatchNorm with axis=-1 is much slower than axis=1 #18646

Closed
stu1130 opened this issue Jun 30, 2020 · 4 comments

Comments

@stu1130
Contributor

stu1130 commented Jun 30, 2020

Description

import mxnet as mx
from mxnet import autograd, np, npx, gluon, init
from mxnet.gluon import nn
import time

npx.set_np()

data = mx.np.random.uniform(size=(32, 100, 100), ctx=mx.gpu())
label = mx.np.ones((32, 100, 100), ctx=mx.gpu())
net = nn.Sequential()
net.add(nn.BatchNorm(axis=-1))  # channels-last; change to axis=1 to compare
net.initialize(init.Xavier(), ctx=mx.gpu())
loss = gluon.loss.L2Loss()
t = time.time()
for _ in range(5000):
    with autograd.record():
        l = loss(net(data), label)
    l.backward()
mx.nd.waitall()  # block until all asynchronous GPU work has finished
print('spent: {}s'.format(time.time() - t))

MXNet version: static build from branch v1.7.x, commit 75ab155.
On a P3.8xlarge (V100) I got around 5 s with axis=1 and around 30 s with axis=-1.
In both cases each of the 100 channels is normalized over 32 * 100 elements, so the amount of work is the same.
Similar to #10095.
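
For a direct comparison, here is a small sketch that reuses data and label (and the imports) from the snippet above and times the same training loop for both axis settings; the benchmark helper is just for illustration, and a warm-up pass keeps lazy initialization out of the measurement.

def benchmark(axis, iters=5000):
    # same one-layer network as above; only the BatchNorm axis changes
    net = nn.Sequential()
    net.add(nn.BatchNorm(axis=axis))
    net.initialize(init.Xavier(), ctx=mx.gpu())
    loss = gluon.loss.L2Loss()
    # warm-up pass so deferred parameter initialization is not timed
    with autograd.record():
        l = loss(net(data), label)
    l.backward()
    mx.nd.waitall()
    t = time.time()
    for _ in range(iters):
        with autograd.record():
            l = loss(net(data), label)
        l.backward()
    mx.nd.waitall()  # wait for all asynchronous GPU work before stopping the clock
    return time.time() - t

for axis in (1, -1):
    print('axis={}: {:.1f}s'.format(axis, benchmark(axis)))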

Solution

Thanks to @ptrendx for pointing out that cuDNN 7.4 (https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_7xx.html#rel_741) added the new cudnnBatchNormalization*Ex API, which gives much better speed for axis=-1.

@stu1130 stu1130 added the Bug label Jun 30, 2020
@wkcn
Member

wkcn commented Jul 1, 2020

The reason is that MKLDNN and cuDNN are only used when axis = 1.
The open PR #18504 fixes it.

However, we plan to replace the mkldnn_off and cudnn_off attributes with environment variables, so that PR is currently blocked.
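
In the meantime, a possible workaround (just a sketch; ChannelsLastBatchNorm is a hypothetical name, not an existing Gluon block) is to swap the channel axis to position 1 so the accelerated axis=1 path is taken, then swap it back afterwards:

import mxnet as mx
from mxnet import gluon, np, npx
from mxnet.gluon import nn

npx.set_np()

class ChannelsLastBatchNorm(gluon.Block):
    # Hypothetical wrapper: run BatchNorm over the last axis by moving it
    # to position 1 for the normalization and moving it back afterwards.
    def __init__(self, **kwargs):
        super().__init__()
        self.bn = nn.BatchNorm(axis=1, **kwargs)

    def forward(self, x):
        x = np.swapaxes(x, 1, -1)      # (N, ..., C) -> (N, C, ...)
        x = self.bn(x)                 # takes the accelerated axis=1 path
        return np.swapaxes(x, 1, -1)   # restore the channels-last layout

The two axis swaps add their own overhead, so this only pays off when the accelerated BatchNorm saves more than the transposes cost; it is meant as a stopgap until the PR is merged.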

@stu1130
Contributor Author

stu1130 commented Jul 1, 2020

@wkcn Thanks for your detailed explanation.
So I think there are two phases:

  1. enable cuDNN when the axis is not 1
  2. use cudnnBatchNormalizationForwardTrainingEx for the NHWC case (I checked the source code; we are currently always using cudnnBatchNormalizationForwardTraining; see the profiling sketch below)
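
A rough way to see where the time goes is the built-in MXNet profiler; this sketch assumes the net, data, label, and loss from the first snippet and reports aggregated per-operator GPU time (it does not name the underlying cuDNN kernel, but it makes the BatchNorm cost visible):

from mxnet import profiler

profiler.set_config(profile_all=True, aggregate_stats=True, filename='bn_profile.json')
profiler.set_state('run')
for _ in range(100):
    with autograd.record():
        l = loss(net(data), label)
    l.backward()
mx.nd.waitall()             # make sure all pending GPU work is recorded
profiler.set_state('stop')
print(profiler.dumps())     # aggregated per-operator statistics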

@wkcn wkcn linked a pull request Jul 1, 2020 that will close this issue
@chinakook
Contributor

I think the NHWC layout is very important in point cloud algorithms.

@stu1130
Contributor Author

stu1130 commented Jul 9, 2020

I have verified that the performance of axis=-1 is almost the same as axis=1 after the fix in #18504. Closing the issue.

@stu1130 stu1130 closed this as completed Jul 9, 2020