
BatchNorm with axis=-1 is much slower than axis=1 #18646

Closed
stu1130 opened this issue Jun 30, 2020 · 4 comments

Comments

@stu1130
Contributor

stu1130 commented Jun 30, 2020

Description

import mxnet as mx
from mxnet import autograd, np, npx, gluon, init
from mxnet.gluon import nn
import time

npx.set_np()

data = mx.np.random.uniform(size=(32, 100, 100), ctx=mx.gpu())
label = mx.np.ones((32, 100, 100), ctx=mx.gpu())
net = nn.Sequential()
net.add(nn.BatchNorm(axis=-1))  # channels-last; change to axis=1 to compare
net.initialize(init.Xavier(), ctx=mx.gpu())
loss = gluon.loss.L2Loss()
t = time.time()
for _ in range(5000):
    with autograd.record():
        l = loss(net(data), label)
    l.backward()
mx.nd.waitall()  # block until all asynchronous GPU work has finished
print('spent: {}s'.format(time.time() - t))

MXNet version: static build from branch v1.7.x, commit 75ab155.
On a P3.8xlarge (V100) I got around 5 s with axis=1 and around 30 s with axis=-1.
In both cases each of the 100 channels is normalized over 32 * 100 elements, so the amount of work is the same.
Similar to #10095.
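
For a direct comparison, here is a small sketch that reuses data and label (and the imports) from the snippet above and times the same training loop for both axis settings; the benchmark helper is just for illustration, and a warm-up pass keeps lazy initialization out of the measurement.

def benchmark(axis, iters=5000):
    # same one-layer network as above; only the BatchNorm axis changes
    net = nn.Sequential()
    net.add(nn.BatchNorm(axis=axis))
    net.initialize(init.Xavier(), ctx=mx.gpu())
    loss = gluon.loss.L2Loss()
    # warm-up pass so deferred parameter initialization is not timed
    with autograd.record():
        l = loss(net(data), label)
    l.backward()
    mx.nd.waitall()
    t = time.time()
    for _ in range(iters):
        with autograd.record():
            l = loss(net(data), label)
        l.backward()
    mx.nd.waitall()  # wait for all asynchronous GPU work before stopping the clock
    return time.time() - t

for axis in (1, -1):
    print('axis={}: {:.1f}s'.format(axis, benchmark(axis)))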

Solution

Thanks to @ptrendx for pointing out that cuDNN 7.4 (https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_7xx.html#rel_741) added the new cudnnBatchNormalization*Ex API, which gives much better speed for axis=-1.

@stu1130 stu1130 added the Bug label Jun 30, 2020
@wkcn
Member

wkcn commented Jul 1, 2020

The reason is that MKLDNN and cuDNN are only used when axis = 1.
The open PR #18504 fixes it.

However, we plan to replace the mkldnn_off and cudnn_off attributes with environment variables, so that PR is currently blocked.
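
In the meantime, a possible workaround (just a sketch; ChannelsLastBatchNorm is a hypothetical name, not an existing Gluon block) is to swap the channel axis to position 1 so the accelerated axis=1 path is taken, then swap it back afterwards:

import mxnet as mx
from mxnet import gluon, np, npx
from mxnet.gluon import nn

npx.set_np()

class ChannelsLastBatchNorm(gluon.Block):
    # Hypothetical wrapper: run BatchNorm over the last axis by moving it
    # to position 1 for the normalization and moving it back afterwards.
    def __init__(self, **kwargs):
        super().__init__()
        self.bn = nn.BatchNorm(axis=1, **kwargs)

    def forward(self, x):
        x = np.swapaxes(x, 1, -1)      # (N, ..., C) -> (N, C, ...)
        x = self.bn(x)                 # takes the accelerated axis=1 path
        return np.swapaxes(x, 1, -1)   # restore the channels-last layout

The two axis swaps add their own overhead, so this only pays off when the accelerated BatchNorm saves more than the transposes cost; it is meant as a stopgap until the PR is merged.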

@stu1130
Contributor Author

stu1130 commented Jul 1, 2020

@wkcn Thanks for your detailed explanation.
So I think there are two phases:

  1. enable cuDNN when the axis is not 1
  2. use cudnnBatchNormalizationForwardTrainingEx for the NHWC case (I checked the source code; we are currently always using cudnnBatchNormalizationForwardTraining; see the profiling sketch below)
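
A rough way to see where the time goes is the built-in MXNet profiler; this sketch assumes the net, data, label, and loss from the first snippet and reports aggregated per-operator GPU time (it does not name the underlying cuDNN kernel, but it makes the BatchNorm cost visible):

from mxnet import profiler

profiler.set_config(profile_all=True, aggregate_stats=True, filename='bn_profile.json')
profiler.set_state('run')
for _ in range(100):
    with autograd.record():
        l = loss(net(data), label)
    l.backward()
mx.nd.waitall()             # make sure all pending GPU work is recorded
profiler.set_state('stop')
print(profiler.dumps())     # aggregated per-operator statistics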

@wkcn wkcn linked a pull request Jul 1, 2020 that will close this issue
@chinakook
Contributor

I think the NHWC layout is very important in point cloud algorithms.

@stu1130
Contributor Author

stu1130 commented Jul 9, 2020

I have verified that the performance of axis=-1 is almost the same as axis=1 after the fix in #18504. Closing the issue.

@stu1130 stu1130 closed this as completed Jul 9, 2020