
incorrect grad of gluon.nn.BatchNorm when scale=False #16297

Closed
shesung opened this issue Sep 27, 2019 · 5 comments · Fixed by #18500

@shesung
Contributor

shesung commented Sep 27, 2019

When using gluon.nn.BatchNorm(scale=False) on GPU, the computed gradient for beta is incorrect: it appears to be accumulated across iterations instead of being overwritten.

With scale=True, or when running on CPU, the gradient is correct.

This problem can make a network hard to converge during training.

Environment info (Required)

CentOS Linux release 7.2.1511 (Core)
GTX 1080Ti
Driver Version: 384.69
CUDA Version 9.0.176

installed with pip:
numpy 1.17.2
mxnet-cu90 1.5.0

Code

In this example, the gradient of beta should be [1, 1, 1] at each iteration, since d(out)/d(beta) is 1 for each of the 3 channels.

import mxnet as mx
from mxnet import gluon, autograd

ctx = mx.gpu()
x = mx.nd.ones((1,3,1,1), ctx=ctx)

net = gluon.nn.BatchNorm(scale=False, epsilon=2e-5, momentum=0.0)
net.initialize(ctx=ctx)
trainer = gluon.Trainer(params=net.collect_params(),
                        optimizer='sgd',
                        optimizer_params={'learning_rate': 0.01, 'wd': 0.0005, 'momentum': 0.9})
net.hybridize()

for i in range(10):
    with autograd.record():
        out = net(x)
    out.backward()
    trainer.step(x.shape[0])
    for name, param in net.collect_params().items():
        if 'beta' in name:
            print(name, param.grad(ctx).asnumpy())

output:

batchnorm0_beta [1. 1. 1.]
batchnorm0_beta [2. 2. 2.]
batchnorm0_beta [3. 3. 3.]
batchnorm0_beta [4. 4. 4.]
batchnorm0_beta [5. 5. 5.]
batchnorm0_beta [6. 6. 6.]
batchnorm0_beta [7. 7. 7.]
batchnorm0_beta [8. 8. 8.]
batchnorm0_beta [9. 9. 9.]
batchnorm0_beta [10. 10. 10.]
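If the extra values are just leftover accumulation from earlier iterations (which is what the output suggests), explicitly clearing the gradients before every backward pass may work around it for now. The following is an untested sketch (reusing net, x, trainer and ctx from the script above), not a verified fix:

# Untested stopgap sketch: clear all gradients before each backward pass,
# assuming the spurious values are purely accumulated leftover state.
for i in range(10):
    net.collect_params().zero_grad()   # reset beta's gradient buffer to zeros
    with autograd.record():
        out = net(x)
    out.backward()
    trainer.step(x.shape[0])
    for name, param in net.collect_params().items():
        if 'beta' in name:
            print(name, param.grad(ctx).asnumpy())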
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended label(s): Gluon, Bug

@lanking520
Member

Hi @shesung, have you tried setting scale=False and running on CPU? Does the output come out correctly there?

@shesung
Contributor Author

shesung commented Sep 29, 2019

@lanking520 Yes, it's correct on CPU.
I also found that when the layer is defined with the symbol API, the result is correct.
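For reference, a minimal sketch of what such a symbol-API check could look like (illustrative only, not the exact script used; fix_gamma=True is the symbol-API counterpart of scale=False):

import mxnet as mx

ctx = mx.gpu()
data = mx.sym.Variable('data')
bn = mx.sym.BatchNorm(data=data, fix_gamma=True, eps=2e-5, momentum=0.0, name='bn')

# simple_bind allocates the argument, gradient and aux arrays for the symbol
exe = bn.simple_bind(ctx, data=(1, 3, 1, 1), grad_req='write')
exe.arg_dict['data'][:] = 1
exe.arg_dict['bn_gamma'][:] = 1
exe.arg_dict['bn_beta'][:] = 0

for i in range(10):
    exe.forward(is_train=True)
    exe.backward(mx.nd.ones_like(exe.outputs[0]))
    # per the behaviour described above, this stays [1. 1. 1.] every iteration
    print(exe.grad_dict['bn_beta'].asnumpy())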

@samskalicky
Contributor

@zachgk assign [@szha ]

@wkcn
Member

wkcn commented Jun 10, 2020

Hi @shesung, the bug has been fixed in #18500.
Thank you for the report!
