
incorrect grad of gluon.nn.BatchNorm when scale=False #16297

Closed
shesung opened this issue Sep 27, 2019 · 5 comments · Fixed by #18500

@shesung
Contributor

shesung commented Sep 27, 2019

When using gluon.nn.BatchNorm(scale=False) on GPU, the computed gradient for beta is incorrect: it appears to be accumulated across iterations instead of being overwritten.

With scale=True, or when running on CPU, the gradient is correct.

This problem can make a network hard to converge during training.

Environment info (Required)

CentOS Linux release 7.2.1511 (Core)
GTX 1080Ti
Driver Version: 384.69
CUDA Version 9.0.176

installed with pip:
numpy 1.17.2
mxnet-cu90 1.5.0

Code

In this example, the gradient of beta should be [1, 1, 1] at each iteration, since d(out)/d(beta) is 1 for each of the 3 channels.

import mxnet as mx
from mxnet import gluon, autograd

ctx = mx.gpu()
x = mx.nd.ones((1,3,1,1), ctx=ctx)

net = gluon.nn.BatchNorm(scale=False, epsilon=2e-5, momentum=0.0)
net.initialize(ctx=ctx)
trainer = gluon.Trainer(params=net.collect_params(),
                        optimizer='sgd',
                        optimizer_params={'learning_rate': 0.01, 'wd': 0.0005, 'momentum': 0.9})
net.hybridize()

for i in range(10):
    with autograd.record():
        out = net(x)
    out.backward()
    trainer.step(x.shape[0])
    for name, param in net.collect_params().items():
        if 'beta' in name:
            print(name, param.grad(ctx).asnumpy())

output:

batchnorm0_beta [1. 1. 1.]
batchnorm0_beta [2. 2. 2.]
batchnorm0_beta [3. 3. 3.]
batchnorm0_beta [4. 4. 4.]
batchnorm0_beta [5. 5. 5.]
batchnorm0_beta [6. 6. 6.]
batchnorm0_beta [7. 7. 7.]
batchnorm0_beta [8. 8. 8.]
batchnorm0_beta [9. 9. 9.]
batchnorm0_beta [10. 10. 10.]
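If the extra values are just leftover accumulation from earlier iterations (which is what the output suggests), explicitly clearing the gradients before every backward pass may work around it for now. The following is an untested sketch (reusing net, x, trainer and ctx from the script above), not a verified fix:

# Untested stopgap sketch: clear all gradients before each backward pass,
# assuming the spurious values are purely accumulated leftover state.
for i in range(10):
    net.collect_params().zero_grad()   # reset beta's gradient buffer to zeros
    with autograd.record():
        out = net(x)
    out.backward()
    trainer.step(x.shape[0])
    for name, param in net.collect_params().items():
        if 'beta' in name:
            print(name, param.grad(ctx).asnumpy())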
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended label(s): Gluon, Bug

@lanking520
Member

Hi @shesung, have you tried setting scale=False and running on CPU? Does the output come out correctly there?

@shesung
Contributor Author

shesung commented Sep 29, 2019

@lanking520 Yes, it's correct on CPU.
I also found that when the layer is defined with the symbol API, the result is correct.
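For reference, a minimal sketch of what such a symbol-API check could look like (illustrative only, not the exact script used; fix_gamma=True is the symbol-API counterpart of scale=False):

import mxnet as mx

ctx = mx.gpu()
data = mx.sym.Variable('data')
bn = mx.sym.BatchNorm(data=data, fix_gamma=True, eps=2e-5, momentum=0.0, name='bn')

# simple_bind allocates the argument, gradient and aux arrays for the symbol
exe = bn.simple_bind(ctx, data=(1, 3, 1, 1), grad_req='write')
exe.arg_dict['data'][:] = 1
exe.arg_dict['bn_gamma'][:] = 1
exe.arg_dict['bn_beta'][:] = 0

for i in range(10):
    exe.forward(is_train=True)
    exe.backward(mx.nd.ones_like(exe.outputs[0]))
    # per the behaviour described above, this stays [1. 1. 1.] every iteration
    print(exe.grad_dict['bn_beta'].asnumpy())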

@samskalicky
Contributor

@zachgk assign [@szha ]

@wkcn
Member

wkcn commented Jun 10, 2020

Hi @shesung, the bug has been fixed in #18500.
Thank you for the report!
