This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Updating mxnet from 1.0.0, networks give different outputs #14421

Closed

jmerkow opened this issue Mar 13, 2019 · 16 comments

Comments

@jmerkow
Contributor

jmerkow commented Mar 13, 2019

I am working in a production environment where we have some networks implemented in mxnet 1.0.0. I am updating our systems and trying to move to the latest mxnet (1.4.x as of now), but when we upgrade, our networks produce different outputs.
We are using symbols saved to json files and arg/aux_params stored in .params files. These were all produced by mxnet 1.0.0 or earlier.

When using the latest mxnet (or 1.4.x) we get different outputs for the same inputs with our saved models. I have been trying to use git bisect, or to upgrade version by version, to figure out where this breaking change occurred, but there are issues with the git history and/or some strange (compiler?) incompatibilities which prevent a clean checkout/build for nearly all of the intermediate versions…

These are VERY different outputs; I'm rounding to about the 5th decimal place, or higher, so these aren't numerical precision differences.
And we are using modules, not Gluon (part of the point of the upgrade is to move towards Gluon).

Does anyone have any idea what could be causing this? Or what to look into to solve it?

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@jmerkow jmerkow changed the title Updating mxnet from 1.0.0, network give different ouputs Updating mxnet from 1.0.0, networks give different outputs Mar 13, 2019
@andrewfayres
Contributor

Do you have a sample model which can be shared to reproduce this? There is a backwards compatibility checker for models as part of the nightly pipeline. It trains models on earlier releases and checks for consistency on the latest builds. It's possible there is an edge case which is being missed.

@mxnet-label-bot add [Pending Requestor Info, Bug]

@jmerkow
Contributor Author

jmerkow commented Mar 17, 2019

Could I run the tool and pass along the output instead?
I need to run some checks first as well... I want to:

  1. make sure the weights are the same after being loaded
  2. confirm that there were no changes to any of the functions used for batch transforms on the MXNet side (e.g. resize_short, center_crop, etc.)

@chinakook
Contributor

How do you run inference with your model? If you are using mx.module, please make sure the parameter for_training is set to False.
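
For reference, a minimal inference sketch along those lines; the checkpoint prefix, epoch, and input shape are placeholders, not taken from this issue:

import mxnet as mx

# Placeholder prefix/epoch and input shape; only the for_training=False /
# is_train=False parts matter here.
sym, arg_params, aux_params = mx.model.load_checkpoint('mymodel', 0)

mod = mx.mod.Module(sym, data_names=['data'], label_names=[], context=mx.cpu())
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params)

mod.forward(mx.io.DataBatch([mx.nd.ones((1, 3, 224, 224))]), is_train=False)
print(mod.get_outputs()[0].asnumpy())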

@piyushghai
Contributor

@jmerkow The tool does save and load the same weights and checks the forward pass on the model.
It does not check the intermediate layers in the model graph.

@jmerkow
Contributor Author

jmerkow commented May 20, 2019

Ok I am working on this again.

I've checked that the parameter files are the same, i.e. something like this:

## in a docker with mx1.4.0 loaded
np_arg_params = {key: value.asnumpy() for key, value in arg_params.items()}
np.save('mx140.npy', np_arg_params)
...

## in another docker with mx100 loaded (np_arg_params holds the params loaded there)
import numpy as np
np_arg_params2 = np.load('../140/mx140.npy')[()]  # allow_pickle=True may be needed on newer numpy
assert set(np_arg_params2.keys()).difference(np_arg_params.keys()) == set()
assert set(np_arg_params.keys()).difference(np_arg_params2.keys()) == set()
assert all((np_arg_params[k] - np_arg_params2[k]).sum() == 0 for k in np_arg_params.keys())

I double checked that the actual input to the network is the same (after transforms etc.).

We use the module doing something like this:

mod.forward(batch)  # batch is basically a named tuple with an NDArray at batch.data
output = mod.get_outputs()[0].asnumpy()

EDIT: OK, I went through and saved the outputs of each layer (see the sketch after the op dump below), and I tracked down the first output that has a difference; it's a run-of-the-mill 2D convolution layer:

[{u'attr': {u'kernel': u'(3, 3)',
   u'no_bias': u'True',
   u'num_filter': u'96',
   u'pad': u'(0, 0)',
   u'stride': u'(1, 1)'},
  u'inputs': [[42, 0, 0], [43, 0, 0]],
  u'name': u'in_stem_conv6_3*3_conv2d',
  u'op': u'Convolution'}]
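
A rough sketch of one way to dump per-layer outputs with get_internals() — the checkpoint prefix, epoch, and input shape here are placeholders:

import mxnet as mx
import numpy as np

# Placeholder checkpoint prefix/epoch and input shape.
sym, arg_params, aux_params = mx.model.load_checkpoint('model', 0)

# get_internals() exposes every intermediate symbol; grouping the '*_output'
# entries lets one Module return all layer outputs from a single forward pass.
internals = sym.get_internals()
layer_names = [n for n in internals.list_outputs() if n.endswith('_output')]
group = mx.sym.Group([internals[n] for n in layer_names])

mod = mx.mod.Module(group, data_names=['data'], label_names=[], context=mx.cpu())
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 299, 299))])
mod.set_params(arg_params, aux_params)

mod.forward(mx.io.DataBatch([mx.nd.ones((1, 3, 299, 299))]), is_train=False)
outputs = {name: out.asnumpy() for name, out in zip(layer_names, mod.get_outputs())}
np.save('layer_outputs.npy', outputs)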

@Piyush3dB @andrewfayres

@jmerkow
Contributor Author

jmerkow commented May 20, 2019

Doing some more analysis, it looks like there are some differences at the convolution layer described above, but the differences are relatively minor. However, there is a global pooling layer at the end of the network which seems to have a VERY VERY large difference. I'm using the following to calculate error:

# per-layer mean absolute relative difference between the mx 1.0.0 and mx 1.4.0 runs
for n in layers:
    x, y = output[n], mx140_outputs[n]
    print(n, np.mean([np.abs(((xi - yi) / (xi + 1e-10)).sum()) for xi, yi in zip(x, y)]))

The error values are all less than 0.1, except after the global pooling layer, where I get global_avgpool_output 8.13365e+11.

Was there some change to global pooling that would cause this?

{u'attr': {u'global_pool': u'True',
   u'kernel': u'(8, 8)',
   u'pad': u'(1, 1)',
   u'pool_type': u'avg'},
  u'inputs': [[1229, 0, 0]],
  u'name': u'global_avgpool',
  u'op': u'Pooling'}

Also looking at the network itself, the layer going into the global pool is 5x5, but the kernel is set to 8x8.

@jmerkow
Contributor Author

jmerkow commented May 20, 2019

Ok I may have figured it out.

It appears that global pooling now ignores padding (including Pooling_v1). If I set padding=0 on the global pooling layer, I get the same (for us, incorrect) values in mx 1.0.0 as in mx 1.4.0. Is there a way to prevent this?

I'd like pooling to use the padding values when calculating the global pool.

@Piyush3dB @andrewfayres

@jmerkow
Contributor Author

jmerkow commented May 20, 2019

Here is a small example that you can use to reproduce the issue:
Code:

import mxnet as mx
class Batch(object):
    def __init__(self, data):
        self.data = data
    def get_batch_shape(self):
        return tuple(self.data[0].shape)

data = mx.sym.Variable('data')
gpool = mx.sym.Pooling(data, name='gpooling', global_pool=True, pad=(1,1), pool_type='avg', kernel=(8,8))
mod = mx.mod.Module(gpool, context=mx.gpu(0), label_names=[])
data = Batch([mx.ndarray.ones((1, 3, 5, 5))])
mod.bind(for_training=True, force_rebind=True, data_shapes=[('data', data.get_batch_shape())],)
mod.init_params()
mod.forward(data)
print(mod.get_outputs()[0].asnumpy().squeeze().tolist())

MX 1.0.0 output:

[0.6399999856948853, 0.6399999856948853, 0.6399999856948853]

MX 1.4.0 output:

[1.0, 1.0, 1.0]

This appears to be a bug introduced by #9730...
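
If my reading of the numbers is right, the 0.64 is 16/25: the old global window (reset to the 5x5 input size) was still shifted by the (1,1) padding, so only a 4x4 block of the ones lands inside it, while 1.4.0 ignores the padding and averages the full 5x5. Under that assumption, which I have not verified against the 1.0.0 kernel code, an explicit pad followed by a fixed-kernel average pool reproduces the old value on 1.4.x:

import mxnet as mx

data = mx.sym.Variable('data')
# Explicit zero padding stands in for the pad=(1,1) the old global pool applied.
padded = mx.sym.pad(data, mode='constant', constant_value=0,
                    pad_width=(0, 0, 0, 0, 1, 1, 1, 1))
# Non-global avg pool with kernel/stride fixed to the unpadded 5x5 spatial size,
# so the single window starts inside the padding, as in the old behaviour.
pool = mx.sym.Pooling(padded, pool_type='avg', kernel=(5, 5), stride=(5, 5),
                      name='old_style_gpool')

mod = mx.mod.Module(pool, label_names=[], context=mx.cpu())
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 5, 5))])
mod.init_params()
mod.forward(mx.io.DataBatch([mx.nd.ones((1, 3, 5, 5))]), is_train=False)
print(mod.get_outputs()[0].asnumpy().squeeze().tolist())  # ~[0.64, 0.64, 0.64]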

@larroy
Contributor

larroy commented Jun 14, 2019

So the correct behaviour is the new behaviour; are you requesting a flag to have the option to use the old, incorrect behaviour?

@larroy
Contributor

larroy commented Jun 14, 2019

I think this is the correct behaviour for global average pooling. I don't see why you would want to use global average pooling with the buggy behaviour. After looking into this, I would suggest closing this issue. Also, per the documentation, using global pooling will reset the kernel size to the full image, and padding is not considered when using global pooling.

@larroy
Contributor

larroy commented Jun 17, 2019

I think this is the first paper that proposed global pooling. https://arxiv.org/abs/1312.4400

@jmerkow
Contributor Author

jmerkow commented Jun 17, 2019

@larroy It's not about correct behavior. Networks already trained with the buggy behavior will not work after updating. I was looking for a workaround so we could update without retraining. We have a number of networks trained with the buggy behavior in production, so this is effectively blocking us from upgrading. If you want to close the PR, that's fine; unless there is a reason not to, it might be good to leave it around so people can use it if they hit the same issue.

@larroy
Contributor

larroy commented Jun 17, 2019

I understand your problem, and I'm sorry that this bug caused you trouble when updating to more recent versions; I haven't heard of other users being affected by this. Would you be willing to make a contribution to support having the old behavior? That could be a path forward for your case. Unfortunately, like every resource, our energy is limited and we have to pick our battles; this seems like a corner case, and unfortunate as it is, I don't think we can dedicate effort to supporting a past buggy behaviour from such an old version. If I may ask, why can't you retrain with a more recent version?

@jmerkow
Contributor Author

jmerkow commented Jun 18, 2019

@larroy #15026

larroy added a commit to larroy/mxnet that referenced this issue Jun 18, 2019
This behaviour changed from older MXNet versions in which global pooling
would consider padding. This clarifies the user documentation.

See also apache#14421
aaronmarkham pushed a commit that referenced this issue Jun 27, 2019
This behaviour changed from older MXNet versions in which global pooling
would consider padding. This clarifies the user documentation.

See also #14421
@sandeep-krishnamurthy
Contributor

Closing this issue as discussed in the PR #15026
