Updating mxnet from 1.0.0, networks give different outputs #14421
Comments
Do you have a sample model you can share to reproduce this? There is a backwards-compatibility checker for models as part of the nightly pipeline: it trains models on earlier releases and checks for consistency on the latest builds. It's possible there is an edge case being missed. @mxnet-label-bot add [Pending Requestor Info, Bug]
Could I run the tool and pass along the output instead?
How do you run inference with your model? If you are using mx.module, please make sure the parameter for_training is set to False.
@jmerkow The tool saves and loads the same weights and checks the forward pass on the model.
Ok, I am working on this again. I've checked that the parameter files are the same, i.e. with something like this:
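The original snippet isn't shown in the thread; a check along these lines (a sketch with hypothetical names, comparing two loaded parameter dicts as plain NumPy arrays) would confirm the weights match across versions:

```python
import numpy as np

def params_match(params_a, params_b, atol=0.0):
    """Return True if two dicts of arrays hold the same keys and weights."""
    if set(params_a) != set(params_b):
        return False
    return all(np.allclose(params_a[k], params_b[k], atol=atol)
               for k in params_a)

# params_a / params_b would come from loading the same .params file under
# each MXNet version; hypothetical stand-in data shown here:
a = {"conv0_weight": np.ones((3, 3)), "conv0_bias": np.zeros(3)}
b = {"conv0_weight": np.ones((3, 3)), "conv0_bias": np.zeros(3)}
print(params_match(a, b))  # True
```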
I also double-checked that the actual input to the network is the same (after transforms etc.). We use the module API, doing something like this:
EDIT: Ok, I went through and saved the outputs of each layer, and I tracked down the first output that differs; it's a run-of-the-mill 2D convolution layer:
Doing some more analysis, there are some differences at the convolution layer described above, but they are relatively minor. However, there is a global pooling layer at the end of the network whose output shows a VERY VERY large difference. I'm using the following to calculate error:
The error values are all less than 0.1, except after the global pooling layer. Was there some change to global pooling that would cause this?
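The exact metric isn't shown in the thread; a typical per-layer comparison (a sketch, assuming the saved outputs are plain NumPy arrays) is the maximum element-wise relative difference:

```python
import numpy as np

def max_rel_error(x, y, eps=1e-12):
    """Largest element-wise relative difference between two layer outputs."""
    return float(np.max(np.abs(x - y) / (np.abs(y) + eps)))

# Hypothetical layer outputs from the two MXNet versions:
old_out = np.array([1.00, 2.00, 3.00])
new_out = np.array([1.05, 2.00, 3.00])
print(max_rel_error(new_out, old_out))  # 0.05
```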
Also, looking at the network itself, the input to the global pool is 5x5, but the kernel is set to 8x8.
Ok, I may have figured it out. It appears that global pooling now ignores padding (including Pooling_v1). If I set padding=0 on the global pooling layer, I get the same (incorrect) values in mx 1.0.0 as in mx 1.4.0. Is there a way to prevent this? I'd like pooling to use the padding values when calculating the global pool.
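A rough illustration of the behavioural change in plain NumPy (a sketch, not MXNet code): the old global average pool effectively averaged over the zero-padded window, so the padding zeros diluted the mean, while the new behaviour averages only over the real input pixels.

```python
import numpy as np

def global_avg_pool_old(x, pad):
    """Old behaviour (sketch): zero-pad, then average the padded window."""
    xp = np.pad(x, pad, mode="constant")
    return float(xp.mean())

def global_avg_pool_new(x, pad):
    """New behaviour (sketch): padding is ignored; average the input only."""
    return float(x.mean())

x = np.ones((5, 5))                 # a 5x5 feature map, as in the network above
print(global_avg_pool_old(x, 1))    # 25/49, roughly 0.51
print(global_avg_pool_new(x, 1))    # 1.0
```

The gap between the two values grows with the amount of padding, which would explain why the error after this layer is far larger than at the earlier convolution.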
Here is a small example that you can use to reproduce the issue:
MX 1.0.0 output:
MX 1.4.0 output:
This appears to have been introduced by #9730...
So the correct behaviour is the new behaviour; are you requesting a flag to optionally restore the old, incorrect behaviour?
I think this is the correct behaviour for global average pooling; I don't see why you would want global average pooling with the buggy behaviour. After looking into this, I would suggest closing this issue. Also, per the documentation, using global pooling resets the kernel size to the full image, and padding is not considered when global pooling is used.
I think this is the first paper that proposed global pooling: https://arxiv.org/abs/1312.4400
@larroy It's not about correct behavior. Networks already trained with the buggy behavior will not work after updating. I was looking for a workaround so we could update without retraining. We have a number of networks trained with the buggy behavior in production, so this is effectively blocking us from upgrading. If you want to close the PR, that's fine; but unless there is a reason not to, it might be good to leave it around so people can use it if they hit the same issue.
I understand your problem, and I'm sorry this bug caused you trouble when updating to more recent versions; I haven't heard of other users being affected by it. Would you be willing to contribute support for the old behavior? That could be a path forward for your case. Unfortunately, like every resource, our energy is limited and we have to pick our battles; this seems like a corner case, and unfortunate as it is, I don't think we can dedicate effort to supporting a past buggy behaviour from such an old version. If I may ask, why can't you retrain with a more recent version?
This behaviour changed from older MXNet versions, in which global pooling would consider padding. This clarifies the user documentation. See also apache#14421
Closing this issue as discussed in PR #15026.
I am working in a production environment where we have some networks implemented in mxnet 1.0.0. I am updating our systems and trying to move to the latest mxnet (1.4.x as of now), but when we upgrade, our networks produce different outputs.
We are using symbols saved to JSON files and arg/aux_params stored in .params files; these were all produced by mxnet 1.0.0 or earlier.
When using the latest mxnet (or 1.4.x), we get different outputs for the same inputs with our saved models. I have been trying to use git bisect, or to upgrade slowly through versions, to figure out where this breaking change occurred, but there are issues with the git history and/or some strange (compiler?) incompatibilities which prevent a clean checkout/build for nearly all of the intermediate versions…
These are VERY different outputs; I'm rounding to about the 5th decimal place or higher, so these aren't numerical precision differences.
And we are using modules, not Gluon (part of the point of the upgrade is to move towards Gluon).
Does anyone have any idea what could be causing this? Or what to look into to solve it?