This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Current MXNet-Dev master breaks loading of certain models #15337

Closed
QueensGambit opened this issue Jun 23, 2019 · 9 comments
Labels
Backend Issues related to the backend of MXNet Bug

Comments

@QueensGambit
Contributor

QueensGambit commented Jun 23, 2019

Description

The current MXNet master dev branch (PyPI version 1.5.0b20190623) breaks the loading of certain MXNet models (in both mxnet-mkl and mxnet-cu100) that previously loaded successfully with mxnet==1.4.1.
The model uses grouped depthwise (a.k.a. depthwise separable) convolutions, which could be the cause of this issue, because other models (e.g. CrazyAraFish_0.5.0_RiseV1.zip) still work as usual.

Environment info

I'm using Python, but the same problem also occurs when building the MXNet-CPP package from source.

Error Message:

isready
self.symbol_path: /home/queensgambit/Programming/Deep_Learning/models/risev2/symbol/model-1.19246-0.603-symbol.json
self.params_path: /home/queensgambit/Programming/Deep_Learning/models/risev2/params/model-1.19246-0.603-0223.params
[00:35:51] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.4.1. Attempting to upgrade...
[00:35:51] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
Traceback (most recent call last):
  File "/home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1623, in simple_bind
    ctypes.byref(exe_handle)))
  File "/home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator value: [00:35:51] include/mxnet/./tuple.h:202: Check failed: i >= 0 && i < ndim(): index = 0 must be in range [0, -1)
Stack trace:
  [bt] (0) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b3ab) [0x7f186bc433ab]
  [bt] (1) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2c5343) [0x7f186bcad343]
  [bt] (2) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x298bf82) [0x7f186e373f82]
  [bt] (3) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2471ee2) [0x7f186de59ee2]
  [bt] (4) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2474794) [0x7f186de5c794]
  [bt] (5) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x355) [0x7f186de48455]
  [bt] (6) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Executor::SimpleBind(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*)+0x8a8) [0x7f186de49688]
  [bt] (7) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXExecutorSimpleBindEx+0x221b) [0x7f186dd9884b]
  [bt] (8) /home/queensgambit/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f1872e3eec0]



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "crazyara.py", line 668, in main
    self.setup_network()
  File "crazyara.py", line 166, in setup_network
    model_weights_dir=self.settings["model_weights_dir"]))
  File "/home/queensgambit/Programming/Deep_Learning/CrazyAra/DeepCrazyhouse/src/domain/agent/neural_net_api.py", line 95, in __init__
    force_rebind=True,
  File "/home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1629, in simple_bind
    raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 34, 8, 8)
force_rebind: True
Error in operator value: [00:35:51] include/mxnet/./tuple.h:202: Check failed: i >= 0 && i < ndim(): index = 0 must be in range [0, -1)
Stack trace:
  [bt] (0) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b3ab) [0x7f186bc433ab]
  [bt] (1) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2c5343) [0x7f186bcad343]
  [bt] (2) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x298bf82) [0x7f186e373f82]
  [bt] (3) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2471ee2) [0x7f186de59ee2]
  [bt] (4) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2474794) [0x7f186de5c794]
  [bt] (5) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x355) [0x7f186de48455]
  [bt] (6) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Executor::SimpleBind(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*)+0x8a8) [0x7f186de49688]
  [bt] (7) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXExecutorSimpleBindEx+0x221b) [0x7f186dd9884b]
  [bt] (8) /home/queensgambit/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f1872e3eec0]

Minimum reproducible example

Steps to reproduce

Download release CrazyAra_0.5.0_RiseV2_mobile.zip at:

pip install python-chess

Extract CrazyAra_0.5.0_RiseV2_mobile.zip and run

$ python crazyara.py
$ uci
$ isready

from the command line.
More detailed install instructions can be found here:

Alternatively, you can load the MXNet model from the model/ directory manually in Python.

Does someone have an idea which recent change causes this?
Could you add more automated unit tests to MXNet to ensure that loading of different model types keeps working across version updates?

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@QueensGambit QueensGambit changed the title Current MXNET-Dev master breaks code for loading certain models Current MXNET-Dev master breaks loading certain models Jun 23, 2019
@QueensGambit QueensGambit changed the title Current MXNET-Dev master breaks loading certain models Current MXNET-Dev master breaks loading of certain models Jun 23, 2019
@QueensGambit
Contributor Author

@mxnet-label-bot add [Bug]

@marcoabreu marcoabreu added the Bug label Jun 23, 2019
@QueensGambit
Contributor Author

@mxnet-label-bot add [Backend]

@marcoabreu marcoabreu added the Backend Issues related to the backend of MXNet label Jun 23, 2019
@QueensGambit
Contributor Author

This issue might also be related to #15281.

@QueensGambit QueensGambit changed the title Current MXNET-Dev master breaks loading of certain models Current MXNet-Dev master breaks loading of certain models Jun 25, 2019
@roywei
Member

roywei commented Jul 19, 2019

Hi @QueensGambit, I'm getting a file-not-found error when following the steps to reproduce.

I do have model-1.19246-0.603-0223.params under model/params

uciok

isready
info string The given batch_size 8 is higher than the number of threads 4. The maximum legal batch_size is the same as the number of threads (here: 4) 
info string The batch_size was reduced to 4
Traceback (most recent call last):
  File "crazyara.py", line 734, in main
    self.setup_network()
  File "crazyara.py", line 169, in setup_network
    model_weights_dir=self.settings["model_weights_dir"]))
  File "/Users/lawei/Downloads/CrazyAra_0.5.0_RiseV2_mobile/DeepCrazyhouse/src/domain/agent/neural_net_api.py", line 60, in __init__
    + '. Please make sure that the path has a "/" at the end of the path.'
Exception: No params file (.params) was found in your given model_weights_dir: ./model/params/. Please make sure that the path has a "/" at the end of the path.

@roywei
Member

roywei commented Jul 19, 2019

Also, I'm getting a parameter-not-found error when trying to load the symbol and params directly:

>>> gluon.nn.SymbolBlock.imports("model-1.19246-0.603-symbol.json", ['data'], "model-1.19246-0.603-0223.params")

[13:25:51] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.4.1. Attempting to upgrade...
[13:25:51] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py:1159: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
	data: None
  input_sym_arg_type = in_param.infer_type()[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py", line 1037, in imports
    ret.collect_params().load(param_file, ctx=ctx, cast_dtype=True, dtype_source='saved')
  File "/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/parameter.py", line 960, in load
    ignore_extra, restore_prefix, filename, cast_dtype, dtype_source)
  File "/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/parameter.py", line 995, in load_dict
    name[lprefix:], error_str, _brief_print_list(arg_dict.keys()))
AssertionError: Parameter 'value_label' is missing in file: model-1.19246-0.603-0223.params, which contains parameters: 'stem_conv0_weight', 'stem_bn0_gamma', 'stem_bn0_beta', ..., 'value_bn0_moving_mean', 'value_bn0_moving_var', 'policy_bn0_moving_mean', 'policy_bn0_moving_var'. Please make sure source and target networks have the same prefix.

@QueensGambit
Contributor Author

QueensGambit commented Jul 19, 2019

@roywei Thank you for the reply.
Sorry for the inconvenience. There was apparently a / missing in the relative path of the config files which I released. I just updated the .zip release files, and it should work again for MXNet 1.4.1.
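
A lookup that does not depend on a trailing "/" avoids this class of error. The following is a minimal sketch using only the standard library; the helper name find_params_file is hypothetical and is not the code used in CrazyAra, and a temporary directory stands in for ./model/params/.

```python
import glob
import os
import tempfile


def find_params_file(model_weights_dir):
    """Return the first .params file in a directory, with or without a trailing slash."""
    # os.path.join inserts a separator only when one is missing
    pattern = os.path.join(model_weights_dir, "*.params")
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError("No .params file found in %s" % model_weights_dir)
    return matches[0]


# Demonstration: a temporary directory stands in for the model weights directory
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "model-1.19246-0.603-0223.params"), "w").close()
    with_slash = find_params_file(tmp + os.sep)
    without_slash = find_params_file(tmp)
    same_file = os.path.basename(with_slash) == os.path.basename(without_slash)
```

Both calls resolve to the same file, so the config value works whether or not it ends with a separator.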

This is how the model is currently loaded:
https://github.com/QueensGambit/CrazyAra/blob/master/DeepCrazyhouse/src/domain/agent/neural_net_api.py#L66

        sym = mx.sym.load(self.symbol_path)
        # https://github.com/apache/incubator-mxnet/issues/6951
        save_dict = mx.nd.load(self.params_path)
        arg_params = {}
        aux_params = {}
        for key, val in save_dict.items():
            param_type, name = key.split(":", 1)
            if param_type == "arg":
                arg_params[name] = val
            if param_type == "aux":
                aux_params[name] = val
        # set the context on CPU, switch to GPU if there is one available
        if ctx == "cpu":
            self.ctx = mx.cpu()
        elif ctx == "gpu":
            self.ctx = mx.gpu()
        else:
            raise Exception("Unavailable ctx mode given %s. You must either select 'cpu' or 'gpu'" % ctx)
        # define batch_size executor objects which are used for inference
        # one executor object is used for the currently requested batch length
        # the requested batch length is variable and at most the given batch_size
        self.executors = []
        for i in range(batch_size):
            executor = sym.simple_bind(
                ctx=self.ctx,
                # add a new length for each size starting with 1
                data=(i + 1, NB_CHANNELS_FULL, BOARD_HEIGHT, BOARD_WIDTH),
                grad_req="null",
                force_rebind=True,
            )
            executor.copy_params_from(arg_params, aux_params)
            self.executors.append(executor)
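
The key-splitting step in the loop above can be tried in isolation. This is a minimal sketch, assuming a plain dictionary with placeholder string values in place of the NDArrays returned by mx.nd.load; the helper name split_params is hypothetical.

```python
def split_params(save_dict):
    """Split a checkpoint-style dict into argument and auxiliary parameters.

    Keys are expected to look like "arg:<name>" or "aux:<name>".
    """
    arg_params = {}
    aux_params = {}
    for key, val in save_dict.items():
        # split only on the first colon so that names may contain colons
        param_type, name = key.split(":", 1)
        if param_type == "arg":
            arg_params[name] = val
        elif param_type == "aux":
            aux_params[name] = val
    return arg_params, aux_params


# Placeholder values stand in for NDArrays (assumption for illustration only)
example = {
    "arg:stem_conv0_weight": "w",
    "aux:value_bn0_moving_mean": "m",
}
arg_params, aux_params = split_params(example)
```

Note that a label variable such as value_label never appears in either dictionary, because labels are inputs to the graph and not stored parameters.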

@QueensGambit
Contributor Author

QueensGambit commented Jul 19, 2019

I think I know why the loading fails; thank you for the help, @roywei. It's because I ported the training code from Gluon to MXNet for this model. The reason for this was that I experienced long delays during training due to MXNET_CUDNN_AUTOTUNE_DEFAULT calls:

Apparently, in MXNet 1.4.1 the code above works and ignores the missing label information, whereas version 1.5.0 rejects it, which is behaviour I appreciate.
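
One way to see which inputs a symbol expects but the parameter file does not provide is to compare the variable nodes in the symbol JSON with the parameter names. The sketch below assumes the usual symbol JSON layout, in which variables are nodes with "op": "null". The tiny inline graph, the operator types in it, and the helper name missing_inputs are hypothetical and only stand in for the real model files.

```python
import json

# Hypothetical, heavily reduced symbol JSON; a real file has many more nodes
SYMBOL_JSON = """
{
  "nodes": [
    {"op": "null", "name": "data", "inputs": []},
    {"op": "null", "name": "stem_conv0_weight", "inputs": []},
    {"op": "Convolution", "name": "stem_conv0", "inputs": [[0, 0, 0], [1, 0, 0]]},
    {"op": "null", "name": "value_label", "inputs": []},
    {"op": "LinearRegressionOutput", "name": "value", "inputs": [[2, 0, 0], [3, 0, 0]]}
  ]
}
"""


def missing_inputs(symbol_json, param_names, data_names=("data",)):
    """List variable nodes that are neither data inputs nor stored parameters."""
    graph = json.loads(symbol_json)
    variables = [node["name"] for node in graph["nodes"] if node["op"] == "null"]
    known = set(param_names) | set(data_names)
    return [name for name in variables if name not in known]


missing = missing_inputs(SYMBOL_JSON, ["stem_conv0_weight"])
print(missing)
```

In this reduced example the only unresolved variable is value_label, which matches the name reported in the AssertionError earlier in the thread.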

Using this code, I'm able to load the model successfully in both MXNet 1.4.1 and MXNet 1.5.0:

import mxnet as mx

model_arch_path = 'model-1.19246-0.603-symbol.json'
model_params_path = 'model-1.19246-0.603-0223.params'
ctx = mx.cpu()
# load the full symbol, which still contains the label inputs
symbol = mx.sym.load(model_arch_path)
inputs = mx.sym.var('data', dtype='float32')
# select only the inference outputs, so that no label variables are required
value_out = symbol.get_internals()['value_tanh0_output']
policy_out = symbol.get_internals()['flatten0_output']
sym = mx.symbol.Group([value_out, policy_out])
net = mx.gluon.SymbolBlock(sym, inputs)
net.collect_params().load(model_params_path, ctx)

Consequently, this issue can be closed.

@devinhee

See insightface #764
