This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-1386] fix for shape mismatch #14728

Closed
wants to merge 19 commits into from

Conversation

samskalicky
Contributor

Description

fixes #14727

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@Roshrini requested a review from reminisce on April 18, 2019 18:26
@Roshrini added the pr-awaiting-review label (PR is waiting for code review) on Apr 19, 2019
src/executor/graph_executor.cc (outdated review thread, resolved)
src/executor/graph_executor.cc (outdated review thread, resolved)
src/executor/trt_graph_executor.cc (outdated review thread, resolved)
Contributor

@stu1130 stu1130 left a comment

Could we have a unit test for it?

@samskalicky
Contributor Author

samskalicky commented Apr 22, 2019

@stu1130 I don't think there's an easy way to write a failing unit test. I tried making a small graph with the same structure, but it didn't error out. I don't think it would be good to use the faster-rcnn model (which produces this error, as in #14727) as a unit test.

@reminisce or @zheng-da or @ZhennanQin can you help us come up with a failing unit test for this problem?

@reminisce
Contributor

@samskalicky I think you can use the diagram drawn in this picture as a unit test. The nodes just represent unary or binary ops. You can use some concrete operators such as _plus and _prod to build the graph and whitelist one of them for graph partitioning.

@samskalicky
Contributor Author

@reminisce here's what I tried in order to reproduce the example graph from the issue, but it does not fail:

import os
import mxnet as mx
from collections import namedtuple
from mxnet.base import _LIB, check_call, c_str, mx_uint, c_str_array
Batch = namedtuple('Batch', ['data'])

#setup whitelist
op_names = ['elemwise_mul']
check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str("default"),
                                             mx_uint(len(op_names)),
                                             c_str_array(op_names)))
os.environ['MXNET_SUBGRAPH_BACKEND'] = 'default'

#setup graph
e = mx.sym.var('e')
f = mx.sym.var('f')
c = mx.sym.var('c')
h = mx.sym.var('h')
k = mx.sym.var('k')

d = e * f
b = c * d
g = h * d
j = k + g
a = b + j

#bind data
ctx = mx.cpu()
c_data = mx.nd.ones((1),ctx=ctx)
e_data = mx.nd.ones((1),ctx=ctx)
f_data = mx.nd.ones((1),ctx=ctx)
h_data = mx.nd.ones((1),ctx=ctx)
k_data = mx.nd.ones((1),ctx=ctx)

args = {'c': c_data, 'e': e_data, 'h': h_data, 'k': k_data}
aux = {}

mod = mx.mod.Module(symbol=a,data_names=['f'],label_names=None,context=mx.cpu())
mod.bind(for_training=False, data_shapes=[('f',(1,))], label_shapes=mod._label_shapes)
mod.set_params(args,aux)

#export to symbol/params files
mod.save_checkpoint('test',0)

#reload from files
sym, arg_params, aux_params = mx.model.load_checkpoint('test', 0)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None, data_names=['f'])
mod.bind(for_training=False, data_shapes=[('f', (1,))],
     label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_missing=True)

#infer
mod.forward(Batch([mx.nd.ones((1))]))
print(mod.get_outputs())

@reminisce
Contributor

@samskalicky You just need to check the order of the input names. See the following script:

import os
import ctypes
import mxnet as mx
from collections import namedtuple
from mxnet.base import _LIB, check_call, c_str, mx_uint, c_str_array, SymbolHandle
Batch = namedtuple('Batch', ['data'])

#setup whitelist
op_names = ['elemwise_mul', '_plus', '_Plus']
#os.environ['MXNET_SUBGRAPH_BACKEND'] = 'default'

#setup graph
e = mx.sym.var('e')
f = mx.sym.var('f')
c = mx.sym.var('c')
h = mx.sym.var('h')
k = mx.sym.var('k')

d = e * f
b = c * d
g = h * d
j = k + g
a = b + j
sym = a

print(a.list_inputs())
# result: ['c', 'e', 'f', 'k', 'h']

subgraph_backend = 'default'
out = SymbolHandle()
check_call(_LIB.MXBuildSubgraphByOpNames(sym.handle, c_str(subgraph_backend), mx_uint(len(op_names)),
                                         c_str_array(op_names), ctypes.byref(out)))
psym = mx.sym.Symbol(out)
print(psym.list_inputs())
# result: ['c', 'e', 'f', 'h', 'k']
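
The two printed orders differ only in the positions of h and k. As a minimal sketch of why that matters, assuming per-input shapes are kept in a list aligned with the original input order (the helper names and the shapes below are hypothetical and are not taken from MXNet or from this PR), reusing such a list positionally against the partitioned symbol hands h and k each other's shape, while a lookup by name does not. It also shows that the mix-up is invisible when every input has the same shape:

```python
# Input orders printed by the script above.
original_inputs = ["c", "e", "f", "k", "h"]     # a.list_inputs()
partitioned_inputs = ["c", "e", "f", "h", "k"]  # psym.list_inputs()


def remap_by_name(src_names, src_values, dst_names):
    """Reorder src_values (aligned with src_names) to follow dst_names."""
    by_name = dict(zip(src_names, src_values))
    return [by_name[name] for name in dst_names]


def misassigned(shapes_in_original_order):
    """Names that get the wrong shape if the list is reused positionally."""
    correct = remap_by_name(original_inputs, shapes_in_original_order,
                            partitioned_inputs)
    return [name
            for name, naive, right in zip(partitioned_inputs,
                                          shapes_in_original_order, correct)
            if naive != right]


# Hypothetical shapes, one distinct shape per input (c=1, e=2, f=3, k=5, h=4).
distinct = [(1,), (2,), (3,), (5,), (4,)]
# Uniform shapes: every input has shape (1,).
uniform = [(1,)] * 5

print(misassigned(distinct))  # h and k receive each other's shape
print(misassigned(uniform))   # empty: nothing to detect
```

Keying by name rather than by position is only one way to make the two orders agree; whether the fix in this PR does exactly that is not something this sketch claims.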

@samskalicky
Contributor Author

Thanks @reminisce! That's a better way to check. But I just realized that my test code was missing the whole point of the problem (shapes mismatching) because I made all the data shapes (1)! Here's code with different shapes that does indeed fail:

import os
import mxnet as mx
from collections import namedtuple
from mxnet.base import _LIB, check_call, c_str, mx_uint, c_str_array
Batch = namedtuple('Batch', ['data'])

#setup whitelist
op_names = ['elemwise_mul']
check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str("default"),
                                             mx_uint(len(op_names)),
                                             c_str_array(op_names)))
os.environ['MXNET_SUBGRAPH_BACKEND'] = 'default'

#setup graph
e = mx.sym.var('e')
f = mx.sym.var('f')
c = mx.sym.var('c')
h = mx.sym.var('h')
k = mx.sym.var('k')

d = e * f
b = c * d
g = h * d
j = k + g
a = b + j

#bind data
ctx = mx.cpu()
c_data = mx.nd.ones((1),ctx=ctx)
e_data = mx.nd.ones((2),ctx=ctx)
f_data = mx.nd.ones((3),ctx=ctx)
h_data = mx.nd.ones((4),ctx=ctx)
k_data = mx.nd.ones((5),ctx=ctx)

args = {'c': c_data, 'e': e_data, 'h': h_data, 'k': k_data}
aux = {}

mod = mx.mod.Module(symbol=a,data_names=['f'],label_names=None,context=mx.cpu())
mod.bind(for_training=False, data_shapes=[('f',(3,))], label_shapes=mod._label_shapes)
mod.set_params(args,aux)

#export to symbol/params files
mod.save_checkpoint('test',0)

#reload from files
sym, arg_params, aux_params = mx.model.load_checkpoint('test', 0)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None, data_names=['f'])
mod.bind(for_training=False, data_shapes=[('f', (3,))],
     label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_missing=True)

#infer
mod.forward(Batch([mx.nd.ones((3,))]))
print(mod.get_outputs())

The failure is this message:

[17:27:24] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
Traceback (most recent call last):
  File "test.py", line 40, in <module>
    mod.set_params(args,aux)
  File "/usr/local/lib/python2.7/site-packages/mxnet/module/module.py", line 350, in set_params
    allow_extra=allow_extra)
  File "/usr/local/lib/python2.7/site-packages/mxnet/module/module.py", line 309, in init_params
    _impl(desc, arr, arg_params)
  File "/usr/local/lib/python2.7/site-packages/mxnet/module/module.py", line 297, in _impl
    cache_arr.copyto(arr)
  File "/usr/local/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 2074, in copyto
    return _internal._copyto(self, out=other)
  File "<string>", line 25, in _copyto
  File "/usr/local/lib/python2.7/site-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/usr/local/lib/python2.7/site-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:27:24] src/operator/contrib/../tensor/../elemwise_op_common.h:135: Check failed: assign(&dattr, vec.at(i)) Incompatible attr in node  at 0-th output: expected [1], got [3]

@reminisce
Contributor

@samskalicky You are right. I missed the point too in the first place. You can add the failing test case to the unit test set to validate your fix. Thanks.

@samskalicky
Contributor Author

@reminisce @stu1130 @ZhennanQin I'm unable to make a unit test for this. The code I showed above errors out irrespective of the bug/fix. Shape propagation in general is extremely restrictive and does not result in the shapes I'm looking for.

Can someone help provide a working unit test?

@reminisce
Contributor

@samskalicky What error message do you still see with this fix?

1 similar comment
@pinaraws

@samskalicky What error message do you still see with this fix?

@samskalicky
Contributor Author

@reminisce @pinaraws it's the same error message as above. The simple example doesn't work because shape propagation cannot infer the different shapes correctly. With such a simple example there is no way to infer the specific shapes we're looking for.

So rather than hold up this PR further, I just added the fasterRCNN test as the unit test for this issue.
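
One plausible reading of why the small graph fails with or without the fix can be sketched with a toy model of elementwise shape propagation. This is an illustration only, not MXNet's inference code: it assumes that every node in the graph is an elementwise binary op whose two inputs and output must share one shape, and the function name is hypothetical. Under that assumption, binding f with shape (3,) forces every other variable to (3,), so the provided parameters of shapes (1,), (2,), (4,) and (5,) can never be copied in, which matches the "expected [1], got [3]" message raised from set_params.

```python
# Toy graph from the test script: output -> (lhs, rhs).
GRAPH = {
    "d": ("e", "f"),
    "b": ("c", "d"),
    "g": ("h", "d"),
    "j": ("k", "g"),
    "a": ("b", "j"),
}


def infer_elemwise_shapes(graph, known):
    """Propagate shapes to a fixed point, assuming every op is elementwise.

    An elementwise op ties its output and both inputs to a single shape,
    so one known shape in the group determines the other two.
    """
    shapes = dict(known)
    changed = True
    while changed:
        changed = False
        for out, (lhs, rhs) in graph.items():
            group = (out, lhs, rhs)
            found = {shapes[name] for name in group if name in shapes}
            if len(found) > 1:
                raise ValueError("incompatible shapes at %s: %s"
                                 % (out, sorted(found)))
            if found:
                shape = next(iter(found))
                for name in group:
                    if name not in shapes:
                        shapes[name] = shape
                        changed = True
    return shapes


# Only the data input 'f' has a known shape at bind time.
inferred = infer_elemwise_shapes(GRAPH, {"f": (3,)})

# Shapes of the parameter arrays passed to set_params in the test script.
provided = {"c": (1,), "e": (2,), "h": (4,), "k": (5,)}
mismatches = {name: (shape, inferred[name])
              for name, shape in provided.items()
              if inferred[name] != shape}
print(mismatches)  # every provided parameter disagrees with the inferred (3,)
```

Because the conflict comes from the elementwise constraint itself, the order of the input names plays no part in it, which is consistent with the observation that the script errors out irrespective of the bug or the fix.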

@piyushghai
Contributor

@samskalicky What's the path forward for this PR?

@vandanavk
Contributor

@reminisce is this PR good to go?

@karan6181
Contributor

@samskalicky Could you please address the review comments? Thanks!

@piyushghai
Contributor

@samskalicky Bouncing for an update ...

@karan6181
Contributor

@samskalicky Gentle ping ..

@samskalicky
Contributor Author

samskalicky commented Oct 3, 2019

@pengzhao-intel I'm getting a weird failure for the MKL test_subgraph.py test, but all the other tests are passing. Here's one of the failing tests (from the unix-cpu job):

======================================================================
ERROR: test_subgraph.test_pos_conv_add2
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_subgraph.py", line 735, in test_pos_conv_add2
    check_fusion(net, data_shape, attrs)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_subgraph.py", line 272, in check_fusion
    assert_almost_equal(exe.outputs[i].asnumpy(), exe_sg.outputs[i].asnumpy(), rtol=1e-3, atol=1e-1)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 2504, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: std::exception
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=927032378 to reproduce.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=22356853 to reproduce.
--------------------- >> end captured logging << ---------------------

Can someone from the Intel team help debug?

@pengzhao-intel
Contributor

Yes, we are on vacation and will take a look next week.
Could you file a new issue to track this failure?

@ZhennanQin
Contributor

@samskalicky I believe #15518 is able to address this issue. Can you try the nightly build to see whether the original network still has this problem? Thanks.

@ChaiBapchya
Contributor

@ZhennanQin does #15518 address the test_subgraph.test_pos_conv_add2 issue?

@ZhennanQin
Contributor

@ChaiBapchya test_subgraph.test_pos_conv_add2 doesn't fail on master, but it fails with this PR. What I mean is that #15518 can resolve the same problem (shape mismatch) without introducing test failures.

Labels
pr-awaiting-review PR is waiting for code review
Development

Successfully merging this pull request may close these issues.

shape input names order mismatch after partitioning