[MXNET-1386] fix for shape mismatch #14728

samskalicky · 2019-04-17T22:49:44Z

Description

fixes #14727

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

…to shape_error

src/executor/graph_executor.cc

src/executor/trt_graph_executor.cc

stu1130

Could we have unit test for it?

samskalicky · 2019-04-22T23:10:06Z

@stu1130 I don't think there's an easy way to have a failing unit test. I tried making a small graph with the same structure, but it didn't error out. I don't think it would be good to use the faster-rcnn model (which produces this error, as in #14727) as a unit test.

@reminisce or @zheng-da or @ZhennanQin can you help us come up with a failing unit test for this problem?

reminisce · 2019-04-23T03:08:51Z

@samskalicky I think you can use the diagram drawn in this picture as a unit test. The nodes just represent unary or binary ops. You can use some concrete operators such as _plus and _prod to build the graph and whitelist one of them for graph partitioning.

samskalicky · 2019-04-23T19:21:04Z

@reminisce here's what I tried in order to reproduce the example graph from the issue, but it does not fail:

import os
import mxnet as mx
from collections import namedtuple
from mxnet.base import _LIB, check_call, c_str, mx_uint, c_str_array
Batch = namedtuple('Batch', ['data'])

#setup whitelist
op_names = ['elemwise_mul']
check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str("default"),
                                             mx_uint(len(op_names)),
                                             c_str_array(op_names)))
os.environ['MXNET_SUBGRAPH_BACKEND'] = 'default'

#setup graph
e = mx.sym.var('e')
f = mx.sym.var('f')
c = mx.sym.var('c')
h = mx.sym.var('h')
k = mx.sym.var('k')

d = e * f
b = c * d
g = h * d
j = k + g
a = b + j

#bind data
ctx = mx.cpu()
c_data = mx.nd.ones((1),ctx=ctx)
e_data = mx.nd.ones((1),ctx=ctx)
f_data = mx.nd.ones((1),ctx=ctx)
h_data = mx.nd.ones((1),ctx=ctx)
k_data = mx.nd.ones((1),ctx=ctx)

args = {'c': c_data, 'e': e_data, 'h': h_data, 'k': k_data}
aux = {}

mod = mx.mod.Module(symbol=a,data_names=['f'],label_names=None,context=mx.cpu())
mod.bind(for_training=False, data_shapes=[('f',(1,))], label_shapes=mod._label_shapes)
mod.set_params(args,aux)

#export to symbol/params files
mod.save_checkpoint('test',0)

#reload from files
sym, arg_params, aux_params = mx.model.load_checkpoint('test', 0)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None, data_names=['f'])
mod.bind(for_training=False, data_shapes=[('f', (1,))],
     label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_missing=True)

#infer
mod.forward(Batch([mx.nd.ones((1))]))
print(mod.get_outputs())

reminisce · 2019-04-24T23:22:22Z

@samskalicky You just need to check the order of input names. See the following script

import os
import ctypes
import mxnet as mx
from collections import namedtuple
from mxnet.base import _LIB, check_call, c_str, mx_uint, c_str_array, SymbolHandle
Batch = namedtuple('Batch', ['data'])

#setup whitelist
op_names = ['elemwise_mul', '_plus', '_Plus']
#os.environ['MXNET_SUBGRAPH_BACKEND'] = 'default'

#setup graph
e = mx.sym.var('e')
f = mx.sym.var('f')
c = mx.sym.var('c')
h = mx.sym.var('h')
k = mx.sym.var('k')

d = e * f
b = c * d
g = h * d
j = k + g
a = b + j
sym = a

print(a.list_inputs())
# result: ['c', 'e', 'f', 'k', 'h']

subgraph_backend = 'default'
out = SymbolHandle()
check_call(_LIB.MXBuildSubgraphByOpNames(sym.handle, c_str(subgraph_backend), mx_uint(len(op_names)),
                                         c_str_array(op_names), ctypes.byref(out)))
psym = mx.sym.Symbol(out)
print(psym.list_inputs())
# result: ['c', 'e', 'f', 'h', 'k']
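
The two `list_inputs()` results above differ only in the positions of `h` and `k`. A minimal sketch of why that matters, in plain Python with no MXNet calls: if per-input data such as shapes is stored in the original order and then paired with the partitioned symbol's inputs by position, `h` and `k` receive each other's entries, whereas a lookup by name is unaffected. The shapes and both helper functions are hypothetical and only illustrate the idea; the actual change in graph_executor.cc is not shown in this thread.

```python
# Illustrative only: plain Python, no MXNet required.
# Input orders copied from the two list_inputs() results above.
original_order = ['c', 'e', 'f', 'k', 'h']
partitioned_order = ['c', 'e', 'f', 'h', 'k']

# Hypothetical per-input shapes, chosen to be distinct so a mix-up is visible.
shapes_by_name = {'c': (1,), 'e': (2,), 'f': (3,), 'h': (4,), 'k': (5,)}

# Shapes laid out in the original symbol's argument order.
original_shapes = [shapes_by_name[name] for name in original_order]


def assign_by_position(order, shapes):
    """Pair the i-th name with the i-th shape, ignoring any reordering."""
    return dict(zip(order, shapes))


def assign_by_name(order, lookup):
    """Look each name up explicitly, so reordering is harmless."""
    return {name: lookup[name] for name in order}


positional = assign_by_position(partitioned_order, original_shapes)
by_name = assign_by_name(partitioned_order, shapes_by_name)

# Names whose positionally assigned shape differs from the intended one.
mismatched = sorted(n for n in partitioned_order
                    if positional[n] != shapes_by_name[n])
print(mismatched)
print(by_name == shapes_by_name)
```

With all shapes equal, as in the first repro script, the positional pairing would go unnoticed, which is consistent with that script not failing.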

samskalicky · 2019-04-25T00:27:52Z

Thanks @reminisce! That's a better way to check. But I just realized that my test code was missing the whole point of the problem (shapes mismatching) because I made all the data shapes (1)! Here's code with different shapes that does indeed fail:

import os
import mxnet as mx
from collections import namedtuple
from mxnet.base import _LIB, check_call, c_str, mx_uint, c_str_array
Batch = namedtuple('Batch', ['data'])

#setup whitelist
op_names = ['elemwise_mul']
check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str("default"),
                                             mx_uint(len(op_names)),
                                             c_str_array(op_names)))
os.environ['MXNET_SUBGRAPH_BACKEND'] = 'default'

#setup graph
e = mx.sym.var('e')
f = mx.sym.var('f')
c = mx.sym.var('c')
h = mx.sym.var('h')
k = mx.sym.var('k')

d = e * f
b = c * d
g = h * d
j = k + g
a = b + j

#bind data
ctx = mx.cpu()
c_data = mx.nd.ones((1),ctx=ctx)
e_data = mx.nd.ones((2),ctx=ctx)
f_data = mx.nd.ones((3),ctx=ctx)
h_data = mx.nd.ones((4),ctx=ctx)
k_data = mx.nd.ones((5),ctx=ctx)

args = {'c': c_data, 'e': e_data, 'h': h_data, 'k': k_data}
aux = {}

mod = mx.mod.Module(symbol=a,data_names=['f'],label_names=None,context=mx.cpu())
mod.bind(for_training=False, data_shapes=[('f',(3,))], label_shapes=mod._label_shapes)
mod.set_params(args,aux)

#export to symbol/params files
mod.save_checkpoint('test',0)

#reload from files
sym, arg_params, aux_params = mx.model.load_checkpoint('test', 0)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None, data_names=['f'])
mod.bind(for_training=False, data_shapes=[('f', (3,))],
     label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_missing=True)

#infer
mod.forward(Batch([mx.nd.ones((3,))]))
print(mod.get_outputs())

The failure is this message:

[17:27:24] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
Traceback (most recent call last):
  File "test.py", line 40, in <module>
    mod.set_params(args,aux)
  File "/usr/local/lib/python2.7/site-packages/mxnet/module/module.py", line 350, in set_params
    allow_extra=allow_extra)
  File "/usr/local/lib/python2.7/site-packages/mxnet/module/module.py", line 309, in init_params
    _impl(desc, arr, arg_params)
  File "/usr/local/lib/python2.7/site-packages/mxnet/module/module.py", line 297, in _impl
    cache_arr.copyto(arr)
  File "/usr/local/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 2074, in copyto
    return _internal._copyto(self, out=other)
  File "<string>", line 25, in _copyto
  File "/usr/local/lib/python2.7/site-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/usr/local/lib/python2.7/site-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:27:24] src/operator/contrib/../tensor/../elemwise_op_common.h:135: Check failed: assign(&dattr, vec.at(i)) Incompatible attr in node  at 0-th output: expected [1], got [3]

reminisce · 2019-04-25T00:59:44Z

@samskalicky You are right. I missed the point too in the first place. You can add the failing test case to the unit test set to validate your fix. Thanks.

…to shape_error

samskalicky · 2019-05-02T17:41:57Z

@reminisce @stu1130 @ZhennanQin I'm unable to make a unit test for this. The code I showed above errors out irrespective of the bug/fix. Shape propagation in general is extremely restrictive and does not result in the shapes I'm looking for.

Can someone help provide a working unit test?

reminisce · 2019-05-07T03:50:06Z

@samskalicky What error message do you still see with this fix?

pinaraws · 2019-05-20T16:40:23Z

@samskalicky What error message do you still see with this fix?

…to shape_error

samskalicky · 2019-05-30T18:49:38Z

@reminisce @pinaraws it's the same error message as above. The simple example doesn't work because shape propagation cannot infer the different shapes correctly. With such a simple example there is no way to infer the specific shapes we're looking for.

So rather than hold up this PR further, I just added the fasterRCNN test as the unit test for this issue.
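
The restriction on shape propagation described here can be sketched with a toy model in plain Python. It assumes that `*` and `+` on symbols map to element-wise operators that require identical input shapes with no broadcasting, which the `elemwise_op_common.h` check in the traceback above suggests but this thread does not state outright. `unify` and `infer_output_shape` are hypothetical helpers, not MXNet functions.

```python
# Illustrative only: a toy stand-in for same-shape inference, not MXNet's code.
def unify(*shapes):
    """Return the common shape of element-wise inputs, or raise on conflict."""
    known = [s for s in shapes if s is not None]
    for s in known[1:]:
        if s != known[0]:
            raise ValueError("expected %s, got %s" % (list(known[0]), list(s)))
    return known[0] if known else None


def infer_output_shape(shapes):
    """Propagate shapes through d=e*f, b=c*d, g=h*d, j=k+g, a=b+j."""
    d = unify(shapes['e'], shapes['f'])
    b = unify(shapes['c'], d)
    g = unify(shapes['h'], d)
    j = unify(shapes['k'], g)
    return unify(b, j)


# All inputs share one shape: inference succeeds, as in the first script.
same = {name: (1,) for name in 'cefhk'}
print(infer_output_shape(same))

# Distinct shapes: inference fails regardless of how inputs are ordered.
distinct = {'c': (1,), 'e': (2,), 'f': (3,), 'h': (4,), 'k': (5,)}
try:
    infer_output_shape(distinct)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```

Under this assumption, a graph made only of same-shape element-wise operators can never carry differently shaped inputs, so it cannot distinguish a correct input ordering from a swapped one.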

piyushghai · 2019-06-07T22:38:28Z

@samskalicky What's the path forward for this PR?

vandanavk · 2019-06-16T19:09:55Z

@reminisce is this PR good to go?

src/executor/graph_executor.cc

karan6181 · 2019-07-19T00:55:13Z

@samskalicky Could you please address the review comments? Thanks!

piyushghai · 2019-08-02T19:52:43Z

@samskalicky Bouncing for an update ...

karan6181 · 2019-08-29T16:53:07Z

@samskalicky Gentle ping ..

samskalicky · 2019-10-03T16:29:54Z

@pengzhao-intel I'm getting a weird failure for the MKL test_subgraph.py test, but all the other tests are passing. Here's one of the failing tests (from the unix-cpu job):

======================================================================
ERROR: test_subgraph.test_pos_conv_add2
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_subgraph.py", line 735, in test_pos_conv_add2
    check_fusion(net, data_shape, attrs)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_subgraph.py", line 272, in check_fusion
    assert_almost_equal(exe.outputs[i].asnumpy(), exe_sg.outputs[i].asnumpy(), rtol=1e-3, atol=1e-1)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 2504, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: std::exception
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=927032378 to reproduce.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=22356853 to reproduce.
--------------------- >> end captured logging << ---------------------

Can someone from the Intel team help debug?

pengzhao-intel · 2019-10-06T09:08:27Z

Yes, we are on vacation and will take a look next week.
Could you file a new issue to track this failure?

ZhennanQin · 2019-10-09T01:05:39Z

@samskalicky I believe #15518 is able to address this issue. Can you try the nightly build to see if the original network still has this problem? Thanks.

ChaiBapchya · 2020-01-15T01:42:21Z

@ZhennanQin does #15518 address the test_subgraph.test_pos_conv_add2 issue?

ZhennanQin · 2020-01-16T00:49:22Z

@ChaiBapchya test_subgraph.test_pos_conv_add2 doesn't fail on master, but fails with this PR. What I mean is that #15518 can resolve the same problem (shape mismatch) without introducing a test failure.

fix for shape mismatch

517d294

Roshrini requested a review from reminisce April 18, 2019 18:26

Ubuntu and others added 6 commits April 19, 2019 17:12

Merge branch 'master' of https://github.com/apache/incubator-mxnet in…

72e8a0c

…to shape_error

fixed whitespace

98e738c

fixed trt override of initarguments function

e8dd4ce

fix whitespace

dd0d86d

added fix for trt bind

6589761

retrigger ci

d53709d

Roshrini added the pr-awaiting-review label Apr 19, 2019

yuxihu reviewed Apr 22, 2019

src/executor/graph_executor.cc (outdated)

src/executor/graph_executor.cc (outdated)

src/executor/trt_graph_executor.cc (outdated)

stu1130 reviewed Apr 22, 2019

Merge branch 'master' of https://github.com/apache/incubator-mxnet in…

02c49d9

…to shape_error

samskalicky added 4 commits May 30, 2019 01:15

changed names to reference, added unit test

b0f3bde

Merge branch 'master' of https://github.com/apache/incubator-mxnet in…

803768f

…to shape_error

Merge branch 'master' of https://github.com/apache/incubator-mxnet in…

ace7345

…to shape_error

fixed test

16d2c08

ptrendx reviewed Jun 25, 2019

src/executor/graph_executor.cc

ptrendx reviewed Jun 25, 2019

src/executor/graph_executor.cc

ptrendx reviewed Jun 25, 2019

src/executor/graph_executor.cc (outdated)

larroy reviewed Jun 25, 2019

src/executor/graph_executor.cc (outdated)

larroy reviewed Jun 25, 2019

src/executor/graph_executor.cc

samskalicky requested review from aaronmarkham, anirudh2290, eric-haibin-lin, gigasquid, iblislin, marcoabreu, nswamy, sergeykolychev, szha and yzhliu as code owners October 2, 2019 23:47

samskalicky force-pushed the shape_error branch from 1c4f86a to 16d2c08 Compare October 3, 2019 00:14

Sam Skalicky added 7 commits October 3, 2019 00:15

Merge remote-tracking branch 'upstream/master' into shape_error

268a608

removed test

7674b08

removed other test changes

88e2c89

updated for bind and reshape flow

6c3cfea

added check for duplicate names

ef755c4

changed to unordered_map

6f90036

fixed whitespace

229ff48

samskalicky closed this Aug 27, 2020
