Add unoptimized symbol to executor for sharing #16798

ptrendx · 2019-11-13T01:09:05Z

Description

This PR adds additional symbol_ member to GraphExecutor, which contains the unoptimized symbol used by this executor. That symbol is then passed to the child executors when calling Reshape instead of the forward portion of the already optimized graph. This prevents a failure that would happen when the forward portion of the already optimized version of the symbol (with operators like _FusedOp which do not have FGradient method, as they are supposed to work on full graph) is passed to the new executor and the new executor tries to construct the full graph out of it.

@roywei @samskalicky FYI, please validate that this PR fixes the issue with Keras-MXNet integration.

Comments

This is a quick fix for the issue, but the right approach in the longer term would be to change the Reshape function in the executor to fully share the graph between the executor and its children instead of sharing only a forward portion of it and relying on the child executor to create the entire graph out of it - it is wasteful to recreate the full graph every time to do a simple reshape.
Subgraph API, since most of its logic sits inside Bind/SimpleBind, cannot be recreated from the unoptimized symbol. Since Subgraph API is used only for inference, it does not expose this issue. In order to not break the Subgraph API after reshape the symbol_ field is populated with the symbol after building the subgraph.

ptrendx · 2019-11-13T16:51:30Z

The error in the Scala CPU test (http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-16798/runs/1/nodes/302/steps/605/log/?start=0) seems to be similar to issue #12579 (deconvolution gradient being too big).

I will retrigger that test.

ptrendx · 2019-11-15T21:25:03Z

@roywei - Ping, could you test that it fixes your issue?

roywei · 2019-11-19T05:02:29Z

Thanks for the fix! The unit test in this PR passed on my local, but I'm still getting the error on keras-mxnet side.

What keras-mxnet is trying to do during module.reshape is: it's using Bucketing module to keep 3 graphs, and share the same set of parameters, 1 graph is used for training, 1 for validation and 1 for test per Keras front end requirement.

While I'm not sure if any other downstream project is still using MXNet this way, also I'm not sure if I can reproduce in pure MXNet in a meaningful way, I'm open to disable pointwise fusion at keras-mxnet side when loading MXNet.

1452/1563 [==========================>...] - ETA: 6s - loss: 1.5897 - acc: 0.4220Traceback (most recent call last):
  File "cifar10_resnet.py", line 426, in <module>
    callbacks=callbacks)
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1433, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1232, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/mxnet_backend.py", line 5590, in train_function
    data, label, _, data_shapes, label_shapes = self._adjust_module(inputs, 'train')
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/mxnet_backend.py", line 5534, in _adjust_module
    self._module._curr_module.reshape(data_shapes, label_shapes)
  File "/home/ubuntu/mxnet/python/mxnet/module/module.py", line 472, in reshape
    self._exec_group.reshape(self._data_shapes, self._label_shapes)
  File "/home/ubuntu/mxnet/python/mxnet/module/executor_group.py", line 397, in reshape
self.bind_exec(data_shapes, label_shapes, reshape=True)
  File "/home/ubuntu/mxnet/python/mxnet/module/executor_group.py", line 373, in bind_exec
    allow_up_sizing=True, **dict(data_shapes_i + label_shapes_i))
  File "/home/ubuntu/mxnet/python/mxnet/executor.py", line 458, in reshape
    ctypes.byref(handle)))
  File "/home/ubuntu/mxnet/python/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [04:55:10] ../src/nnvm/gradient.cc:213: Operator _FusedOp is non-differentiable because it didn'
t register FGradient attribute.
Stack trace:
  [bt] (0) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x34) [0x7
f12f3234b70]
  [bt] (1) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(+0x8219fb4) [0x7f12f83a3fb4]
[bt] (2) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<nnvm::Graph (nnvm::Graph), nn
vm::Graph (*)(nnvm::Graph)>::_M_invoke(std::_Any_data const&, nnvm::Graph&&)+0x76) [0x7f12f83aa347]
  [bt] (3) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(std::function<nnvm::Graph (nnvm::Graph)>::operator()
(nnvm::Graph) const+0x60) [0x7f12f83b3b62]
  [bt] (4) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(nnvm::ApplyPasses(nnvm::Graph, std::vector<std::__cx
x11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char,
 std::char_traits<char>, std::allocator<char> > > > const&)+0x424) [0x7f12fbcc32c9]
  [bt] (5) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(nnvm::ApplyPass(nnvm::Graph, std::__cxx11::basic_str
ing<char, std::char_traits<char>, std::allocator<char> > const&)+0xca) [0x7f12f7fe9036]
  [bt] (6) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(nnvm::pass::MXGradient(nnvm::Graph, std::vector<nnvm
::NodeEntry, std::allocator<nnvm::NodeEntry> >, std::vector<nnvm::NodeEntry, std::allocator<nnvm::NodeEntry> >, std::ve
ctor<nnvm::NodeEntry, std::allocator<nnvm::NodeEntry> >, std::function<nnvm::NodeEntry (std::vector<nnvm::NodeEntry, st
d::allocator<nnvm::NodeEntry> >&&)>, std::function<int (nnvm::Node const&)>, std::function<nnvm::NodeEntry (nnvm::NodeE
ntry const&, nnvm::NodeEntry const&)>, std::vector<nnvm::Op const*, std::allocator<nnvm::Op const*> >, std::__cxx11::ba
sic_string<char, std::char_traits<char>, std::allocator<char> >)+0x693) [0x7f12f80ea28a]
  [bt] (7) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::exec::GraphExecutor::InitFullGraph(nnvm::Symb
ol, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)+0x6fd) [0x7f12f80d337f]
  [bt] (8) /home/ubuntu/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::exec::GraphExecutor::InitGraph(nnvm::Symbol,
mxnet::Context const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet:
:Context, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<s
td::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > con
st&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mx
net::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::OpReqTy
pe, std::allocator<mxnet::OpReqType> > const&)+0xa5) [0x7f12f80d95af]

ptrendx · 2019-11-19T17:34:35Z

@roywei Ok, let me try to repro using your script and debug further.

ptrendx · 2019-11-19T18:55:17Z

Yup, there was a small issue that the fix worked when doing Bind/SimpleBind (as those functions copied the symbol before optimizing it), but Reshape was passing the symbol as-is for optimization, modifying it in the process, so doing Reshape twice did not work. I made a change that fixes that and am currently running your test to make sure it runs through.
I will be introducing a test for multiple reshapes in a row too.

ptrendx · 2019-11-19T22:55:08Z

@roywei With the latest version of this PR your script passes for me locally - could you validate?

roywei

confirm keras-mxnet test passed! thanks!

* Add unoptimized symbol to executor for sharing * Copy the symbol in Reshape * Added test for multiple reshapes

* Add unoptimized symbol to executor for sharing (#16798) * Add unoptimized symbol to executor for sharing * Copy the symbol in Reshape * Added test for multiple reshapes * Mixed precison binary op backward (use in) for numpy (#16791) * mixed precison binary op backward * reduce unix cpu runtime * USE_NVRTC -> ENABLE_CUDA_RTC to fix maven build. Add compile-guard to fusion. (#16838) * Rename USE_NVRTC -> ENABLE_CUDA_RTC to fix maven build. Compile-guard fusion framework. * Fix fusion-not-supported warning. * Fix compile guards * Fix cmake build so -DMXNET_ENABLE_CUDA_RTC=1 is passed to nvcc * Minimize side-effects of prev change * Fix InferAttr/InferShapeAttr not calling inference for all nodes in a graph (#16836) * Fix the attribute inference omitting nodes * Add test * Cleaning * Fix lint * Fix TransposeShape * Fix WhileLoopType * Changing a/b test for fusion to a/(b+1) to increase numerical stability * Revert "Mixed precison binary op backward (use in) for numpy (#16791)" This reverts commit 8b58b78.

Add unoptimized symbol to executor for sharing

28603cd

ptrendx requested a review from DickJC123 November 13, 2019 01:12

ptrendx added the R1.6.0 label Nov 18, 2019

Copy the symbol in Reshape

47e194e

Added test for multiple reshapes

3256485

roywei approved these changes Nov 20, 2019

View reviewed changes

ptrendx merged commit 61c8baf into apache:master Nov 20, 2019

ptrendx added a commit to ptrendx/mxnet that referenced this pull request Nov 20, 2019

Add unoptimized symbol to executor for sharing (apache#16798)

e3e63fe

* Add unoptimized symbol to executor for sharing * Copy the symbol in Reshape * Added test for multiple reshapes

ptrendx mentioned this pull request Nov 20, 2019

Backport #16798, #16836 and #16838 to 1.6 #16874

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add unoptimized symbol to executor for sharing #16798

Add unoptimized symbol to executor for sharing #16798

ptrendx commented Nov 13, 2019

ptrendx commented Nov 13, 2019

ptrendx commented Nov 15, 2019

roywei commented Nov 19, 2019 •

edited

Loading

ptrendx commented Nov 19, 2019

ptrendx commented Nov 19, 2019

ptrendx commented Nov 19, 2019

roywei left a comment

Add unoptimized symbol to executor for sharing #16798

Add unoptimized symbol to executor for sharing #16798

Conversation

ptrendx commented Nov 13, 2019

Description

Comments

ptrendx commented Nov 13, 2019

ptrendx commented Nov 15, 2019

roywei commented Nov 19, 2019 • edited Loading

ptrendx commented Nov 19, 2019

ptrendx commented Nov 19, 2019

ptrendx commented Nov 19, 2019

roywei left a comment

Choose a reason for hiding this comment

roywei commented Nov 19, 2019 •

edited

Loading