Add unoptimized symbol to executor for sharing #16798
Conversation
The error in the Scala CPU test (http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-16798/runs/1/nodes/302/steps/605/log/?start=0) seems to be similar to issue #12579 (the deconvolution gradient being too big). I will retrigger that test.
@roywei - Ping, could you test that it fixes your issue?
Thanks for the fix! The unit test in this PR passed on my local machine, but I'm still getting the error on the keras-mxnet side (what keras-mxnet is trying to do during …). While I'm not sure whether any other downstream project is still using MXNet this way, and I'm not sure I can reproduce it in pure MXNet in a meaningful way, I'm open to disabling pointwise fusion on the keras-mxnet side when loading MXNet.
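For context, MXNet exposes an environment variable, `MXNET_USE_FUSION`, that controls pointwise operator fusion; a downstream project could opt out at import time. A minimal sketch (the script name is a placeholder, and the variable must be set before `mxnet` is first imported):

```python
import os

# Disable pointwise operator fusion for this process. This must happen
# before the mxnet module is imported for the first time.
os.environ["MXNET_USE_FUSION"] = "0"

# import mxnet as mx  # fusion would now be disabled from here on
```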
@roywei Ok, let me try to repro using your script and debug further.
Yup, there was a small issue: the fix worked when doing …
@roywei With the latest version of this PR your script passes for me locally - could you validate?
Confirmed that the keras-mxnet test passed, thanks!
* Add unoptimized symbol to executor for sharing
* Copy the symbol in Reshape
* Added test for multiple reshapes
* Add unoptimized symbol to executor for sharing (#16798)
* Add unoptimized symbol to executor for sharing
* Copy the symbol in Reshape
* Added test for multiple reshapes
* Mixed precison binary op backward (use in) for numpy (#16791)
* mixed precison binary op backward
* reduce unix cpu runtime
* USE_NVRTC -> ENABLE_CUDA_RTC to fix maven build. Add compile-guard to fusion. (#16838)
* Rename USE_NVRTC -> ENABLE_CUDA_RTC to fix maven build. Compile-guard fusion framework.
* Fix fusion-not-supported warning.
* Fix compile guards
* Fix cmake build so -DMXNET_ENABLE_CUDA_RTC=1 is passed to nvcc
* Minimize side-effects of prev change
* Fix InferAttr/InferShapeAttr not calling inference for all nodes in a graph (#16836)
* Fix the attribute inference omitting nodes
* Add test
* Cleaning
* Fix lint
* Fix TransposeShape
* Fix WhileLoopType
* Changing a/b test for fusion to a/(b+1) to increase numerical stability
* Revert "Mixed precison binary op backward (use in) for numpy (#16791)" - this reverts commit 8b58b78.
Description
Fixes #16785.
This PR adds an additional `symbol_` member to `GraphExecutor`, which contains the unoptimized symbol used by the executor. That symbol is then passed to the child executors when calling `Reshape`, instead of the forward portion of the already optimized graph. This prevents a failure that occurred when the forward portion of the optimized symbol (containing operators like `_FusedOp`, which do not have an `FGradient` method because they are supposed to work on the full graph) was passed to the new executor, and the new executor tried to construct the full graph out of it.

@roywei @samskalicky FYI, please validate that this PR fixes the issue with Keras-MXNet integration.
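The mechanism can be illustrated with a toy model (these are not MXNet's actual classes; `Symbol`, `optimize`, and the `differentiable` flag are stand-ins for the unoptimized symbol, pointwise fusion, and the presence of `FGradient`):

```python
# Toy sketch of the fix: keep the unoptimized symbol on the executor so
# Reshape can rebuild a full (differentiable) graph for the child executor.

class Symbol:
    """Stand-in for an MXNet symbol; unoptimized symbols support gradients."""
    def __init__(self, name, differentiable=True):
        self.name = name
        self.differentiable = differentiable

def optimize(sym):
    # Stand-in for fusion: ops like _FusedOp have no FGradient method,
    # so the optimized forward graph alone cannot produce a backward pass.
    return Symbol(sym.name + "_fused", differentiable=False)

class GraphExecutor:
    def __init__(self, symbol):
        self.symbol_ = symbol          # unoptimized symbol (added by this PR)
        self.graph = optimize(symbol)  # optimized graph actually executed

    def reshape(self, new_shape):
        # Before the fix the child was built from self.graph (the fused
        # forward portion) and constructing the full graph failed.
        # After the fix the child is built from the stored unoptimized
        # symbol (new_shape is ignored in this toy model).
        return GraphExecutor(self.symbol_)

exe = GraphExecutor(Symbol("net"))
child = exe.reshape((2, 8))
assert child.symbol_.differentiable  # full graph can still be constructed
```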
Comments
This also changes the `Reshape` function in the executor to fully share the graph between the executor and its children, instead of sharing only a forward portion of it and relying on the child executor to create the entire graph out of it; it is wasteful to recreate the full graph every time just to do a simple reshape.

Graphs partitioned via the Subgraph API during `Bind`/`SimpleBind` cannot be recreated from the unoptimized symbol. Since the Subgraph API is used only for inference, it does not expose this issue. In order not to break the Subgraph API after reshape, the `symbol_` field is populated with the symbol obtained after building the subgraph.
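The Subgraph API caveat can be sketched the same way (`build_subgraph` and `SubgraphExecutor` are hypothetical stand-ins for backend partitioning during `Bind`/`SimpleBind`, not real MXNet APIs):

```python
# Toy sketch: a partitioned graph cannot be recreated from the original
# unoptimized symbol, so symbol_ is set to the post-partitioning symbol.

def build_subgraph(sym):
    # Stand-in for backend partitioning performed at Bind/SimpleBind time.
    return sym + "_partitioned"

class SubgraphExecutor:
    def __init__(self, user_symbol):
        partitioned = build_subgraph(user_symbol)
        # symbol_ holds the symbol *after* subgraph building, so a later
        # reshape shares the partitioned graph instead of trying (and
        # failing) to recreate it from the original symbol.
        self.symbol_ = partitioned

exe = SubgraphExecutor("net")
assert exe.symbol_ == "net_partitioned"
```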