Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Dynamic subgraph compile support #17623

Merged
merged 56 commits into from
Mar 19, 2020
Merged

Conversation

samskalicky
Copy link
Contributor

@samskalicky samskalicky commented Feb 19, 2020

Description

This PR adds support for passing the NDArrays from the existing optimize_for API down to the reviewSubgraph function in an external library. It also adds a new API for HybridBlock called optimize_for that can partition the model without running a forward pass.

Feature changes

  • Adds new API to HybridBlock optimize_for that partitions the model but does not call the cachedOp
  • Modifies the subgraph library example to optionally require args to be provided
  • Adds annotation on subgraph inputs for the name of the original param so that inputs can be mapped and passes annotations to input nodes of subgraphs
  • Adds support for tensors in MKLDNN format, calls Reorder2Default

New tests

  • Adds a new test to partition operators that directly consume params
  • add a new model to test where ops to be partitioned have args/params

Bug Fixes

  • fixes bug in passing ids vector by value instead of by reference
  • fixes bug in passing copies of attributes instead of by reference
  • fixes bug where _cached_graph was not updated after partitioning
  • fixes memory leak where user-specified attributes on subgraph ops were not freed if subgraph was rejected
  • fixes problem incorrectly indexing into shape/dtype maps when annotating the graph

Docs

  • Updates the README doc with the latest changes described above

Design - Passing NDArrays

In #15886 the optimize_for API was added to the Symbol class to give users an easy API to use to partition their models. The args argument took the params to the model to use for shape/type inference. But the NDArray values were never used. In this PR, we pass the NDArray data values to the backend library and also add the ability to pass auxiliary params too. Now, the optimize_for API looks like:

sym = sym.optimize_for('default', args, aux, ctx=mx.cpu())

On the backend library side, the reviewSubgraph API has an additional arguments for args and aux that are maps of named MXTensors.

MXReturnValue reviewSubgraph(std::string json, int subraph_id, bool* accept,
                             std::unordered_map<std::string, std::string>& options,
                             std::unordered_map<std::string, std::string>& attrs) {
                             std::unordered_map<std::string, std::string>& attrs,
                             std::map<std::string, MXTensor>& args,
                             std::map<std::string, MXTensor>& aux);

These additional maps of args and aux will allow backends to compile subgraphs and use param/weight values during compilation. For example, this will enable TensorRT to be implemented as a backend library and eliminate the init_tensorrt_params API that is needed to provide the params to the TensorRT backend. It will also enable compiling subgraphs with TVM and other compile-centric backends.

This PR also tags subgraph inputs with a new attribute "argName" if the input to the subgraph comes directly from a model param/weight. This will be used when backend libraries lookup params for subgraph inputs since the names of subgraph inputs are modified from the original model param name. The "argName" can be looked up in the new args or aux arguments passed to the reviewSubgraph API to get the actual tensor values. Heres an example partitioned graph, notice that the original input param is named a and the input to the subgraph is named a0. You can see the "argName" attribute is set on the input refer to original name a.

{
  "nodes": [
    {
      "op": "null", 
      "name": "a", 
      "attrs": {
        "dtype": "0", 
        "shape": "[3,2]"
      }, 
      "inputs": []
    }, 
    {
      "op": "_custom_subgraph_op", 
      "name": "_op0", 
      "attrs": {"myKey": "myVal"}, 
      "inputs": [[0, 0, 0]], 
      "subgraphs": [
        {
          "nodes": [
            {
              "op": "null", 
              "name": "a0", 
              "attrs": {
                "argName": "a", 
                "isArg": "True", 
                "isAux": "False"
              }, 
              "inputs": []
            }, 

            ...

        }
      ]
    }
  ], 
}

Design - Partitioning HybridBlock without infer

In #15969 the hybridize API was modified to accept the backend name and partition the model during the hybridize flow. This flow is for users who intend to run inference immediately after partitioning. But for users that want to partition but not run inference, they are out of luck. When compiling the model, a user will compile on a machine suitable for compilation. And that compilation may take tens of minutes depending on the model and optimization strategy. Then they will export their model, and copy it to the machine they intent to run inference on.

In this flow, the machine for running compilation may not be suitable for running a complete forward pass. In this PR we add an optimize_for API to the HybridBlock class that runs most of what is part of the hybridize flow including the forward pass. But it does not actually call the cachedOp.

The new HybridBlock optimize_for API combines the argument lists of forward and hybridize like:

def optimize_for(self, x, *args, backend=None, backend_opts=None, **kwargs)

Notice that the first two args are the same as in the forward API and the last 3 args are the same as in the hybridize API. The active argument is dropped since it will default to true in this function in order to partition the model.

The expectation is that users will do the following:

block.optimize_for(x, backend='myProp')
block.export('partitioned')

This is equivalent to:

block.hybridize(backend='myProp')
block(x)
block.export('partitioned')

But does not actually execute the forward pass. Users can still use the partitioned block to run forward passes just like with hybridize, for example:

block.optimize_for(x, backend='myProp')
block(x)

Calling optimize_for is equivalent to hybridize in this usage, and since the cachedOp is already created before block(x), it is not recreated during the forward pass.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@samskalicky
Copy link
Contributor Author

If #17585 gets merged first, then ill update the doc with the new compile info. If this one gets merged first ill got modify #17585 to add the description.

@samskalicky
Copy link
Contributor Author

@mxnet-label-bot add [pr-awaiting-review]

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Feb 19, 2020
src/c_api/c_api.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@mseth10 mseth10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@samskalicky samskalicky mentioned this pull request Feb 19, 2020
4 tasks
@samskalicky
Copy link
Contributor Author

What do you think @HahTK?

@HahTK
Copy link

HahTK commented Feb 21, 2020

It sounds like the goal for passing data is to allow data to be compiled into the bin (example tensort bin) for that subgraph to avoid an init step.

If the weights are in the bin, then we structuring this such that weights can not be changed with calling optimize_for again using new weights ?

Is there any reason why we only do this for args and not auxs ?

@samskalicky
Copy link
Contributor Author

It sounds like the goal for passing data is to allow data to be compiled into the bin (example tensort bin) for that subgraph to avoid an init step.

If the weights are in the bin, then we structuring this such that weights can not be changed with calling optimize_for again using new weights ?

Using the weights in optimize_for is a one-way flow. You cannot call optimize_for again with new weights. You would need to call it on the original graph with new weights. The assumption is that weights would only be used for inference. So presumably the model is already trained, and the weights are frozen.

Is there any reason why we only do this for args and not auxs ?

Thanks for pointing this out. I'll add them too

int idx = 0;
// find the beginning of the output shape for the particular output index
for (unsigned x=0; x < orig.index; x++)
idx = shape.find("[", idx+1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if idx points to the right index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

int idx = 0;
// find the beginning of the output dtype for the particular output index
for (unsigned x=0; x < orig.index; x++)
idx = dtype.find("[", idx+1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modify this for list of integers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

for (unsigned i=0; i < sym.outputs.size(); i++) {
nnvm::Node* n = sym.outputs[i].node.get();
if (n->attrs.dict.count("__shape__") > 0) {
std::string& shape = n->attrs.dict["__shape__"];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modify logic for the case when n is a subgraph node and shape a list of lists

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@samskalicky
Copy link
Contributor Author

@mseth10 thanks! lmk what you think of the latest commit

@mseth10
Copy link
Contributor

mseth10 commented Mar 17, 2020

@mseth10 thanks! lmk what you think of the latest commit

All latest changes look good to me. Thanks @samskalicky for addressing multiple subgraph issues in this PR.

@leezu
Copy link
Contributor

leezu commented Mar 17, 2020

@ptrendx any concerns on TensorRT or is this good to merge?

@leezu leezu requested a review from ptrendx March 17, 2020 21:24
ss << ",";
}
ss << "]";
n->attrs.dict["__shape__"] = ss.str();
Copy link
Contributor Author

@samskalicky samskalicky Mar 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use MX_STR_SHAPE instead of "__shape__"

ss << ",";
}
ss << "]";
n->attrs.dict["__dtype__"] = ss.str();
Copy link
Contributor Author

@samskalicky samskalicky Mar 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use MX_STR_DTYPE instead of "__dtype__"

@samskalicky
Copy link
Contributor Author

samskalicky commented Mar 18, 2020

@leezu, Reviewed offline with @ptrendx and @Caenorst. Decided that there are a few missing features needed for TensorRT to use this compile API. But decided to list them here #17236 and implement them in a subsequent PR (#17885). We can go ahead and merge this one as it is still useful for compiling for TVM.

After this PR is merged i'll open another PR and start working on the missing features next.

@leezu leezu merged commit f7c4323 into apache:master Mar 19, 2020
anirudh2290 added a commit to anirudh2290/mxnet that referenced this pull request Mar 27, 2020
* 'master' of https://github.com/apache/incubator-mxnet: (192 commits)
  * impl - FFI for np einsum (apache#17869)
  [Numpy] FFI for diag/diagonal/diag_indices_from (apache#17789)
  [Numpy] Kron operator (apache#17323)
  cmake: Set DMLC_LOG_FATAL_THROW only for building mxnet and not for tvm (apache#17878)
  Add simplified HybridBlock.forward without F (apache#17530)
  Use FP32 copy of weights for norm (multitensor LAMB optimizer) (apache#17700)
  Use multi-tensor sumSQ in clip_global_norm (apache#17652)
  [Numpy] Add op fmax, fmin, fmod (apache#17567)
  Adding sparse support to MXTensor for custom operators (apache#17569)
  Update 3rdparty/mkldnn to v1.2.2 (apache#17313)
  Dynamic subgraph compile support (apache#17623)
  Refactor cpp-package CMakeLists.txt & add missing inference/imagenet_inference (apache#17835)
  staticbuild: Fix potential user-assisted execution of arbitrary code  (apache#17860)
  * FFI for np.argmax and np.argmin (apache#17843)
  ffi for roll/rot90 (apache#17861)
  Skip test_multi_worker_dataloader_release_pool on OS X (apache#17797)
  add ffi for full_like, binary (apache#17811)
  HybridBlock.export() to return created filenames (apache#17758)
  Fix SoftReLU fused operator numerical stability (apache#17849)
  CI: Test clang10 cpu & gpu builds with -WError (apache#17830)
  ...
MoisesHer pushed a commit to MoisesHer/incubator-mxnet that referenced this pull request Apr 10, 2020
This PR adds support for passing the NDArrays from the existing optimize_for API down to the reviewSubgraph function in an external library. It also adds a new API for HybridBlock called optimize_for that can partition the model without running a forward pass.

Feature changes

    Adds new API to HybridBlock optimize_for that partitions the model but does not call the cachedOp
    Modifies the subgraph library example to optionally require args to be provided
    Adds annotation on subgraph inputs for the name of the original param so that inputs can be mapped and passes annotations to input nodes of subgraphs
    Adds support for tensors in MKLDNN format, calls Reorder2Default

New tests

    Adds a new test to partition operators that directly consume params
    add a new model to test where ops to be partitioned have args/params

Bug Fixes

    fixes bug in passing ids vector by value instead of by reference
    fixes bug in passing copies of attributes instead of by reference
    fixes bug where _cached_graph was not updated after partitioning
    fixes memory leak where user-specified attributes on subgraph ops were not freed if subgraph was rejected
    fixes problem incorrectly indexing into shape/dtype maps when annotating the graph

Docs

    Updates the README doc with the latest changes described above
samskalicky added a commit to samskalicky/incubator-mxnet that referenced this pull request Apr 15, 2020
This PR adds support for passing the NDArrays from the existing optimize_for API down to the reviewSubgraph function in an external library. It also adds a new API for HybridBlock called optimize_for that can partition the model without running a forward pass.

Feature changes

    Adds new API to HybridBlock optimize_for that partitions the model but does not call the cachedOp
    Modifies the subgraph library example to optionally require args to be provided
    Adds annotation on subgraph inputs for the name of the original param so that inputs can be mapped and passes annotations to input nodes of subgraphs
    Adds support for tensors in MKLDNN format, calls Reorder2Default

New tests

    Adds a new test to partition operators that directly consume params
    add a new model to test where ops to be partitioned have args/params

Bug Fixes

    fixes bug in passing ids vector by value instead of by reference
    fixes bug in passing copies of attributes instead of by reference
    fixes bug where _cached_graph was not updated after partitioning
    fixes memory leak where user-specified attributes on subgraph ops were not freed if subgraph was rejected
    fixes problem incorrectly indexing into shape/dtype maps when annotating the graph

Docs

    Updates the README doc with the latest changes described above
pengzhao-intel pushed a commit that referenced this pull request Apr 16, 2020
…18069)

* Dynamic subgraph compile support (#17623)

This PR adds support for passing the NDArrays from the existing optimize_for API down to the reviewSubgraph function in an external library. It also adds a new API for HybridBlock called optimize_for that can partition the model without running a forward pass.

Feature changes

    Adds new API to HybridBlock optimize_for that partitions the model but does not call the cachedOp
    Modifies the subgraph library example to optionally require args to be provided
    Adds annotation on subgraph inputs for the name of the original param so that inputs can be mapped and passes annotations to input nodes of subgraphs
    Adds support for tensors in MKLDNN format, calls Reorder2Default

New tests

    Adds a new test to partition operators that directly consume params
    add a new model to test where ops to be partitioned have args/params

Bug Fixes

    fixes bug in passing ids vector by value instead of by reference
    fixes bug in passing copies of attributes instead of by reference
    fixes bug where _cached_graph was not updated after partitioning
    fixes memory leak where user-specified attributes on subgraph ops were not freed if subgraph was rejected
    fixes problem incorrectly indexing into shape/dtype maps when annotating the graph

Docs

    Updates the README doc with the latest changes described above

* Adding sparse support to MXTensor for custom operators (#17569)

* Added enum for sparse storage

* Add structure for Dense and Sparse

* redesign the data structure for MXSparse

* pull out aux data from sparse NDArray

* Added more sparse arguments to API interface

* Passed sparse from c_api to lib_api.h and set in MXTensor

* Fix indent

* fix segfault

* Fix NDArray to MXTensor errors

* Add a sample of sparse(CSR) transpose

* Make CSR transpose temporarily work by hardcoding

* Fixed sparse output size(Refined)

* Add tests for symbolic and stateful ops

* Added a sample for row sparse transpose

* Added real row sparse transpose

* Fix output size issue by adding lambda for CheckAndAlloc()

* Fix mixed storage formats error

* Added infer storage type function

* resolve comments

* Set inferSType as optional function

* Resolve comments

* Add error messages

* Resolve comments

* verify transpose ops results

* fix sanity check

* update MX_LIBRARY_VERSION to 5

* Custom Operator Random Number Generator Support (#17762)

Add random number generator support for custom operator libraries.

Design: We pass from MXNet the initialized and seeded states, located on CPU and GPU, to custom library. So user could use those seeds to generate deterministic values from a given seed passed to MXNet. Basically this workflow:

mx.random.seed(128)
r1 = mx.nd.some_custom_random_op(data)
mx.random.seed(128)
r2 = mx.nd.some_custom_random_op(data)
assert (r1 == r2)

This PR does not let custom library generate exactly the same sequence of random numbers comparing to MXNet

This is a continuation of the custom operator project #15921 and #17270

Co-authored-by: guanxinq <[email protected]>
Co-authored-by: Ziyi Mu <[email protected]>
pengzhao-intel pushed a commit that referenced this pull request Apr 16, 2020
* Dynamic subgraph compile support (#17623)

This PR adds support for passing the NDArrays from the existing optimize_for API down to the reviewSubgraph function in an external library. It also adds a new API for HybridBlock called optimize_for that can partition the model without running a forward pass.

Feature changes

    Adds new API to HybridBlock optimize_for that partitions the model but does not call the cachedOp
    Modifies the subgraph library example to optionally require args to be provided
    Adds annotation on subgraph inputs for the name of the original param so that inputs can be mapped and passes annotations to input nodes of subgraphs
    Adds support for tensors in MKLDNN format, calls Reorder2Default

New tests

    Adds a new test to partition operators that directly consume params
    add a new model to test where ops to be partitioned have args/params

Bug Fixes

    fixes bug in passing ids vector by value instead of by reference
    fixes bug in passing copies of attributes instead of by reference
    fixes bug where _cached_graph was not updated after partitioning
    fixes memory leak where user-specified attributes on subgraph ops were not freed if subgraph was rejected
    fixes problem incorrectly indexing into shape/dtype maps when annotating the graph

Docs

    Updates the README doc with the latest changes described above

* Adding sparse support to MXTensor for custom operators (#17569)

* Added enum for sparse storage

* Add structure for Dense and Sparse

* redesign the data structure for MXSparse

* pull out aux data from sparse NDArray

* Added more sparse arguments to API interface

* Passed sparse from c_api to lib_api.h and set in MXTensor

* Fix indent

* fix segfault

* Fix NDArray to MXTensor errors

* Add a sample of sparse(CSR) transpose

* Make CSR transpose temporarily work by hardcoding

* Fixed sparse output size(Refined)

* Add tests for symbolic and stateful ops

* Added a sample for row sparse transpose

* Added real row sparse transpose

* Fix output size issue by adding lambda for CheckAndAlloc()

* Fix mixed storage formats error

* Added infer storage type function

* resolve comments

* Set inferSType as optional function

* Resolve comments

* Add error messages

* Resolve comments

* verify transpose ops results

* fix sanity check

* update MX_LIBRARY_VERSION to 5

* Custom Operator Random Number Generator Support (#17762)

Add random number generator support for custom operator libraries.

Design: We pass from MXNet the initialized and seeded states, located on CPU and GPU, to custom library. So user could use those seeds to generate deterministic values from a given seed passed to MXNet. Basically this workflow:

mx.random.seed(128)
r1 = mx.nd.some_custom_random_op(data)
mx.random.seed(128)
r2 = mx.nd.some_custom_random_op(data)
assert (r1 == r2)

This PR does not let custom library generate exactly the same sequence of random numbers comparing to MXNet

This is a continuation of the custom operator project #15921 and #17270

Co-authored-by: guanxinq <[email protected]>
Co-authored-by: Ziyi Mu <[email protected]>
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
This PR adds support for passing the NDArrays from the existing optimize_for API down to the reviewSubgraph function in an external library. It also adds a new API for HybridBlock called optimize_for that can partition the model without running a forward pass.

Feature changes

    Adds new API to HybridBlock optimize_for that partitions the model but does not call the cachedOp
    Modifies the subgraph library example to optionally require args to be provided
    Adds annotation on subgraph inputs for the name of the original param so that inputs can be mapped and passes annotations to input nodes of subgraphs
    Adds support for tensors in MKLDNN format, calls Reorder2Default

New tests

    Adds a new test to partition operators that directly consume params
    add a new model to test where ops to be partitioned have args/params

Bug Fixes

    fixes bug in passing ids vector by value instead of by reference
    fixes bug in passing copies of attributes instead of by reference
    fixes bug where _cached_graph was not updated after partitioning
    fixes memory leak where user-specified attributes on subgraph ops were not freed if subgraph was rejected
    fixes problem incorrectly indexing into shape/dtype maps when annotating the graph

Docs

    Updates the README doc with the latest changes described above
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-merge Review and CI is complete. Ready to Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants