Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Subgraph API for integrating accelerators with MXNet #12157

Merged
merged 14 commits into from
Aug 31, 2018

Conversation

reminisce
Copy link
Contributor

@reminisce reminisce commented Aug 14, 2018

Description

This is a joint work of @reminisce and @zheng-da on implementing subgraph API for integrating backend accelerators (TVM, MKLDNN, TensorRT, nGraph, etc.) with MXNet. See design proposal.

Highlighted deliverables

  • A graph partitioning algorithm accepting custom partitioning rules (through subclassing SubgraphProperty).

  • A naive graph partitioning rule called DefaultSubgraphProperty partitioning graphs based upon provided operator names. For example, in the left graph, operators B and C are designated to be grouped into a subgraph operator node. After applying the partitioning algorithm, nodes B and C are replaced by a subgraph operator node in the graph on the right-hand-side.
    image

  • DefaultSubgraphOperator. A stateful operator storing a subgraph for execution by CachedOp.

  • NDArray version number for tracking mutations of NDArrays. This is useful, for example, in a stateful MKLDNN subgraph operator for checking whether input data has changed since the last forward call to decide if to trigger data format conversion.

  • Graph partitioning enabled in GraphExecutor for the forward pass. Triggered by an environment variable MXNET_SUBGRAPH_BACKEND.

Coming soon after this PR

  • Backward pass support in DefaultSubgraphOperator.
  • Support graph partitioning for Gluon/CacheOp.
  • Tutorial for developers on integrating backend accelerators using the Subgraph API.
  • Real backend accelerator integration with MXNet.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@pengzhao-intel
Copy link
Contributor

Thanks @reminisce it's really helpful for us :)

@ZhennanQin @lvtao, please take a look the compatibility of current code with our interface :)

nnvm::NodePtr n = nnvm::Node::Create();
n->attrs.op = Op::Get("_default_subgraph_op");
n->attrs.name = "_default_subgraph_op" + std::to_string(subgraph_id);
n->attrs.subgraphs.push_back(std::make_shared<nnvm::Symbol>(sym));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subgraph symbol is pushed back into n->attrs.subgraphs, how can we guarantee it is at index 0? Because for DefaultSubgraphOpNumInputs(), it always try to get symbol from index 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n is a newly created node and its attrs.subgraphs is empty before push_back is called. It's actually up to you to decide which index is for the sym in your specific subgraph property. Just make sure you get the correct subgraph in your subgraph operator implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I heard that control dependence will use attrs.subgraphs as well. I'm not sure how this attribute will be used for dependence purpose. Do we have any instruction to define the sequence for attrs.subgraphs?Eg. always put subgraph symbol in index 0, and following dependence purpose node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, there is no fixed convention of doing this. @zheng-da
I don't think you need to worry about this if you can ensure the consistency between your specific subgraph property and subgraph operator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can ensure the consistency between subgraph property and subgraph operator, and we also need to ensure nobody changes it after subgraph partition. It's better to add some check for this consistency in case it breaks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing you can do to enforce the consistency is defining some enums as indices of forward graph, backward graph, fused graph, etc in attrs.subgraphs, and always use those enums to get the desired subgraphs. It also increases the code readability.


#define MXNET_REGISTER_SUBGRAPH_PROPERTY(Name, SubgraphPropertyType) \
static DMLC_ATTRIBUTE_UNUSED auto __make_ ## SubgraphPropertyType ## _ ## Name ## __ = \
SubgraphPropertyRegistry::Get()->__REGISTER__(#Name, &SubgraphPropertyType::Create)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we implement and call __REGISTER_OR_GET__ here (which checks if it is part of map else calls __REGISTER__). This may be needed because it is possible that __REGISTER__ may be called twice on the same name leading to exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intended. The registration can only happen once.

Copy link
Member

@anirudh2290 anirudh2290 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With __REGISTER_OR_GET__ also register will happen only once. To explain an use case where register wont work: lets say I create a custom subgraph property and call MXNET_REGISTER_SUBGRAPH_PROPERTY from inside the namespace. Now i include this custom subgraph property header file in multiple source files. If they are part of different translation units __REGISTER__ called multiple times and fails since the static variable scope is limited to a single translation unit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good catch. I will fix it. Thanks.

@@ -436,7 +436,8 @@ inline void PushFComputeEx(const FComputeEx& fn,
}
};

if (exec_type == ExecType::kCrossDeviceCopy) {
if (exec_type == ExecType::kCrossDeviceCopy
|| exec_type == ExecType::kSubgraphExec) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need kSubgraphExec here. kSubgraphExec only works for stateful operators.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line is added by you. Nevertheless, isn't _default_subgraph_op a stateful operator? This should be needed to make sure this operator executes in the main thread in CachedOp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I added this line. It's actually not necessary because this is PushFComputeEx, which only pushes FComputeEx. could you help me remove the line?

@@ -39,6 +39,7 @@
#include "../operator/tensor/matrix_op-inl.h"
#include "../operator/tensor/init_op.h"
#include "../operator/nn/mkldnn/mkldnn_base-inl.h"
#include "../engine/engine_impl.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to include this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

for (size_t i = 0; i < outputs.size(); ++i) {
LOG(INFO) << "outputs[" << i << "].version = " << outputs[i].version();
}
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we clean this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, we are going to keep the debugging code for a while.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Removed now.

PrintNodeEntry(*entries[i]);
}
}
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete the debug code?

@@ -165,14 +178,22 @@ class NaiveEngine final : public Engine {
}
CHECK(this->req_completed_)
<< "NaiveEngine only support synchronize Push so far";
// increment var version
for (auto var : mutable_vars) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of deletevariable, will this increment the var version after it is deleted ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Should move this before exec_fun runs.

++i1;
}
}
return PartitionGraph(src, prop_name, arg_shapes, arg_dtypes, arg_stypes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check for prop_name here. Currently only default is supported correct ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a check inside CreateSubgraphProperty.

}
virtual SubgraphSelectorPtr CreateSubgraphSelector() const {
return std::make_shared<ContainOpSelector>(
this->GetAttr<std::unordered_set<std::string>>("op_names"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we override GetAttr for DefaultSubgraphProperty, so that it informs the user that the set has to be provided by the user using C API to test default subgraph property? This will also be a good example for developers writing custom subgraph property.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a templated method. I don't think we can make it a virtual method. but I don't really know what you mean here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To explain my use case: Lets say a user forgot to call setattr for op_names when using the default_subgraph_property. It would fial with the error that op_names cannot be found in subgraph property. A better error message would have been useful. There may be such cases where classes subclassing SubgraphProperty may want to also implement logic for SetAttr and GetAttr in the subclass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not meant for users to use and is there only for the flexible testing purpose. Furthermore, template method cannot be virtual. But I still cannot understand your use case. How different could SetAttr and GetAttr be in a subclass than the base?

Copy link
Member

@anirudh2290 anirudh2290 Aug 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for example you only want certain attributes to be set for the subgraph property and you want to do some sanity check on whether the attrs can be supported or not. For example supported op_names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for your use case, i think it's fine without customization. Users of subgraph API can write their own code to do sanity check. The main problem is a templated method can't be virtual.

const std::map<std::string, Context>& ctx_map,
const std::vector<Context>& in_arg_ctxes,
const std::vector<Context>& aux_state_ctxes) {
auto subgraph_prop = op::SubgraphPropertyRegistry::Get()->CreateSubgraphProperty(prop_name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If prop_name doesn't exist, we need to handle the failure of creating a subgraph property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a check inside CreateSubgraphProperty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see.

++i1;
}
}
return PartitionGraph(src, prop_name, arg_shapes, arg_dtypes, arg_stypes,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
virtual SubgraphSelectorPtr CreateSubgraphSelector() const {
return std::make_shared<ContainOpSelector>(
this->GetAttr<std::unordered_set<std::string>>("op_names"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a templated method. I don't think we can make it a virtual method. but I don't really know what you mean here.

@piiswrong
Copy link
Contributor

Let's not put this in example since the CAPIs are not meant to be called by users

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@@ -42,6 +43,7 @@ using namespace mxnet::common;
GraphExecutor::GraphExecutor() {
log_verbose_ = dmlc::GetEnv("MXNET_EXEC_VERBOSE_LOGGING", false);
need_grad_ = false;
subgraph_property_ = dmlc::GetEnv("MXNET_SUBGRAPH_BACKEND", std::string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also add MXNET_SUBGRAPH_BACKEND to env_var.md?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, there is no real accelerator integrated with MXNet. Intel team is going to submit their work on MKLDNN using this API set. We can add this evn var along with their PR.

return ret;
}

inline std::vector<ResourceRequest> DefaultSubgraphOpResourceRequest(const nnvm::NodeAttrs& attrs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function aggregates all resources requested by the subgraph. What are these acquired resources distributed to the individual operators in the subgraph?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a convenience function for developers to register with their own subgraph operators. They may or may not use this function for registration depending on their implementation. For the _defafult_subgraph_operator, this is actually not necessary to be there. I have removed it from the registering _defafult_subgraph_operator.

// Determine if the node should be selected for a subgraph.
virtual bool Select(const nnvm::Node &n) = 0;
// Determine if the input node should be selected for a subgraph.
virtual bool SelectInput(const nnvm::Node &n, const nnvm::Node &new_node) = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you document what are n and new_node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@anirudh2290 anirudh2290 dismissed their stale review August 27, 2018 21:59

Review comments addressed

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work guys ! @reminisce @zheng-da

const size_t max_num_retry = simple_nodes.size() * simple_nodes.size();
size_t count = 0;
bool success = false;
while (!success && count < max_num_retry) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the logic behind ssetting max_num_retry to square of simple_nodes.size(). Shouldnt it just be simple_nodes.size()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, yes. But just want to be safe here.

}
for (const auto& entry : top->inputs) {
// when searching for the ancestor, the path cannot cross any subgraph node
auto it = std::find(snodes.begin(), snodes.end(), entry.node.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe add check for it == snodes.end() since it wont happen that it will find the input node in the subgraph nodes ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either case is possible here, so cannot add the CHECK.

@eric-haibin-lin
Copy link
Member

LGTM

Graph partitioner and subgraph op

Fix duplicate entry bugs (apache#11767)

Make subgraph var node name unique (apache#11876)

[DO NOT REVIEW] Fix bug of eliminating cycles (apache#11907)

* Fix cycle bug

* Fix decycle bug

* Fix comment

[DO NOT REVIEW] Subgraph API (apache#12104)

* Initial commit

* Add unit tests

* Fix lint

* Fix lint

* Clean up

* Add graph partitiong to Bind

* Add property name to graph partitioning c api

* Fix unit test gpu context

* Address cr

* Move subgraph to attrs.subgraphs and fix the example

* Fix lint

* Add var version unit test

* Address cr

* Enable unit test that was flaky
@eric-haibin-lin
Copy link
Member

Looks like all comments are addressed. Merging it now

@eric-haibin-lin eric-haibin-lin merged commit a64cf7d into apache:master Aug 31, 2018
lebeg added a commit to lebeg/incubator-mxnet that referenced this pull request Sep 3, 2018
aaronmarkham pushed a commit to aaronmarkham/incubator-mxnet that referenced this pull request Sep 11, 2018
* Graph partitioner and subgraph op (apache#11251)

Graph partitioner and subgraph op

Fix duplicate entry bugs (apache#11767)

Make subgraph var node name unique (apache#11876)

[DO NOT REVIEW] Fix bug of eliminating cycles (apache#11907)

* Fix cycle bug

* Fix decycle bug

* Fix comment

[DO NOT REVIEW] Subgraph API (apache#12104)

* Initial commit

* Add unit tests

* Fix lint

* Fix lint

* Clean up

* Add graph partitiong to Bind

* Add property name to graph partitioning c api

* Fix unit test gpu context

* Address cr

* Move subgraph to attrs.subgraphs and fix the example

* Fix lint

* Add var version unit test

* Address cr

* Enable unit test that was flaky

* Clean up

* Clean up

* Clean up

* Change version return type in NDArray

* Clean up

* Add register or get for subgraph prop registry

* Address cr

* Remove unnecessary code

* Handle var version issue in naive engine

* Delete example

* Remove registration of resource request for default subgraph op

* Add doc string

* Improve doc string
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Sep 19, 2018
* Graph partitioner and subgraph op (apache#11251)

Graph partitioner and subgraph op

Fix duplicate entry bugs (apache#11767)

Make subgraph var node name unique (apache#11876)

[DO NOT REVIEW] Fix bug of eliminating cycles (apache#11907)

* Fix cycle bug

* Fix decycle bug

* Fix comment

[DO NOT REVIEW] Subgraph API (apache#12104)

* Initial commit

* Add unit tests

* Fix lint

* Fix lint

* Clean up

* Add graph partitiong to Bind

* Add property name to graph partitioning c api

* Fix unit test gpu context

* Address cr

* Move subgraph to attrs.subgraphs and fix the example

* Fix lint

* Add var version unit test

* Address cr

* Enable unit test that was flaky

* Clean up

* Clean up

* Clean up

* Change version return type in NDArray

* Clean up

* Add register or get for subgraph prop registry

* Address cr

* Remove unnecessary code

* Handle var version issue in naive engine

* Delete example

* Remove registration of resource request for default subgraph op

* Add doc string

* Improve doc string
@reminisce reminisce mentioned this pull request Mar 4, 2019
7 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants