This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-876] make CachedOp a normal operator #11641

Merged (16 commits) on Sep 23, 2018

Conversation

@zheng-da (Contributor) commented Jul 11, 2018

Description

Currently, CachedOp is used to execute the graph in a Gluon hybrid block when the block is hybridized. It is registered as an operator, but it doesn't have a full set of operator attributes, so it can't be used as a regular operator or placed in a normal NNVM computation graph. This PR extends CachedOp to make it a normal operator. The main motivation is to use it as a default subgraph operator, as proposed in "Unified integration with external acceleration libraries".

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
    • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
    • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
    • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
    • For user-facing API changes, API doc string has been updated.
    • For new C++ functions in header files, their functionalities and arguments are documented.
    • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
    • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@@ -1073,6 +1073,9 @@ MXNET_DLL int MXSymbolGetInputSymbols(SymbolHandle sym, SymbolHandle **inputs,
MXNET_DLL int MXSymbolCutSubgraph(SymbolHandle sym, SymbolHandle **inputs,
int *input_size);

int MXMakeSubgraph(SymbolHandle sym, SymbolHandle *input_symbols, mx_uint num_inputs,
Contributor:

Doc missing
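
For reference, a sketch of the kind of doc comment being asked for. The wording is inferred from the C++ body quoted further down, not taken from the author, and since the prototype above is truncated only the visible parameters are covered:

/*!
 * \brief Wrap the graph of `sym` into a single CachedOp (subgraph) node whose
 *        inputs are `input_symbols`, and create a new symbol holding that node.
 *        (Wording inferred from this PR's C++ changes; to be confirmed by the author.)
 * \param sym the symbol whose graph becomes the body of the subgraph node
 * \param input_symbols the symbols used as the inputs of the new subgraph node
 * \param num_inputs the number of entries in input_symbols
 * ... (remaining parameters omitted; the prototype is truncated above)
 */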

@@ -336,3 +336,10 @@ def check_data(inputs, in_type, msg):
states = states[0]

return (outs, states)

def make_subgraph(subg, *args):
Contributor:

Doc missing

// Construct a node for this subgraph.
std::vector<nnvm::NodeEntry> inputs(num_inputs);
for (size_t i = 0; i < inputs.size(); i++) {
nnvm::Symbol *s = static_cast<nnvm::Symbol*>(input_symbols[i]);
Contributor:

Can it be const?


// Create CachedOp for the node.
std::vector<std::pair<std::string, std::string> > kwargs;
kwargs.push_back(std::pair<std::string, std::string>("inline_limit", "0"));
Contributor:

Why not emplace_back? More efficient and less noisy...
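
For illustration, a minimal self-contained sketch of the difference being pointed out (not part of the PR):

#include <string>
#include <utility>
#include <vector>

int main() {
  std::vector<std::pair<std::string, std::string> > kwargs;
  // push_back spells out the pair type and builds a temporary that is then
  // moved/copied into the vector.
  kwargs.push_back(std::pair<std::string, std::string>("inline_limit", "0"));
  // emplace_back forwards the arguments and constructs the pair in place,
  // avoiding the temporary and the repeated type name.
  kwargs.emplace_back("inline_limit", "0");
  return 0;
}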

n->attrs.parsed = std::make_shared<mxnet::CachedOp>(*s, kwargs);

// Create a new symbol for this node.
s = new nnvm::Symbol();
Contributor:

Can this leak? Who manages this one?

zheng-da (author):

I just followed the implementations in other APIs. The symbol will be saved in a Python symbol handle. AFAIK, once the Python symbol handle is destroyed, the symbol object will also be destroyed.

std::shared_ptr<CachedOp> op;
OpStatePtr forward_state;

CachedOpActualState(std::shared_ptr<CachedOp> op) {
Contributor:

By reference?
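
A small stand-alone sketch of the alternative the reviewer is hinting at (CachedOp is a stand-in struct here, and the other members of the state are omitted):

#include <memory>

struct CachedOp {};  // stand-in for the real class, for illustration only

struct CachedOpActualState {
  std::shared_ptr<CachedOp> op;

  // Taking the shared_ptr by const reference avoids the extra atomic
  // refcount increment/decrement of a by-value parameter copy; the
  // member initialization still bumps the count exactly once.
  explicit CachedOpActualState(const std::shared_ptr<CachedOp>& o) : op(o) {}
};

int main() {
  auto op = std::make_shared<CachedOp>();
  CachedOpActualState state(op);
  return 0;
}

Passing by value and moving into the member is an equally common choice when the caller can hand over ownership.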

@@ -1047,6 +1047,105 @@ void CachedOp::Backward(
Engine::Get()->set_bulk_size(prev_bulk_size);
}

struct CachedOpActualState {
Contributor:

Missing class doc

}
};

void CachedOpForward(const OpStatePtr& state_ptr,
Contributor:

Missing short documentation stating the intention and how it works
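
A sketch of the short doc comment being requested, pieced together from the state struct above and the forward/backward code below; the wording is a guess, not the author's:

/*
 * CachedOpForward: stateful forward function used when a CachedOp node is
 * executed inside a regular NNVM graph. It takes the CachedOpActualState
 * held in state_ptr, runs the wrapped CachedOp's Forward on the provided
 * inputs/outputs, and stores the resulting forward state in forward_state
 * so that the backward function can replay the recorded graph and release
 * it afterwards (see the forward_state.reset() in the backward path).
 */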

const std::vector<bool> &save_outputs = s.op->save_outputs();
CHECK_EQ(save_inputs.size(), in_end - in_begin);
CHECK_EQ(s.op->num_outputs(), out_end - out_begin);
for (auto it = in_begin; it != in_end; it++) {
Contributor:

++it is potentially faster

zheng-da (author):

really? Where is this documented?

@larroy (Contributor), Jul 11, 2018:

@zheng-da it's a well-known thing for old C++ farts. It's in reference C++ books like http://www.cppstdlib.com/ or Stroustrup. https://stackoverflow.com/questions/1077026/incrementing-iterators-it-more-efficient-than-it

In most cases it probably doesn't make a difference, especially for simple iterators where the iterator is just a pointer. That's why I said it's potentially faster. It's more of a good idiomatic practice to always use preincrement.
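
A tiny self-contained example of the idiom under discussion (not from the PR); for trivial iterators the compiler usually emits identical code, which is why it is framed as "potentially" faster:

#include <iostream>
#include <list>

int main() {
  std::list<int> values = {1, 2, 3};
  // Post-increment (it++) conceptually returns a copy of the old iterator
  // before advancing, so the old value has to be materialized somewhere.
  for (auto it = values.begin(); it != values.end(); it++)
    std::cout << *it << ' ';
  std::cout << '\n';
  // Pre-increment (++it) advances in place and is the idiomatic default.
  for (auto it = values.begin(); it != values.end(); ++it)
    std::cout << *it << ' ';
  std::cout << '\n';
  return 0;
}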

@@ -116,6 +116,24 @@ class CachedOp {
DispatchMode* dispatch_mode,
std::vector<int> *in_attrs,
std::vector<int> *out_attrs);
bool ForwardInferShape(
Contributor:

Missing documentation in function prototypes

'b': mx.nd.empty(shape=(10, 10))})
e1.forward()
e2.forward()
assert_almost_equal(e1.outputs[0].asnumpy(), e2.outputs[0].asnumpy(),
Contributor:

Why almost equal and not equal?

Contributor:

I think due to floating point precision
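
A minimal example of why bit-exact comparison is usually avoided in such tests (not from the PR): two numerically equivalent executions that merely accumulate in a different order can disagree in the last bits.

#include <cstdio>

int main() {
  double a = 0.1, b = 0.2, c = 0.3;
  // The same three values summed in a different order give different doubles.
  double left  = (a + b) + c;
  double right = a + (b + c);
  std::printf("%.17g\n%.17g\nequal: %d\n", left, right, left == right);
  return 0;
}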

@larroy (Contributor) left a comment:

Added comments.

@zheng-da (author):

@larroy thanks for the review. The code, especially the API design, is still experimental. I'll let you know when the code is ready for review.

@reminisce (Contributor):

I think functions like shape/type inference, FMutate, etc. used in operator registration should not belong to CachedOp only. They should be made generally available to subgraph-type operators, while CachedOp is just a special case.

@zheng-da force-pushed the cachedop branch 2 times, most recently from 6a03241 to 476fa57 on July 25, 2018 08:14
@zheng-da mentioned this pull request Jul 26, 2018
@sandeep-krishnamurthy added the Operator, Backend, and pr-work-in-progress labels on Aug 8, 2018
@zheng-da force-pushed the cachedop branch 3 times, most recently from 0b5df8b to 910cc05 on August 24, 2018 23:56
@zheng-da (author):

This PR should be rebased on #12157

@zheng-da force-pushed the cachedop branch 2 times, most recently from cea32a3 to 3b616c2 on August 31, 2018 04:07
@zheng-da changed the title from "[WIP] make CachedOp a normal operator" to "[MXNET-876] make CachedOp a normal operator" on Aug 31, 2018
inline bool DefaultSubgraphOpShape(const nnvm::NodeAttrs& attrs,
std::vector<TShape> *in_shapes,
std::vector<TShape> *out_shapes) {
return DefaultSubgraphOpShape1(*attrs.subgraphs[0], in_shapes, out_shapes);
Contributor:

Maybe rename DefaultSubgraphOpShape1 to something that makes it clear it's a helper function, for better readability?
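
For concreteness, a sketch of what that suggestion might look like; the new name is purely illustrative and the helper's signature is abbreviated from the hunk above:

// Hypothetical rename: make it obvious this overload is the shared helper.
inline bool DefaultSubgraphOpShapeHelper(const nnvm::Symbol& subgraph,
                                         std::vector<TShape>* in_shapes,
                                         std::vector<TShape>* out_shapes);

inline bool DefaultSubgraphOpShape(const nnvm::NodeAttrs& attrs,
                                   std::vector<TShape>* in_shapes,
                                   std::vector<TShape>* out_shapes) {
  return DefaultSubgraphOpShapeHelper(*attrs.subgraphs[0], in_shapes, out_shapes);
}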

const auto& idx = g.indexed_graph();
const auto &outputs = idx.outputs();
/*
* This is the operator state of CachedOp when CachedOp is used in the symbol
Contributor:

Please elaborate on the necessity of adding this data structure in the description.

// Clean up what we recorded.
s.forward_state.reset();

// The arrays in out_ptrs may be changed by CachedOp.
Member:

why would it be changed?

Member:

Thanks for updating the comments

else
orig_is_train = Imperative::Get()->is_training();
// TODO(zhengda) is it right to use false here?
s.op->Backward(false, s.forward_state, in_ptrs, req, out_ptrs);
Member:

Please add more comments on retain_graph=False.

Labels: Backend, Operator, pr-work-in-progress

7 participants