This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Graph dumper #15285

Closed
wants to merge 23 commits into from

Conversation

@larroy (Contributor) commented Jun 20, 2019

Description

Utility to dump the computational graph, especially during backward, for human consumption.

Dumps the graph to a dot file that can be rendered.

This is intended for developers, although if there's demand it can easily be exposed from Python to get the graph from any variable when is_recording is true. It is done directly from C++; no Python is required.

Added an environment variable which triggers the logic.

It should have no performance impact when it's not enabled, as there's only one additional boolean check during backward; the environment is read only once.
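
For illustration, gating of this sort is typically done by reading the variable once into a static flag. A minimal sketch, where MXNET_DUMP_BACKWARD_GRAPH is a hypothetical name for illustration and not necessarily the variable this PR adds:

#include <cstdlib>

// Read the flag from the environment exactly once; afterwards backward only
// pays for a single boolean test. MXNET_DUMP_BACKWARD_GRAPH is a hypothetical
// name used for illustration.
static bool BackwardGraphDumpEnabled() {
  static const bool enabled = std::getenv("MXNET_DUMP_BACKWARD_GRAPH") != nullptr;
  return enabled;
}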

@apeforest @anirudh2290

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@larroy requested a review from anirudh2290 as a code owner June 20, 2019 01:40
@larroy (Contributor Author) commented Jun 20, 2019

#15198

@apeforest (Contributor) commented:

Hi @larroy, could you please paste a snapshot of what the dumped graph looks like? Thanks.

@Roshrini added the pr-work-in-progress (PR is still work in progress) label Jun 23, 2019
@larroy (Contributor Author) commented Jun 24, 2019

@apeforest I will paste the output the next time I run it; right now I'm working on something else. It outputs a dot graph as in the comment here: https://github.com/apache/incubator-mxnet/pull/15285/files#diff-884210dca53fb4c4ac20ca07aa8ecfdfR82

@larroy (Contributor Author) commented Jun 27, 2019

@apeforest @kshitij12345 sounds familiar?

[image attachment: graphviz (2)]

@larroy (Contributor Author) commented Jun 27, 2019

Named the head grads
[image attachment: graphviz (3)]

@larroy (Contributor Author) commented Jun 27, 2019

First gradient, and second gradient:

Forward graph: 
digraph G {
  "var0" -> "log10 node_0"
}
Backward graph: 
digraph G {
  "head grad #0" -> "_backward_log10 node_0_backward"
  "var0" -> "log10 node_0"
  "var0" -> "_backward_log10 node_0_backward"
}
Forward graph: 
digraph G {
  "head grad #0" -> "_backward_log10 node_1"
  "var0" -> "_backward_log10 node_1"
}
Backward graph: 
digraph G {
  "head grad #0" -> "elemwise_mul node_1_backward_grad_grad_inp"
  "head grad #0" -> "_backward_log10 node_1"
  "var0" -> "_backward_log10 node_1"
  "var0" -> "reciprocal node_1_dlogx"
  "_backward_log10 node_1" -> "elemwise_mul node_1_d2ydx2_mid"
  "reciprocal node_1_dlogx" -> "elemwise_mul node_1_d2ydx2_mid"
  "elemwise_mul node_1_d2ydx2_mid" -> "negative node_1_d2ydx2"
  "negative node_1_d2ydx2" -> "elemwise_mul node_1_backward_grad_grad_inp"
}
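
(For reference, the dumps above are plain Graphviz dot; assuming one is saved to a file such as backward.dot, it can be rendered with, for example, dot -Tpng backward.dot -o backward.png.)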

@apeforest (Contributor) commented:

@larroy Thanks for this contribution. The dumped graph does help developers understand more details about the computation graph in the forward + backward passes.

However, I also agree with @anirudh2290 that there is a lot of overlapping and orthogonal work in this PR compared with the existing MXNet/NNVM code base. Can you keep this utility and graph class in a different repo instead of merging it into MXNet? I am afraid that having different utilities doing highly similar work may cause more confusion for users. Please let me know if this makes sense.

@larroy (Contributor Author) commented Jul 1, 2019

@apeforest How can we keep it in a different repo if we have to call it inside backward? You mean a private branch?

There's indeed some overlap, but it's simpler to have a C++-only component that dumps the graph for debugging instead of going through non-specific JSON and having to introduce Python. Also, that overlap is with NNVM, which is an external repo.

I would vote to have this merged, which I thought was the original intention, and @anirudh2290 is not blocking it.

The PR is ready except for a minor CI issue with amalgamation that needs fixing. If it's not going to be merged I will abandon it.

@larroy changed the title from "[WIP] Graph dumper" to "Graph dumper" Jul 5, 2019
@ptrendx (Member) commented Jul 9, 2019

I also do not understand why you need a whole new Graph class for the purpose of serializing to dot. nnvm::Graph and nnvm::IndexedGraph should give you everything you need for that, no?

@larroy (Contributor Author) commented Jul 9, 2019

@ptrendx I didn't see code that dumps directly to dot, only to JSON as discussed above. If you guys have such a problem with introducing a graph class which has unit tests and all, then don't merge this, no problem. I implemented it like this and it is useful for me. 🤷🏻‍♂️

@ptrendx (Member) commented Jul 9, 2019

@larroy There is no need to get so defensive about this, I (and I assume others as well) just want to make your contribution better as a result of the review process.
Going back to this graph class - the problem I have with introducing it is that I do not see any value added by it, as (I think) all of the functionality needed is already there in nnvm::Graph and IndexedGraph, while it makes the codebase less readable and maintainable. Do you have an example of functionality that is required but is not covered by the existing classes?

@ptrendx (Member) commented Jul 9, 2019

Also, to clarify - I do not think that reusing JSON dump and other utilities is a good approach to this. My concern is only about the new graph class itself instead of using existing tools. Removing this class would make this PR 400 lines smaller.

@larroy (Contributor Author) commented Jul 9, 2019

@ptrendx I'm not getting defensive, that was not my intention; let's not read between the lines. Feedback should be concrete and actionable for efficient use of everybody's time. That's why I indicated that if the feedback is unspecific and not clearly positive about this change, I'm fine with closing the PR in the interest of everyone's time instead of letting it rot here or having a long discussion.

I think there are many ways to implement something. I spent some effort implementing this to help us introspect backward graphs; you could probably implement it in different ways, but this is the way I did it, and I think it is clean, concise, generic, and has C++ unit tests.

I think having a generic data structure like the one I introduced in this PR is better than using a non-generic structure in which implementation concerns are mixed into the data structure over and over in the implementation itself. This is similar to, but more concerning than, the choice between implementing a linked list by putting the next pointer inside the object versus using a generic class like std::list. So for me, having generic data structures makes the code more maintainable.

For feedback to be useful it also has to be concrete; in this case I don't think the feedback on my code change proposal is concrete enough for me to act on. Could you post an example of how you would propose to use the indexed graph to achieve the same effect? Maybe that would help keep the conversation focused. It's also a debugging feature, so I don't feel especially passionate about defending my design decisions here.

Adding this functionality was requested by @apeforest, @kshitij12345, and me while adding higher order gradients.
I spent some time adding unit tests and making it generic enough that it can be re-used for different purposes, so please understand my surprise at unspecific questions like why we need this or why introduce a small single-header data structure.

Thanks for your comments.

@ptrendx (Member) commented Jul 9, 2019

> For feedback to be useful it also has to be concrete; in this case I don't think the feedback on my code change proposal is concrete enough for me to act on. Could you post an example of how you would propose to use the indexed graph to achieve the same effect? Maybe that would help keep the conversation focused. It's also a debugging feature, so I don't feel especially passionate about defending my design decisions here.

Sure, so I would do this serialization more or less like this pseudo-code:

// g - the nnvm::Graph that is being serialized
idx = g.indexed_graph();
for (i = 0; i < idx.num_nodes(); ++i) {
  const auto& node = idx[i];
  for (const auto& input : node.inputs) {
    // input.source is the src node
    // node is the dst node
    dump "src -> dst" to dot
  }
}
for (const auto& output : idx.outputs) {
  // output.source is the src node
  // "Output N" is the dst node
  dump "src -> dst" to dot
}
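
For concreteness, a hedged sketch of how that pseudo-code might look against the public nnvm::IndexedGraph API; GraphToDot and the "output N" labels are illustrative names, not code from this PR:

#include <nnvm/graph.h>
#include <cstdint>
#include <sstream>
#include <string>

// Sketch only: serialize an nnvm::Graph to Graphviz dot via its IndexedGraph.
std::string GraphToDot(const nnvm::Graph& g) {
  const nnvm::IndexedGraph& idx = g.indexed_graph();
  std::ostringstream os;
  os << "digraph G {\n";
  for (uint32_t nid = 0; nid < idx.num_nodes(); ++nid) {
    const auto& node = idx[nid];
    const std::string& dst = node.source->attrs.name;
    for (const auto& input : node.inputs) {
      // input.node_id is the id of the source node of this edge
      os << "  \"" << idx[input.node_id].source->attrs.name
         << "\" -> \"" << dst << "\"\n";
    }
  }
  uint32_t out_num = 0;
  for (const auto& output : idx.outputs()) {
    os << "  \"" << idx[output.node_id].source->attrs.name
       << "\" -> \"output " << out_num++ << "\"\n";
  }
  os << "}\n";
  return os.str();
}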

@larroy (Contributor Author) commented Jul 10, 2019

@ptrendx Thanks
I will try to use your suggestion in the next iteration.

@szha (Member) left a comment:

Additional data structures just for the purpose of serialization are over-engineering.

@larroy (Contributor Author) commented Jul 26, 2019

@anirudh2290 @ptrendx @szha please review again; I removed the over-engineered code.

@@ -353,6 +356,11 @@ std::vector<NDArray*> Imperative::Backward(
<< "There are no inputs in computation graph that require gradients.";
}

if (backward_graph_dump_enabled_) {
std::cout << "Forward graph: " << std::endl;
Contributor:

Can we leave this to the developer? I think we can just have a function DebugGraph() and not put all the std::cout logic inline in this function.

Contributor Author:

ok
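
A minimal sketch of the refactor being asked for here, so Backward() stays free of stream logic. DebugGraph and the graph argument names are hypothetical, and GraphToDot refers to the illustrative serializer sketched earlier in this thread, not the PR's actual API:

#include <iostream>
#include <nnvm/graph.h>

// Sketch only: keep all the printing in one helper so the call site in
// Imperative::Backward() reduces to a single guarded call.
void DebugGraph(const nnvm::Graph& fwd_graph, const nnvm::Graph& bwd_graph) {
  std::cout << "Forward graph: \n" << GraphToDot(fwd_graph);
  std::cout << "Backward graph: \n" << GraphToDot(bwd_graph);
}

// At the call site, guarded by the read-once environment flag:
//   if (backward_graph_dump_enabled_) DebugGraph(forward_graph, backward_graph);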

@apeforest (Contributor) left a comment:

Can you print a sample output of this DumpGraph function? We already have one function to print the NNVM graph: https://github.com/apache/incubator-mxnet/blob/6a8d9eb5fd4f7133c094149dc80a3a236534f223/src/common/exec_utils.h#L285

If the new graph dump function is better, can we consolidate it with the existing one? I am only concerned that having two versions of debugging functions not only increases code size but also adds confusion to developers.

@larroy (Contributor Author) commented Jul 26, 2019

@apeforest I'm confused by your latest review; have you read the comments above? There's sample output and samples from the graph above, please scroll up.

About LogMemoryPlan, I don't think the objective of that function was dumping the graph. I think it served a different purpose.

Also we spent many days debugging second order gradients and backward graphs. I'm surprised that you are against this functionality.

src/nnvm/graph_dump.h (outdated review thread, resolved)

namespace {

std::string unnamed(const std::string &s) {
Member:

The name unnamed does not really fit the purpose of this function. Also, the logic here potentially makes your graph incorrect if a person used the same name (or the same lack of a name) for multiple nodes, which is unfortunately allowed by MXNet.
What I would prefer is a function named something like DeduplicateName, where you would check whether the same name was already used for a different Node and add a #number suffix to the name (adding unnamed, or better _unnamed_, would be fine there too).

Contributor Author:

Thanks, makes sense.

Contributor Author:

Should I add a static atomic counter for that for example?

Contributor Author:

From what I tested, most of the nodes have names from operators or variables, and I added names to the head gradients, which makes the graph much better.

Member:

I don't think you need an atomic counter for this - just a map from std::string to a list of Node* would be enough to determine whether your name is unique and what number you need to add to it.
There is (unfortunately) nothing in MXNet preventing you from giving nodes or variables the same name in the same graph. I agree that head grads should have a name; I actually made a similar change in my pointwise fusion PR.

@larroy (Contributor Author) Jul 26, 2019:

Doesn't what I proposed achieve the same effect with less memory usage, just for the purpose of adding a number? I guess the counter doesn't need to be atomic anyway.

Member:

I guess I'm not sure how a counter would help you solve the issue of separate nodes that have the same name.

Contributor Author:

Ok, I think I now understand your intention. I will make the suggested changes.
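
A hedged sketch of the map-based deduplication discussed above; DeduplicateName and the #number scheme follow the suggestion, but none of this is the PR's actual code:

#include <nnvm/node.h>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch only: give each node a unique display name. The first node seen with
// a given name keeps it; later nodes reusing that name get a "#N" suffix.
// Repeated calls for the same node return the same string.
std::string DeduplicateName(
    const nnvm::Node* node,
    std::unordered_map<std::string, std::vector<const nnvm::Node*>>* seen) {
  const std::string base =
      node->attrs.name.empty() ? "_unnamed_" : node->attrs.name;
  std::vector<const nnvm::Node*>& nodes = (*seen)[base];
  for (size_t i = 0; i < nodes.size(); ++i) {
    if (nodes[i] == node)
      return i == 0 ? base : base + "#" + std::to_string(i);
  }
  nodes.push_back(node);
  const size_t i = nodes.size() - 1;
  return i == 0 ? base : base + "#" + std::to_string(i);
}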

@larroy (Contributor Author) commented Aug 16, 2019

Question: is there interest in getting this merged if I address @ptrendx's comments? The action point I have in mind is just deduplicating node names. I have limited bandwidth right now to keep iterating on this PR. Could we summarize which changes are required to get this in, or should I close the PR? Thanks.

@larroy (Contributor Author) commented Sep 8, 2019

There doesn't seem to be interest in getting this in. I will keep it as a private tool. Thanks for the reviews.

@larroy closed this Sep 8, 2019
Labels: pr-work-in-progress (PR is still work in progress)
6 participants