-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Conversation
Hi @larroy could you please paste a snapshot of what the dumped graph looks like? Thanks |
@apeforest I will paste the output next time I run it, right now I'm working on something else. It outputs a dot graph as in the comment here: https://github.com/apache/incubator-mxnet/pull/15285/files#diff-884210dca53fb4c4ac20ca07aa8ecfdfR82 |
@apeforest @kshitij12345 sounds familiar? |
First gradient, and second gradient:
|
@larroy Thanks for this contribution. The dumped graph does help developers understand more details about computation graph in forward + backward passes. However, I also agree with @anirudh2290 there are many overlapping and orthogonal work in this PR compared with existing MXNet/NNVM code base. Can you keep this utility and graph class in a different repo instead of merging it to MXNet. I am afraid having different utilities to do highly similar work may cause more confusion to users. Please let me know if this makes sense. |
@apeforest How can we keep it in a different repo if we have to call it inside backward? You mean a private branch? There's indeed some overlap but it's simpler to have a C++ only component that dumps the graph for debug instead of going through non-specific JSon -> having to introduce python. Also that overlap is with NNVM which is an external repo. I would vote to have this merged which I though twas the original intention and as @anirudh2290 is not blocking it. The PR is ready missing fixing a minor CI issue with amalgamation. If it's not going to be merged I will abandon it. |
I also do not understand why you need a whole new Graph class for the purpose of serializing to dot. nnvm::Graph and nnvm::IndexedGraph should give you everything you need for that, no? |
@ptrendx I didn't see code that dumps directly to dot, but to Json as discussed above. If you guys have such a problem with introducing a graph class which has unit tests and all, then don't merge this, no problems. I implemented it like this and for me is useful. 🤷🏻♂️ |
@larroy There is no need to get so defensive about this, I (and I assume others as well) just want to make your contribution better as a result of the review process. |
Also, to clarify - I do not think that reusing JSON dump and other utilities is a good approach to this. My concern is only about the new graph class itself instead of using existing tools. Removing this class would make this PR 400 lines smaller. |
@ptrendx not getting defensive, that was not my intention, let's not read between the lines. Feedback should be concrete an actionable for efficient use of everybody's time, that's why I indicated that if the feedback is unspecific and not clearly positive with this change I'm fine with closing the PR in interest of everyones time instead of letting it rot here or having a long discussion. I think there's many ways to implement something. I spent some effort implementing this to help us introspecting backward graphs, probably you could implement it in different ways, but this is the way I did it and I think is clean, concise, generic and has C++ unit tests. I think having a generic data structure like the one I introduced in this PR is better than using a non-generic structure into which implementation concerns are mixed with the data structure over and over in the implementation itself. This is in a similar but more concerning way to having a linked list by having the next pointer inside the object or as a generic class like std::list. So for me having generic data structures make the code more maintainable. For feedback to be useful it has also to be concrete, in this case I don't see the feedback to my code change proposal is concrete enough for me to take actions. Could you post an example of how would you propose to use indexed graph to do the same effect? Maybe that would help keep the conversation focused. It's also a debugging feature, so I don't feel specially passionate about defending my design decisions here. Adding this functionality was requested by @apeforest @kshitij12345 and I while adding higher oder gradient. Thanks for your comments. |
Sure, so I would do this serialization more or less like this pseudo-code:
|
@ptrendx Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
additional data structures just for the purpose of serialization is over-engineering.
@anirudh2290 @ptrendx @szha please review again, I removed the uber engineered code. |
@@ -353,6 +356,11 @@ std::vector<NDArray*> Imperative::Backward( | |||
<< "There are no inputs in computation graph that require gradients."; | |||
} | |||
|
|||
if (backward_graph_dump_enabled_) { | |||
std::cout << "Forward graph: " << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we leave this to the developer? I think we can just have a function DebugGraph() and not putting all the std::cout logic in a function block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you print a sample output of this DumpGraph function? We've already have one function to print the NNVM graph: https://github.com/apache/incubator-mxnet/blob/6a8d9eb5fd4f7133c094149dc80a3a236534f223/src/common/exec_utils.h#L285?
If the new graph dump function is better, can we consolidate it with the existing one? I am only concerned that having two versions of debugging functions not only increases code size but also adds confusion to developers.
@apeforest I'm confused by your latest review, have you read the comments above? there's sample output, and samples from the graph, please scroll. About LogMemoryPlan, I don't think the objective of that function was dumping the graph. I think it served a different purpose. Also we spent many days debugging second order gradients and backward graphs. I'm surprised that you are against this functionality. |
src/nnvm/graph_dump.cc
Outdated
|
||
namespace { | ||
|
||
std::string unnamed(const std::string &s) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnamed
name does not really fit to what the purpose of this function is. Also, the logic here makes your graph potentially incorrect if a person used the same name (or the same lack of name) for multiple nodes (which is unfortunately allowed by MXNet).
What I would prefer is a function named something like DeduplicateName
where you would check whether the same name was not used before for different Node and add #number
to the name (adding unnamed
or better _unnamed_
would be fine there too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add a static atomic counter for that for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I tested most of the nodes have names from operators or variables, and head gradients I added names which makes the graph much better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need an atomic counter for this - just a map that maps the std::string
to list of Node*
would be enough to differentiate whether your name is unique or not and what number you need to add to it.
There is nothing (unfortunately) in MXNet preventing you from naming your nodes or variables the same in the same graph. I agree that head grads should have a name, I actually did similar change in my pointwise fusion PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't it do the same effect with less memory usage to what I proposed just for the purpose of adding a number? I guess the counter doesn't need to be atomic anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm not sure how a counter would help you solve the issue of separate nodes that have the same name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I think I got now your intention. I will make the suggested changes.
Question, is there interest to get this merged if I address @ptrendx comments? The action points that I have in mind are just deduplicating node names. I have limited bandwidth right now to keep iterating in this pr. Could we summarize which changes are required to get this in or should I close the PR? Thanks. |
There doesn't seem to be interest to get this in. I will keep it as a private tool. Thanks for the reviews. |
Description
Utility to dump the computational graph, specially during backward for human consumption.
Dumps the graph to a dot file that can be rendered.
This is intended for developers, although if there's demand it can be exposed easily from python to get the graph from any variable if is_recording is true. It is done from C++ directly, no Python is required.
Added an environment variable which triggers the logic.
It should have no performance impact when it's not enabled as there's only one additional boolean check during backward, it access the environment only once.
@apeforest @anirudh2290
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.