Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Utility to help developers debug operators: Tensor Inspector #15490

Merged
merged 29 commits into from
Jul 17, 2019
Merged

Utility to help developers debug operators: Tensor Inspector #15490

merged 29 commits into from
Jul 17, 2019

Conversation

Zha0q1
Copy link
Contributor

@Zha0q1 Zha0q1 commented Jul 8, 2019

Created a separate PR for a new tutorial on usage.
#15517

Background

When developing new operators, developers need to deal with Tensors, TBlobs, and NDArrays extensively (for simplicity, we will simply refer them as “tensors”). However, there is no existing tool that helps developers inspect the value of the tensors within an operator, so it’s hard to locate the spot where the computation goes wrong. With that said, we can create a new utility that supports inspecting, dumping and verifying tensor values to alleviate the developer’s need to use hacky prints every time. This new utility will support all the three data types (Tensor, TBlob, NDArray), as well as both CPU and GPU tensors.

Some related GitHub issues: #15198

Methodology

First off, to support all three data types, we will need to first convert them to TBlob, so that we can base our new functionalists on one single class and avoid code repetition. To access the value of a tensor/TBlob object, the easiest way is to add some function, such as to_string(), to the TBlob class. Then, we could use shape_ to dereference dptr and extract the actual data from the tensor in a structured format. However, to preserve the integrity of the TBlob class, we would want to avoid making direct changes to tensor_blob.h. Instead, considering that all the member variables of TBlob are all public, we could create a wrapper around TensorBlob, which we would call TensorInspector, and we could then add our new functions to this new wrapper class. We could put this new file tensor_inspector.h in src/common along with a bunch of other utilities.

class TensorInspector {
  private:
    const TBlob tb_;
  public:
    // our new functions
}

Besides tensor inspectors which are one-to-one with tensors, I also propose to have a singleton class called InspectorManage which is responsible for data sharing and maintaining locking between the individual TensorInspector objects. This way, we could control some global behaviors within each of the objects.

The Value of Creating a New Utility

It’s important to point out that there is an existing utility that helps print out the value of a tensor, tests/cpp/include/test_util.h, (https://github.com/apache/incubator-mxnet/blob/master/tests/cpp/include/test_util.h#L472). However, this new utility under discussion aims to start from the use cases of the users, who in our case are developers, and has more functionalities than the existing utility. Note: this new utility will use CAccessAsCPU from test_util.h for data copying from GPU memory to CPU memory.

Proposed Funtionalities/APIs

Create a TensorInspector Object from Tensor, TBlob, and NDarray Objects

We can create a TensorInspector object by passing in a Tensor, TBlob, or NDArray object.

// Convolution-inl.h, Forward()
Tensor<xpu, 3, DType> weight_3d = in_data[conv::kWeight].get_with_shape<xpu, 3, DType>(
        Shape3(group_, M, K), s);
TensorInspector ti(weight_3d, ctx.run_ctx);

// Similarly, we could do:
TensorInspector ti(/*TBlob obj*/, ctx.run_ctx);
// OR:
TensorInspector ti(/*NDArray obj*/, ctx.run_ctx);

Print Tensor Value (static)

We could use std::string TensorInspector::to_string() or void TensorInspector::print_string() to get a formatted string representation of the entire tensor. We could see the tensor value in a structured way, the data type, as well as the tensor shape.

TensorInspector ti(/*Tensor, Tblob, or Ndarray obj*/, /*RunContext obj*/);
ti.to_string(); // print tensor

Screen Shot 2019-07-08 at 4 07 16 PM

Interactively Print Tensor Value (dynamic)

During debugging, situations might occur that ahead of time (at compilation time), we do not know which part of a tensor we need to inspect. Also, sometimes we might want to pause the operator control flow to “zoom into” an erroneous part of a tensor multiple times until we are satisfied to move on. In this regard, we could use void TensorInspector::interactive_print(std::string tag = "") like a break point to dynamically specify the part of the tensor to inspect and whether we want to skip the break point. tag is an optional parameter where we can give the break points names so that we can distinguish them. A “Visit” counter would tell us how many times have we stepped into a break point with this tag name.

    TensorInspector ti(weight_3d);
    ti.interactive_print("My Convolution");

Screen Shot 2019-07-10 at 5 29 07 PM

As we can see from the screenshot, we would be in a loop that asks us for command. We would be prompted with the Tensor info, <f Tensor 20x1x5x5>, and we could enter coordinates such as “0,0” to print only a part of the tensor. We could also type “e” for the entire tensor, “b” to break from this loop, and “s” to break from this loop and skip all the future break points.

Check Tensor Value

Sometimes, developers might want to check if the tensor contains unexpected values which could be negative values, Nans, or others. We could have a templated api std::vector<std::vector<int>>TensorInspector::validate_value(const ValueChecker& vc, bool interactive = false, std::string tag = "") where vc is a bool value checker function that users can define, interactive is whether you want to check the values on the go (similar to interactive print, refer to the first screenshot below), and tag is the name you want to give to the call to identify it. The return is a vector of coordinates (represented with vector) at which checker evaluates to true.

Screen Shot 2019-07-10 at 5 34 20 PM

To the developer's convenience, we would also pre-define a bunch of value checkers. Refer to the Enum below:

Screen Shot 2019-07-10 at 5 24 41 PM

Developers could skip wiring their own checker by using this API: std::vector<std::vector<int>>TensorInspector::validate_value(ValueChecker vc, bool interactive = false, std::string tag = "")

This API can also be used creatively to find coordinates where the values equal to some required value or fall in some range.

Dump Tensor Value

Sometimes, we would want to dump the tensor to a file in binary mode, so that we could use, for example, a python script to further analyze the numbers. We could use void TensorInspector::dump_value(std::string tag) to achieve this. We would want to use .npy format for its compatibility with Numpy and its great load efficiency. A benchmark can be found here: https://github.com/mverleg/array_storage_benchmark.

Screen Shot 2019-07-10 at 5 45 44 PM

In this screenshot, we can compare side by side the regular print (left) and the value dump (right). Even without formatting the dumped value, we can see it has a better accuracy.

Screen Shot 2019-07-10 at 5 14 09 PM

Screen Shot 2019-07-10 at 5 14 43 PM

As we can see in the screenshot, the output files are named "{tag}_{visit}.npy", where tag is the name that we give to the call, and visit is the visit count.

To load the file, we could just do:

import numpy as np
a = np.load('abc_1.npy')
print(a)

Screen Shot 2019-07-10 at 5 17 29 PM

Notice: in interactive print, we could also do value dump with command "d":

Screen Shot 2019-07-10 at 5 29 07 PM

Usage

To test or use this new utility in your operator, just include the .h file and construct a TensorInspector object, say "ti", as described above. Then, you can use all functionalities listed above by writing, for example, "ti.print_string()" or "ti.interactive_print()". Finally, just run any python script that uses the operator that you modified.

Screen Shot 2019-07-11 at 4 57 41 PM

TODOs

  1. Support sparse NDArrays if necessary
  2. Check corner cases e.g. if Tblob's shape is unknown

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@sandeep-krishnamurthy
Copy link
Contributor

Awesome! thank you for the nice utility. 👍 for clear and detailed PR description.

@Zha0q1 Zha0q1 changed the title [WIP] Utility to help developers debug operators: Tensor inspector [WIP] Utility to help developers debug operators: Tensor Inspector Jul 9, 2019
@Zha0q1
Copy link
Contributor Author

Zha0q1 commented Jul 9, 2019

I could probably rewrite build_checker() (https://github.com/apache/incubator-mxnet/pull/15490/files#diff-9e7d5c2420ecc900c4a85a3c35d91bffR446) with macro to suppress warnings when DType is not floating point. But it's not causing any issues because in such cases the control flow wouldn't go by the undefined operation. Will change if have time.

@Zha0q1
Copy link
Contributor Author

Zha0q1 commented Jul 11, 2019

I could probably rewrite build_checker() (https://github.com/apache/incubator-mxnet/pull/15490/files#diff-9e7d5c2420ecc900c4a85a3c35d91bffR446) with macro to suppress warnings when DType is not floating point. But it's not causing any issues because in such cases the control flow wouldn't go by the undefined operation. Will change if have time.

After experimenting for several hours today, I think there is not a good way to breach at compile time to suppress the warnings. Macro apparently did not work. I also tried simulating C++17 style constexpr if with templated structs, which is an idea proposed by https://baptiste-wicht.com/posts/2015/07/simulate-static_if-with-c11c14.html, but that method was compiler specific and did not work either.

Having no warnings when compiling is nice, but given that this is a developer's tool and that behavior-wise the code has no issue, I think I will leave the warnings.

@Zha0q1 Zha0q1 changed the title [WIP] Utility to help developers debug operators: Tensor Inspector Utility to help developers debug operators: Tensor Inspector Jul 11, 2019
@Zha0q1
Copy link
Contributor Author

Zha0q1 commented Jul 11, 2019

@apeforest @anirudh2290 @access2rohit @larroy @sandeep-krishnamurthy ,
Hi guys, I completed this PR and it's ready for review. The only thing that I do not have yet is support for CSR NDArrays because I want to ask your opinion. I think I probably do not need to generate the whole matrix. Having something like this in my interactive print should be good enough? (this is in the existing test::print)
Screen Shot 2019-07-10 at 6 49 08 PM

@access2rohit
Copy link
Contributor

Can you split the PR into 2 separate PRs. One for CPU and one for GPU ?

@Zha0q1
Copy link
Contributor Author

Zha0q1 commented Jul 11, 2019

Just realized there were more than one Tensor structs. Added mshadow:: before Tensor so compiler would not get confused. a41720d#diff-9e7d5c2420ecc900c4a85a3c35d91bffR623

Copy link
Contributor

@rondogency rondogency left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

src/common/tensor_inspector.h Outdated Show resolved Hide resolved
@Zha0q1 Zha0q1 mentioned this pull request Jul 12, 2019
7 tasks
@Zha0q1
Copy link
Contributor Author

Zha0q1 commented Jul 12, 2019

Created a separate PR for a new tutorial on usage.
#15517

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I see this being very useful for operator developers.

@access2rohit
Copy link
Contributor

LGTM just address my last comment and no need to split this PR into 2 now since you have already addressed most comments

@access2rohit
Copy link
Contributor

retrigger the build ... there is some julia error. Probably not because of your code changes. Also add link to the test cases that you covered

@Zha0q1
Copy link
Contributor Author

Zha0q1 commented Jul 16, 2019

retrigger the build ... there is some julia error. Probably not because of your code changes. Also add link to the test cases that you covered

cases covered are described in my tutorial at the end. Link: #15517

@Zha0q1
Copy link
Contributor Author

Zha0q1 commented Jul 17, 2019

@sandeep-krishnamurthy Looks ready!

if (pos->size() > static_cast<size_t>(dimension)) {
return false;
}
for (unsigned i = 0; i < pos->size(); ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace unsigned with size_t

const int sub_dim = dimension - pos.size();
sub_shape->resize(sub_dim);
index_t multiple = 1;
for (int i = pos.size(), j = 0; i < dimension; ++i, ++j) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change int to index_t or size_t

}
index_t sum = 0;
index_t m = 1;
for (int i = pos.size() - 1; i >= 0; --i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change int to index_t

@apeforest
Copy link
Contributor

Three more changes. Otherwise good to go. Thanks for your contribution.

@Zha0q1
Copy link
Contributor Author

Zha0q1 commented Jul 17, 2019

@sandeep-krishnamurthy Looks ready!

After a few more changes requested by Lin

Copy link
Contributor

@apeforest apeforest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@apeforest apeforest merged commit 35b5480 into apache:master Jul 17, 2019
struct InspectorManager {
static InspectorManager* get() {
static std::mutex mtx;
static std::unique_ptr<InspectorManager> im = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add an utility function encapsulating this idiom? this is being used in several places, and would be good to use this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.