This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[C++] Improve inference script to support benchmark on Imagenet #15164

Merged: 12 commits merged into apache:master on Jun 26, 2019

Conversation

@wuxun-zhang (Contributor) commented on Jun 6, 2019

Description

This PR improves the C++ inference script. Users can now run benchmarks with FP32/INT8 models on ImageNet. It also brings a slight latency improvement for C++ over Python.

The table below shows the latency improvement of C++ over Python for several CNN models (numbers updated based on an AWS EC2 C5.18xlarge instance).

| Latency speedup (BS=1) | FP32 | INT8 |
|---|---|---|
| resnet18_v1 | 4.75% | 12.14% |
| resnet50_v1 | 2.65% | 5.67% |
| resnet101_v1 | 2.92% | 3.18% |
| inceptionV3 | 3.94% | 5.75% |
| mobilenet1.0 | 5.88% | 16.32% |
| mobilenetv2_1.0 | 10.40% | 16.72% |
| squeezeNet | 12.76% | 16.43% |
| resnet152 | 2.12% | 2.04% |
| inception-bn | 3.40% | 9.37% |

@pengzhao-intel @ZhennanQin @TaoLv @ciyongch Please help review. Thanks.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@wuxun-zhang wuxun-zhang requested a review from nswamy as a code owner June 6, 2019 09:56
@pengzhao-intel (Contributor):

@wuxun-zhang ping me by this ID :)

@pengzhao-intel pengzhao-intel added C++ Related to C++ Quantization Issues/Feature Requests related to Quantization labels Jun 6, 2019
@piyushghai (Contributor):

@wuxun-zhang Thanks! That's a great addition!
@leleamol Would you also be able to take a look at this?

@mxnet-label-bot Add [pr-awaiting-review, C++]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Jun 7, 2019
@pengzhao-intel (Contributor):

@wuxun-zhang could you rebase the code?

@wuxun-zhang wuxun-zhang force-pushed the c++_inference_script branch 2 times, most recently from 1b4784f to 19ced82 Compare June 12, 2019 05:48
@wuxun-zhang (Author):

Now that CI has passed, could you please take a look at this PR again? @pengzhao-intel @ZhennanQin @TaoLv @ciyongch Thanks

|[ResNet152-V2](#8)|[MXNet ModelZoo](http://data.mxnet.io/models/imagenet/resnet/152-layers/)|[Validation Dataset](http://data.mxnet.io/data/val_256_q90.rec)|76.65%|76.36%|
|[Inception-BN](#9)|[MXNet ModelZoo](http://data.mxnet.io/models/imagenet/inception-bn/)|[Validation Dataset](http://data.mxnet.io/data/val_256_q90.rec)|72.28%|72.20%|

The following performance numbers are collected by using Skylake 6148 with 20 physical cores.
Reviewer (Contributor):

Refresh the data using C5.18xlarge?

Author reply:

Thanks for your suggestion. Already done.

@pengzhao-intel (Contributor):

@szha is it OK to change the file name from inception_inference.cpp to imagenet_inference.cpp to cover more models? Any backward compatibility issues?

@pengzhao-intel (Contributor):

@anirudh2290 would you mind reviewing this PR?

*/
void Predictor::InitParameters() {
std::vector<mx_uint> data_shape;
for (index_t i=0; i < input_shape.ndim(); i++) {
Reviewer (Contributor):

code style
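
A minimal sketch of the style fix the reviewer is pointing at: spaces around `=` and pre-increment. The loop body is truncated in the diff, so copying each dimension into data_shape is an assumption, not the merged code.

```c++
std::vector<mx_uint> data_shape;
for (index_t i = 0; i < input_shape.ndim(); ++i) {
  // copy each input dimension (assumed loop body; the diff cuts off here)
  data_shape.push_back(input_shape[i]);
}
```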

// initializer to call
Xavier xavier(Xavier::uniform, Xavier::avg, 2.0f);
index_t i = 0;
for (auto& name : net.ListArguments()) {
Reviewer (Contributor):

use const if possible.

}

i = 0;
for (auto& name : net.ListAuxiliaryStates()) {
Reviewer (Contributor):

use const if possible.


// initializer to call
Xavier xavier(Xavier::uniform, Xavier::avg, 2.0f);
index_t i = 0;
Reviewer (Contributor):

Why declare i?

Author reply:

This is for indexing the inferred shape of each argument (see line 334), which the Xavier initializer needs. The same goes for i = 0 at line 339.

Reviewer (Contributor):

Quite strange usage. If so, please declare i in the for-loop header.

Author reply:

Done
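
A minimal sketch of what the requested change might look like (assuming in_shapes holds the inferred shape of each argument in ListArguments() order; the names are illustrative, not the merged code):

```c++
const std::vector<std::string> arg_names = net.ListArguments();
for (size_t i = 0; i < arg_names.size(); ++i) {
  const std::string& name = arg_names[i];
  NDArray tmp_arr(in_shapes[i], global_ctx, false);  // inferred shape for this argument
  xavier(name, &tmp_arr);                            // Xavier-initialize the parameter
  args_map[name] = tmp_arr.Copy(global_ctx);
}
```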

args_map[name] = tmp_arr.Copy(global_ctx);
}

i = 0;
Reviewer (Contributor):

Seems redundant.


// infer and create ndarrays according to the given input ndarrays.
net.InferExecutorArrays(global_ctx, &arg_arrays, &grad_arrays, &grad_reqs, &aux_arrays, args_map,
std::map<std::string, NDArray>(), std::map<std::string, OpReqType>(),
Reviewer (Contributor):

indent


double get_msec() {
struct timeval time;
gettimeofday(&time, NULL);
Reviewer (Contributor):

This is a Linux function; if you want to support this example on Windows, you need to define a Windows code path with a macro. Of course, it's not mandatory. Maybe you can add a comment for this.

Author reply:

Maybe I can use std::chrono::system_clock::now() to cover Windows and Linux.
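
A minimal sketch of such a portable timer (illustrative, not the merged code; steady_clock is swapped in for the system_clock mentioned above because it is monotonic and better suited to interval timing):

```c++
#include <chrono>

// Milliseconds since an arbitrary epoch; works on both Windows and Linux.
double get_msec() {
  auto elapsed = std::chrono::steady_clock::now().time_since_epoch();
  return std::chrono::duration<double, std::milli>(elapsed).count();
}
```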

std::map<std::string, NDArray> aux_map;
Symbol net;
Executor *executor;
Shape input_shape;
Reviewer (Contributor):

Add suffix _ to these private vars.


// infer and create ndarrays according to the given input ndarrays.
net.InferExecutorArrays(global_ctx, &arg_arrays, &grad_arrays, &grad_reqs, &aux_arrays, args_map,
std::map<std::string, NDArray>(), std::map<std::string, OpReqType>(),
Reviewer (Contributor):

indent.

net.InferExecutorArrays(global_ctx, &arg_arrays, &grad_arrays, &grad_reqs, &aux_arrays, args_map,
std::map<std::string, NDArray>(), std::map<std::string, OpReqType>(),
aux_map);
for (auto& i : grad_reqs) i = OpReqType::kNullOp;
Reviewer (Contributor):

Useless code here?

Author reply:

In the InferExecutorArrays function, if the user didn't specify an OpReqType for an argument, it is assigned a default value: kNullOp for data and label, kWriteTo for the others. However, we don't need to consider updating gradients in a forward-only pass, so we reassign kNullOp for all arguments.

return false;
}

std::vector<index_t> shape_vec {input_shape[1], input_shape[2], input_shape[3]};
Reviewer (Contributor):

What if input_shape is not 4-D? Better not to hard-code the indices here when getting the shape values from input_shape.
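
A minimal sketch of one way to drop the hard-coded 4-D assumption (assuming axis 0 is the batch dimension; illustrative, not the merged code):

```c++
// Per-sample shape: every dimension after the batch axis.
std::vector<index_t> shape_vec;
for (index_t i = 1; i < input_shape.ndim(); ++i) {
  shape_vec.push_back(input_shape[i]);
}
```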

std::vector<float> dummy_data(input_shape.Size());
std::default_random_engine generator;
std::uniform_real_distribution<float> val(0.0f, 1.0f);
for (int i = 0; i < static_cast<int>(input_shape.Size()); ++i) {
Reviewer (Contributor):

For a large input array, int might not be enough to index the array.
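
A minimal sketch of the fix (illustrative): index with size_t so inputs with more than INT_MAX elements cannot overflow the loop counter.

```c++
std::vector<float> dummy_data(input_shape.Size());
std::default_random_engine generator;
std::uniform_real_distribution<float> val(0.0f, 1.0f);
for (size_t i = 0; i < dummy_data.size(); ++i) {
  dummy_data[i] = val(generator);  // fill the batch with random values
}
```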

}
}
ms = get_msec() - ms;
num_inference_batches = (nBatch == num_inference_batches) ? num_inference_batches : nBatch;
Reviewer (Contributor):

same as num_inference_batches = nBatch ?

num_inference_batches = (nBatch == num_inference_batches) ? num_inference_batches : nBatch;

auto args_name = net.ListArguments();
std::cout << "INFO:" << "Dataset for inference: " << dataset_ << std::endl
Reviewer (Contributor):

why not use LG << here?

Author reply:

Done


dst_vec.push_back(elem);
while (*p_next) {
if (!bFloat) {
Reviewer (Contributor):

indent.

int index = 1;
while (index < argc) {
if (strcmp("--symbol_file", argv[index]) == 0) {
index++;
Reviewer (Contributor):

indent.

std::string input_shape("3 224 224");
int seed = 48564309;
int shuffle_chunk_seed = 3982304;
int data_nthreads = 60;
Reviewer (Contributor):

If this is not the recommended num_threads by default, please change it to a proper one.

Author reply:

Actually, the number of threads for data decoding is determined by two parameters: one is data_nthreads and the other is the user-defined OMP_NUM_THREADS environment variable, through a call to omp_get_num_procs() (please refer to iter_image_recordio_2.cc#L146). So here I want to set a large number to make use of all cores. Also, data_nthreads = 60 aligns with the current Python implementation.
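
A hedged sketch of the capping behavior described above (the actual logic lives in iter_image_recordio_2.cc; this function is only an approximation for illustration):

```c++
#include <algorithm>
#include <omp.h>

// The requested data_nthreads is bounded by the processor count OpenMP
// reports, so a large default such as 60 effectively means "use all cores".
int effective_decode_threads(int data_nthreads) {
  return std::min(data_nthreads, omp_get_num_procs());
}
```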

# under the License.

# Downloading the data and model
mkdir -p model
Reviewer (Contributor):

Can we change this to only download the files when they don't already exist in the specified directory?

Author reply:

Of course. I used wget -nc below to avoid re-downloading the dataset and models.

# specific language governing permissions and limitations
# under the License.

# Downloading the data and model
Reviewer (Member):

add set -ex

Alternatively, The script [unit_test_inception_inference.sh](<https://github.com/apache/incubator-mxnet/blob/master/cpp-package/example/inference/unit_test_inception_inference.sh>) downloads the pre-trained **Inception** model and a test image. The users can invoke this script as follows:
For a quickly inference test, users can directly run [unit_test_imagenet_inference.sh](<https://github.com/apache/incubator-mxnet/blob/master/cpp-package/example/inference/unit_test_imagenet_inference.sh>) by using the below command. This script will automatically download the pre-trained **Inception-Bn** and **resnet50_v1_int8** model and **validation dataset** which are required for inference.
Reviewer (Member):

For a quickly inference test -> For a quick inference test

}
virtual void InitQuantizedBias(NDArray* arr) {
(*arr) = (int8_t)0;
}
Reviewer (Member):

Doesn't operator= only support float currently?

Author reply:

Thanks for your comments. Currently, operator= does not support int8. If I add support for int8, there will be ambiguity due to implicit type conversion: for example, when I try to use (*arr) = 0 to initialize an NDArray, the compiler doesn't know which operator= overload should be called. So, as a work-around, I will assign a float value to an int8 NDArray temporarily.
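
A minimal sketch of the work-around described above (illustrative; the merged code may differ): zero-initialize the quantized bias through the existing float overload instead of casting to int8_t.

```c++
virtual void InitQuantizedBias(NDArray* arr) {
  // NDArray::operator= currently accepts only float; 0.0f is exactly
  // representable in int8, so the bias array is still zeroed correctly.
  (*arr) = 0.0f;
}
```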

@pengzhao-intel (Contributor):

@anirudh2290 @ZhennanQin @ciyongch please review again if all concerns are resolved :)

@ciyongch (Contributor) left a comment:

LGTM

|[ResNet152-V2](#8)|[MXNet ModelZoo](http://data.mxnet.io/models/imagenet/resnet/152-layers/)|[Validation Dataset](http://data.mxnet.io/data/val_256_q90.rec)|76.65%|76.36%|
|[Inception-BN](#9)|[MXNet ModelZoo](http://data.mxnet.io/models/imagenet/inception-bn/)|[Validation Dataset](http://data.mxnet.io/data/val_256_q90.rec)|72.28%|72.20%|

The following performance numbers are collected by using C5.18xlarge.
Reviewer (Member):

Suggested change:
- The following performance numbers are collected by using C5.18xlarge.
+ The following performance numbers are collected by using AWS EC2 C5.18xlarge.


This example demonstrates image classification workflow with pre-trained models using MXNet C++ API. Now this script also supports inference with quantized CNN models generated by Intel® MKL-DNN (see this [quantization flow](https://github.com/apache/incubator-mxnet/blob/master/example/quantization/README.md)). By using C++ API, the latency of most models can get **2%~17%** speedup compared with current Python implementation. The below tables show accuracy and latency of several CNN models.

The following models have been tested on Linux systems. And 50000 images are used to collect the following accuracy numbers.
@TaoLv (Member) commented on Jun 21, 2019:

I would expect C++ inference to have the same accuracy as Python inference, but the data here seems different from https://github.com/apache/incubator-mxnet/tree/master/example/quantization. I guess the two tables might use different versions of MXNet or GluonCV, but it's still confusing.

Author reply:

Actually, I have tested the accuracy for C++ and Python locally; there is no difference. I think the numbers on that page may just be too old and need to be updated.


# Running inference on imagenet.
if [ "$(uname)" == "Darwin" ]; then
echo ">>> INFO: FP32 real data"
Reviewer (Member):

indentation.

echo ">>> INFO: FP32 dummy data"
DYLD_LIBRARY_PATH=${DYLD_LIBRARY_PATH}:../../../lib ./imagenet_inference --symbol_file "./model/Inception-BN-symbol.json" --batch_size 1 --num_inference_batches 500 --benchmark
else
echo ">>> INFO: FP32 real data"
Reviewer (Member):

indentation

lib_name=$(ls -a ../../../lib | grep -oE 'mkldnn' | tail -1)
if [[ -n ${lib_name} ]] && [[ 'mkldnn' =~ ${lib_name} ]]; then
echo ">>> INFO: INT8 dummy data"
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:../../../lib ./imagenet_inference --symbol_file "./model/resnet50_v1_int8-symbol.json" --batch_size 1 --num_inference_batches 500 --benchmark
Reviewer (Member):

Do we have int8 real data inference?

Author reply:

Right now I cannot find anywhere to download the INT8 params file, which is required for real-data inference. So I just added dummy-data inference here for the INT8 resnet50 model.

@wuxun-zhang wuxun-zhang requested a review from szha as a code owner June 22, 2019 08:39
@wuxun-zhang (Author):

@TaoLv I have updated the accuracy numbers in https://github.com/apache/incubator-mxnet/tree/master/example/quantization. There is no difference in Top-1 accuracy between C++ and Python. Please help review again. Thanks.

@TaoLv (Member) commented on Jun 22, 2019:

@wuxun-zhang Thank you for addressing the comments. The problem is whether we want to maintain two int8 accuracy tables in different places. @pengzhao-intel what do you think?

@pengzhao-intel (Contributor):

> @wuxun-zhang Thank you for addressing the comments. The problem is whether we want to maintain two int8 accuracy tables in different places. @pengzhao-intel what do you think?

Agree, so add a link pointing the accuracy numbers to the quantization README, and the performance data to perf.md.

## [imagenet_inference.cpp](<https://github.com/apache/incubator-mxnet/blob/master/cpp-package/example/inference/imagenet_inference.cpp>)

This example demonstrates image classification workflow with pre-trained models using MXNet C++ API. Now this script also supports inference with quantized CNN models generated by Intel® MKL-DNN (see this [quantization flow](https://github.com/apache/incubator-mxnet/blob/master/example/quantization/README.md)). By using C++ API, the latency of most models can get **2%~17%** speedup compared with current Python implementation. The below tables show accuracy and latency of several CNN models.

Reviewer (Contributor):

Don't mention the exact number (2%~17%), since it may change as the Python and C++ interfaces evolve.

@pengzhao-intel (Contributor):

After talking offline, I think we can leave the performance numbers for CLX (c5.12xlarge/24xlarge) on this page :)
Please rebase the code.

@pengzhao-intel (Contributor) left a comment:

LGTM

@pengzhao-intel (Contributor) commented on Jun 26, 2019:

Seems all checks have passed but the job is hanging now.
Maybe retrigger CI?

@pengzhao-intel (Contributor):

Thanks for the nice PR; merging now.

@pengzhao-intel pengzhao-intel merged commit 009907a into apache:master Jun 26, 2019
Labels
C++ Related to C++ pr-awaiting-review PR is waiting for code review Quantization Issues/Feature Requests related to Quantization
8 participants