
Multi-threaded inference broken with MKLDNN #15576

Closed

arcadiaphy opened this issue Jul 17, 2019 · 11 comments

@arcadiaphy (Member) commented Jul 17, 2019

Description

I want to do multi-threaded inference with shared model parameters, so I'm testing the MXPredCreateMultiThread API in the cpp example. I've found that the example is broken with more than one thread on an MKLDNN build: the output of model inference is not deterministic. If I run it with an openblas build, everything is normal.
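Roughly, the modified example drives the API like this (a sketch only: the signatures follow c_predict_api.h in 1.5.x, and the input geometry plus the json/params/image buffers are placeholder assumptions, so verify against the header):

#include <mxnet/c_predict_api.h>
#include <functional>
#include <thread>
#include <vector>

static void predict_once(PredictorHandle pred, const std::vector<mx_float>& image) {
  MXPredSetInput(pred, "data", image.data(), image.size());
  MXPredForward(pred);
  mx_uint* shape = nullptr;
  mx_uint ndim = 0;
  MXPredGetOutputShape(pred, 0, &shape, &ndim);
  size_t size = 1;
  for (mx_uint i = 0; i < ndim; ++i) size *= shape[i];
  std::vector<mx_float> out(size);
  MXPredGetOutput(pred, 0, out.data(), size);  // the example prints the first 10 values
}

static void run(const char* json, const void* params, int param_len,
                const std::vector<mx_float>& image, int num_threads) {
  const char* input_key[1] = {"data"};
  const mx_uint indptr[2] = {0, 4};
  const mx_uint shape[4] = {1, 3, 224, 224};  // assumed input shape
  // One handle per thread; all handles share the loaded model parameters.
  std::vector<PredictorHandle> preds(num_threads);
  MXPredCreateMultiThread(json, params, param_len, 1 /* cpu */, 0,
                          1, input_key, indptr, shape,
                          num_threads, preds.data());
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t)
    workers.emplace_back(predict_once, preds[t], std::cref(image));
  for (auto& w : workers) w.join();
  for (auto& p : preds) MXPredFree(p);
}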

Environment info (Required)

----------Python Info----------
('Version      :', '2.7.16')
('Compiler     :', 'GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.46.4)')
('Build        :', ('default', 'Jun 19 2019 07:40:37'))
('Arch         :', ('64bit', ''))
------------Pip Info-----------
('Version      :', '19.1.1')
('Directory    :', '/usr/local/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version      :', '1.5.0')
('Directory    :', '/Users/arcadia/work/mxnet-1.5/python/mxnet')
Commit hash file "/Users/arcadia/work/mxnet-1.5/python/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
('Library      :', ['/Users/arcadia/work/mxnet-1.5/python/mxnet/../../lib/libmxnet.so'])
Build features:
✖ CUDA
✖ CUDNN
✖ NCCL
✖ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✖ OPENMP
✖ SSE
✔ F16C
✖ JEMALLOC
✖ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✖ LAPACK
✔ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✖ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✖ SIGNAL_HANDLER
✖ DEBUG
----------System Info----------
('Platform     :', 'Darwin-18.2.0-x86_64-i386-64bit')
('system       :', 'Darwin')
('node         :', 'MacBookPro')
('release      :', '18.2.0')
('version      :', 'Darwin Kernel Version 18.2.0: Fri Oct  5 19:41:49 PDT 2018; root:xnu-4903.221.2~2/RELEASE_X86_64')

Build info (Required if built from source)

MXNet commit hash:
latest commit
4d07d78

Minimum reproducible example

I've slightly modified the cpp example to print the first 10 numbers of the output ndarray: download the code and replace the example/image-classification/predict-cpp folder with it.

To run the example, the changes in PR #15574 are needed to patch the mxnet code.

MXNET_ENGINE_TYPE=NaiveEngine ./image-classification-predict path/to/image [thread number]

The output

openblas 2 threads:

./model/mobilenetv2_0.25-symbol.json ... 146314 bytes
./model/mobilenetv2_0.25-0000.params ... 6135676 bytes
[02:11:17] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
3.16592 -2.01711 -3.29867 -10.9845 -1.56596 2.30844 -4.57165 -0.211675 0.300732 -4.35327
3.16592 -2.01711 -3.29867 -10.9845 -1.56596 2.30844 -4.57165 -0.211675 0.300732 -4.35327
run successfully

mkldnn 2 threads:

./model/mobilenetv2_0.25-symbol.json ... 146314 bytes
./model/mobilenetv2_0.25-0000.params ... 6135676 bytes
[02:12:40] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
-4.33902 -8.3967 -3.97791 -2.16607 -7.36041 -7.65456 -6.08846 -3.10948 -6.65086 4.17571
-0.0463818 0.440686 -0.907998 -7.17286 4.8312 -3.03199 4.31585 0.0692099 -0.844623 -8.1773
run successfully

The results change randomly with every execution.

@mxnet-label-bot (Contributor)

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@arcadiaphy (Member Author)

@pengzhao-intel

@marcoabreu (Contributor)

MXNet does not support multithreading at the interface level, not even lock-based access. The only way to use MXNet in a multi-threaded fashion is a job queue consumed by a single sticky thread.
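A minimal sketch of that pattern, assuming nothing beyond the C++ standard library: callers enqueue jobs from any thread, and the single sticky worker is the only thread that ever touches MXNet.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class StickyWorker {
 public:
  StickyWorker() : worker_([this] { Loop(); }) {}
  ~StickyWorker() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();  // drains remaining jobs, then exits
  }
  // Any thread may call this; the job runs on the single worker thread.
  void Submit(std::function<void()> job) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      jobs_.push(std::move(job));
    }
    cv_.notify_one();
  }

 private:
  void Loop() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
        if (done_ && jobs_.empty()) return;
        job = std::move(jobs_.front());
        jobs_.pop();
      }
      job();  // e.g. set input, forward, read output on one predictor
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> jobs_;
  bool done_ = false;
  std::thread worker_;  // declared last so it starts after the other members
};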

@pengzhao-intel (Contributor)

@wuxun-zhang @ZhennanQin please help take a look at this issue, thanks.

@arcadiaphy (Member Author) commented Jul 18, 2019

@pengzhao-intel @wuxun-zhang @ZhennanQin

The difference between MXPredCreateMultiThread and creating executors separately is that the former shares model parameters. I've verified that if weight sharing is disabled by moving the arg_arrays and aux_arrays creation into the per-thread loop, the result is correct even with MKLDNN:

  // One predictor per thread; each thread gets its own copy of the parameters.
  for (int i = 0; i < num_threads; i++) {
    std::vector<NDArray> arg_arrays, aux_arrays;
    // Copy the argument parameters into fresh NDArrays instead of sharing them.
    for (size_t j = 0; j < arg_shapes.size(); ++j) {
      NDArray nd = NDArray(arg_shapes[j], ctx);
      if (arg_params.count(arg_names[j]) != 0) {
        CopyFromTo(arg_params[arg_names[j]], &nd);
      }
      arg_arrays.push_back(nd);
    }
    // Same for the auxiliary states.
    for (size_t j = 0; j < aux_shapes.size(); ++j) {
      NDArray nd = NDArray(aux_shapes[j], ctx);
      if (aux_params.count(aux_names[j]) != 0) {
        CopyFromTo(aux_params[aux_names[j]], &nd);
      }
      aux_arrays.push_back(nd);
    }

    std::unique_ptr<MXAPIPredictor> ret(new MXAPIPredictor());
    ret->sym = sym;
    ret->ctx = ctx;
    ret->key2arg = key2arg;
    ret->arg_arrays = arg_arrays;
    ret->aux_arrays = aux_arrays;
    ret->out_shapes = out_shapes;

    if (!lazy) {
      // Bind an inference-only executor (null gradients) over the copies.
      std::map<std::string, Context> ctx_map;
      std::vector<NDArray> grad_store(arg_arrays.size());
      std::vector<OpReqType> grad_req(arg_arrays.size(), kNullOp);
      ret->exec.reset(Executor::Bind(sym, ctx, ctx_map,
                                     arg_arrays,
                                     grad_store, grad_req,
                                     aux_arrays));
      ret->out_arrays = ret->exec->outputs();
    }
    out[i] = ret.release();
  }

It's very strange: I thought the model parameters were read-only, so why does sharing them break the MKLDNN path?

@ZhennanQin (Contributor)

MKLDNN doesn't support multi-threading before v1.0 because it shares internal scratch memory across all operators, so running two MKLDNN operators simultaneously in the same process isn't guaranteed to produce correct results. For now we suggest switching to multiple instances with shared memory for multi-threading purposes.
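A toy picture of that limitation (illustrative only, not the actual MKL-DNN internals): every operator stages its intermediates in one process-wide scratch buffer, so two operators running at once trample each other's workspace.

#include <thread>
#include <vector>

static std::vector<float> scratch(4);  // shared by all "operators"

float run_op(float x) {
  for (auto& v : scratch) v = x;    // stage intermediates in the shared scratch
  float sum = 0;
  for (auto v : scratch) sum += v;  // another op may have overwritten it by now
  return sum;                       // correct only when run alone: 4 * x
}

int main() {
  float r1 = 0, r2 = 0;
  std::thread t1([&] { for (int i = 0; i < 10000; ++i) r1 = run_op(1.0f); });
  std::thread t2([&] { for (int i = 0; i < 10000; ++i) r2 = run_op(2.0f); });
  t1.join(); t2.join();             // r1 and r2 can come out wrong
}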

@arcadiaphy (Member Author) commented Jul 18, 2019

@ZhennanQin I think MKLDNN itself is OK with multi-threading; the likely reason is that calling MKLDNN-related methods like MKLDNNDataReorder and GetMKLDNNData on an ndarray simultaneously messes up the internal mkl_mem_. Also, I'm not sure whether mkl_mem_ can be shared across multiple computations.
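A toy illustration of the race class I mean (not MXNet code): two threads read through an object whose accessor lazily rewrites a cached buffer in place, the way GetMKLDNNData/MKLDNNDataReorder rewrite mkl_mem_, so each reader can observe the other's half-finished rewrite.

#include <thread>
#include <vector>

struct LazyTensor {
  std::vector<float> data{1, 2, 3, 4};
  bool reordered = false;
  // Unsynchronized in-place "reorder": fine single-threaded, racy otherwise.
  const std::vector<float>& Get(bool want_reordered) {
    if (want_reordered != reordered) {
      for (auto& v : data) v = -v;  // rewrites the shared storage in place
      reordered = want_reordered;
    }
    return data;
  }
};

int main() {
  LazyTensor shared;  // stands in for a shared model parameter
  std::thread a([&] { for (int i = 0; i < 1000; ++i) shared.Get(true); });
  std::thread b([&] { for (int i = 0; i < 1000; ++i) shared.Get(false); });
  a.join(); b.join();  // the final contents are nondeterministic
}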

@arcadiaphy (Member Author)

Similar race condition on ndarray: #9862

@arcadiaphy (Member Author) commented Jul 18, 2019

Looks like MKLDNNDataReorder and Reorder2Default alter the underlying data of the ndarray, so model parameters are not read-only after all. I've tried adding a mutex lock around MKLDNNDataReorder, and the problem vanishes.

@zheng-da, can we remove the memcpy in these two methods and make model weights truly read-only? That would fix the broken parallel inference example.
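The shape of that workaround, shown on the toy accessor from my earlier comment rather than the real NDArray::MKLDNNDataReorder (whose exact signature differs between versions): serialize the in-place reorder so concurrent readers can neither observe nor corrupt a half-rewritten buffer.

#include <mutex>
#include <vector>

struct LockedLazyTensor {
  std::vector<float> data{1, 2, 3, 4};
  bool reordered = false;
  std::mutex mu;
  const std::vector<float>& Get(bool want_reordered) {
    std::lock_guard<std::mutex> lock(mu);  // the added lock
    if (want_reordered != reordered) {
      for (auto& v : data) v = -v;
      reordered = want_reordered;
    }
    return data;
  }
};

This only serializes the mutation; removing the in-place rewrite, as asked above, would make the weights truly read-only and need no lock at all.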

@pengzhao-intel (Contributor)

@arcadiaphy did your local solution work for this issue?

@arcadiaphy (Member Author)

@pengzhao-intel My local solution works fine, but it's not suitable for an official PR.
