
Multi-threaded inference broken with MKLDNN #15576

Closed

arcadiaphy opened this issue Jul 17, 2019 · 11 comments

@arcadiaphy (Member) commented Jul 17, 2019

Description

I want to do multi-threaded inference with shared model parameters, so I'm testing the MXPredCreateMultiThread API in the cpp example. I've found that the example is broken with more than one thread on an MKLDNN build: the output of model inference is not deterministic. If I run it with an openblas build, everything is normal.
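Roughly, the modified example drives the API like this (a sketch only: the signatures follow c_predict_api.h in 1.5.x, and the input geometry plus the json/params/image buffers are placeholder assumptions, so verify against the header):

#include <mxnet/c_predict_api.h>
#include <functional>
#include <thread>
#include <vector>

static void predict_once(PredictorHandle pred, const std::vector<mx_float>& image) {
  MXPredSetInput(pred, "data", image.data(), image.size());
  MXPredForward(pred);
  mx_uint* shape = nullptr;
  mx_uint ndim = 0;
  MXPredGetOutputShape(pred, 0, &shape, &ndim);
  size_t size = 1;
  for (mx_uint i = 0; i < ndim; ++i) size *= shape[i];
  std::vector<mx_float> out(size);
  MXPredGetOutput(pred, 0, out.data(), size);  // the example prints the first 10 values
}

static void run(const char* json, const void* params, int param_len,
                const std::vector<mx_float>& image, int num_threads) {
  const char* input_key[1] = {"data"};
  const mx_uint indptr[2] = {0, 4};
  const mx_uint shape[4] = {1, 3, 224, 224};  // assumed input shape
  // One handle per thread; all handles share the loaded model parameters.
  std::vector<PredictorHandle> preds(num_threads);
  MXPredCreateMultiThread(json, params, param_len, 1 /* cpu */, 0,
                          1, input_key, indptr, shape,
                          num_threads, preds.data());
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t)
    workers.emplace_back(predict_once, preds[t], std::cref(image));
  for (auto& w : workers) w.join();
  for (auto& p : preds) MXPredFree(p);
}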

Environment info (Required)

----------Python Info----------
('Version      :', '2.7.16')
('Compiler     :', 'GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.46.4)')
('Build        :', ('default', 'Jun 19 2019 07:40:37'))
('Arch         :', ('64bit', ''))
------------Pip Info-----------
('Version      :', '19.1.1')
('Directory    :', '/usr/local/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version      :', '1.5.0')
('Directory    :', '/Users/arcadia/work/mxnet-1.5/python/mxnet')
Commit hash file "/Users/arcadia/work/mxnet-1.5/python/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
('Library      :', ['/Users/arcadia/work/mxnet-1.5/python/mxnet/../../lib/libmxnet.so'])
Build features:
✖ CUDA
✖ CUDNN
✖ NCCL
✖ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✖ OPENMP
✖ SSE
✔ F16C
✖ JEMALLOC
✖ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✖ LAPACK
✔ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✖ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✖ SIGNAL_HANDLER
✖ DEBUG
----------System Info----------
('Platform     :', 'Darwin-18.2.0-x86_64-i386-64bit')
('system       :', 'Darwin')
('node         :', 'MacBookPro')
('release      :', '18.2.0')
('version      :', 'Darwin Kernel Version 18.2.0: Fri Oct  5 19:41:49 PDT 2018; root:xnu-4903.221.2~2/RELEASE_X86_64')

Build info (Required if built from source)

MXNet commit hash:
latest commit
4d07d78

Minimum reproducible example

I've slightly modified the cpp example to print the first 10 numbers of the output ndarray: download the code and replace the example/image-classification/predict-cpp folder with it.

To run the example, the changes in PR #15574 are needed to patch the mxnet code.

MXNET_ENGINE_TYPE=NaiveEngine ./image-classification-predict path/to/image [thread number]

The output

openblas 2 threads:

./model/mobilenetv2_0.25-symbol.json ... 146314 bytes
./model/mobilenetv2_0.25-0000.params ... 6135676 bytes
[02:11:17] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
3.16592 -2.01711 -3.29867 -10.9845 -1.56596 2.30844 -4.57165 -0.211675 0.300732 -4.35327
3.16592 -2.01711 -3.29867 -10.9845 -1.56596 2.30844 -4.57165 -0.211675 0.300732 -4.35327
run successfully

mkldnn 2 threads:

./model/mobilenetv2_0.25-symbol.json ... 146314 bytes
./model/mobilenetv2_0.25-0000.params ... 6135676 bytes
[02:12:40] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
-4.33902 -8.3967 -3.97791 -2.16607 -7.36041 -7.65456 -6.08846 -3.10948 -6.65086 4.17571
-0.0463818 0.440686 -0.907998 -7.17286 4.8312 -3.03199 4.31585 0.0692099 -0.844623 -8.1773
run successfully

The results change randomly with every execution.

@mxnet-label-bot (Contributor)

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@arcadiaphy (Member Author)

@pengzhao-intel

@marcoabreu (Contributor)

MXNet does not support multithreading at the interface level, not even lock-based access. The only way to use MXNet in a multi-threaded fashion is a job queue consumed by a single sticky thread.
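A minimal sketch of that pattern, assuming nothing beyond the C++ standard library: callers enqueue jobs from any thread, and the single sticky worker is the only thread that ever touches MXNet.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class StickyWorker {
 public:
  StickyWorker() : worker_([this] { Loop(); }) {}
  ~StickyWorker() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();  // drains remaining jobs, then exits
  }
  // Any thread may call this; the job runs on the single worker thread.
  void Submit(std::function<void()> job) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      jobs_.push(std::move(job));
    }
    cv_.notify_one();
  }

 private:
  void Loop() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
        if (done_ && jobs_.empty()) return;
        job = std::move(jobs_.front());
        jobs_.pop();
      }
      job();  // e.g. set input, forward, read output on one predictor
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> jobs_;
  bool done_ = false;
  std::thread worker_;  // declared last so it starts after the other members
};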

@pengzhao-intel (Contributor)

@wuxun-zhang @ZhennanQin please help take a look at this issue, thanks.

@arcadiaphy (Member Author) commented Jul 18, 2019

@pengzhao-intel @wuxun-zhang @ZhennanQin

The difference between MXPredCreateMultiThread and creating executors separately is that the former shares model parameters. I've verified that if weight sharing is disabled by moving the arg_arrays and aux_arrays creation into the per-thread loop, the result is correct even with MKLDNN:

  // One predictor per thread; each thread gets its own copy of the parameters.
  for (int i = 0; i < num_threads; i++) {
    std::vector<NDArray> arg_arrays, aux_arrays;
    // Copy the argument parameters into fresh NDArrays instead of sharing them.
    for (size_t j = 0; j < arg_shapes.size(); ++j) {
      NDArray nd = NDArray(arg_shapes[j], ctx);
      if (arg_params.count(arg_names[j]) != 0) {
        CopyFromTo(arg_params[arg_names[j]], &nd);
      }
      arg_arrays.push_back(nd);
    }
    // Same for the auxiliary states.
    for (size_t j = 0; j < aux_shapes.size(); ++j) {
      NDArray nd = NDArray(aux_shapes[j], ctx);
      if (aux_params.count(aux_names[j]) != 0) {
        CopyFromTo(aux_params[aux_names[j]], &nd);
      }
      aux_arrays.push_back(nd);
    }

    std::unique_ptr<MXAPIPredictor> ret(new MXAPIPredictor());
    ret->sym = sym;
    ret->ctx = ctx;
    ret->key2arg = key2arg;
    ret->arg_arrays = arg_arrays;
    ret->aux_arrays = aux_arrays;
    ret->out_shapes = out_shapes;

    if (!lazy) {
      // Bind an inference-only executor (null gradients) over the copies.
      std::map<std::string, Context> ctx_map;
      std::vector<NDArray> grad_store(arg_arrays.size());
      std::vector<OpReqType> grad_req(arg_arrays.size(), kNullOp);
      ret->exec.reset(Executor::Bind(sym, ctx, ctx_map,
                                     arg_arrays,
                                     grad_store, grad_req,
                                     aux_arrays));
      ret->out_arrays = ret->exec->outputs();
    }
    out[i] = ret.release();
  }

It's very strange: I thought the model parameters were read-only, so why does sharing them break the MKLDNN path?

@ZhennanQin (Contributor)

MKLDNN doesn't support multi-threading before v1.0 because it shares internal scratch memory across all operators, so running two MKLDNN operators simultaneously in the same process isn't guaranteed to produce correct results. For now we suggest switching to multiple instances with shared memory for multi-threading purposes.
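A toy picture of that limitation (illustrative only, not the actual MKL-DNN internals): every operator stages its intermediates in one process-wide scratch buffer, so two operators running at once trample each other's workspace.

#include <thread>
#include <vector>

static std::vector<float> scratch(4);  // shared by all "operators"

float run_op(float x) {
  for (auto& v : scratch) v = x;    // stage intermediates in the shared scratch
  float sum = 0;
  for (auto v : scratch) sum += v;  // another op may have overwritten it by now
  return sum;                       // correct only when run alone: 4 * x
}

int main() {
  float r1 = 0, r2 = 0;
  std::thread t1([&] { for (int i = 0; i < 10000; ++i) r1 = run_op(1.0f); });
  std::thread t2([&] { for (int i = 0; i < 10000; ++i) r2 = run_op(2.0f); });
  t1.join(); t2.join();             // r1 and r2 can come out wrong
}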

@arcadiaphy (Member Author) commented Jul 18, 2019

@ZhennanQin I think MKLDNN itself is OK with multi-threading; the likely reason is that calling MKLDNN-related methods like MKLDNNDataReorder and GetMKLDNNData on an ndarray simultaneously messes up the internal mkl_mem_. Also, I'm not sure whether mkl_mem_ can be shared across multiple computations.
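A toy illustration of the race class I mean (not MXNet code): two threads read through an object whose accessor lazily rewrites a cached buffer in place, the way GetMKLDNNData/MKLDNNDataReorder rewrite mkl_mem_, so each reader can observe the other's half-finished rewrite.

#include <thread>
#include <vector>

struct LazyTensor {
  std::vector<float> data{1, 2, 3, 4};
  bool reordered = false;
  // Unsynchronized in-place "reorder": fine single-threaded, racy otherwise.
  const std::vector<float>& Get(bool want_reordered) {
    if (want_reordered != reordered) {
      for (auto& v : data) v = -v;  // rewrites the shared storage in place
      reordered = want_reordered;
    }
    return data;
  }
};

int main() {
  LazyTensor shared;  // stands in for a shared model parameter
  std::thread a([&] { for (int i = 0; i < 1000; ++i) shared.Get(true); });
  std::thread b([&] { for (int i = 0; i < 1000; ++i) shared.Get(false); });
  a.join(); b.join();  // the final contents are nondeterministic
}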

@arcadiaphy (Member Author)

Similar race condition on ndarray: #9862

@arcadiaphy (Member Author) commented Jul 18, 2019

Looks like MKLDNNDataReorder and Reorder2Default alter the underlying data of the ndarray, so model parameters are not read-only after all. I've tried adding a mutex lock around MKLDNNDataReorder, and the problem vanishes.

@zheng-da, can we remove the memcpy in these two methods and make model weights truly read-only? That would fix the broken parallel inference example.
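The shape of that workaround, shown on the toy accessor from my earlier comment rather than the real NDArray::MKLDNNDataReorder (whose exact signature differs between versions): serialize the in-place reorder so concurrent readers can neither observe nor corrupt a half-rewritten buffer.

#include <mutex>
#include <vector>

struct LockedLazyTensor {
  std::vector<float> data{1, 2, 3, 4};
  bool reordered = false;
  std::mutex mu;
  const std::vector<float>& Get(bool want_reordered) {
    std::lock_guard<std::mutex> lock(mu);  // the added lock
    if (want_reordered != reordered) {
      for (auto& v : data) v = -v;
      reordered = want_reordered;
    }
    return data;
  }
};

This only serializes the mutation; removing the in-place rewrite, as asked above, would make the weights truly read-only and need no lock at all.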

@pengzhao-intel (Contributor)

@arcadiaphy did your local solution work for this issue?

@arcadiaphy (Member Author)

@pengzhao-intel My local solution works fine, but it's not suitable for an official PR.
