
Fix a race condition in converting data layouts in MKLDNN. #9862

Merged
merged 17 commits into apache:master on Mar 1, 2018

Conversation

@zheng-da (Contributor) commented Feb 22, 2018

Description

There is a race condition in data layout conversion in the MKLDNN implementation.
Currently, when NDArray::data() is called, it converts the data format (from the MKLDNN format to the default format) inside the NDArray. In the threaded execution engine, while an NDArray is being converted in one thread, the array can also be read by another thread; in that case, the other thread can read wrong data. A similar race condition may also exist in the MKLML integration, which sometimes generates garbage results.

Please see the details of the error in #9820
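To illustrate the failure mode, here is a minimal, self-contained sketch (hypothetical names, not MXNet's actual NDArray code) of a getter that converts the underlying storage in place and therefore races with a concurrent reader:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical array type; the real NDArray/MKLDNN code is far more involved.
struct Array {
  std::vector<float> storage{1.f, 2.f, 3.f, 4.f};  // imagine an MKLDNN-specific layout
  bool is_default_layout = false;

  // Problematic pattern: a "getter" that rewrites shared state in place.
  const float* data() {
    if (!is_default_layout) {
      // Stand-in for the layout conversion; a concurrent reader can observe
      // a half-converted buffer while this runs.
      std::reverse(storage.begin(), storage.end());
      is_default_layout = true;
    }
    return storage.data();
  }
};

int main() {
  Array a;
  std::thread converter([&] { a.data(); });  // thread 1: converts in place
  std::thread reader([&] {                   // thread 2: reads concurrently
    volatile float x = a.storage[0];
    (void)x;
  });
  converter.join();
  reader.join();
  return 0;
}
```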

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@zheng-da zheng-da changed the title Fix a race condition in converting data layouts. Fix a race condition in converting data layouts in MKLDNN. Feb 22, 2018
@marcoabreu (Contributor) left a comment

See my comments.

I have seen that you edited a test. Does this test cover the root of this problem or was it just a general improvement?

@@ -1017,6 +1017,7 @@ inline void CopyFromToDnsImpl(const NDArray& from, const NDArray& to, RunContext
  // with Copy().
  NDArray tmp_from = from;
  if (tmp_from.IsMKLDNNData()) {
    // TODO(zhengda) tmp_from should be cached.
Contributor

Please create an issue to track this

Contributor Author

Create a GitHub issue for this?

Contributor

Yes. Ideally there should be no TODOs at all, but in cases like this it's acceptable to have them; they must be tracked, though, as otherwise we lose the overview.

Contributor Author

The reason I ask is that I saw TODOs in many places in the code. Are all TODOs tracked in GitHub issues? Or is this the new practice we want to push? And is a GitHub issue the best place for us to keep track of these TODOs? I remember Sheng has a script that closes issues if they are inactive for 30 days or so.

Contributor

I think it's best practice and that we should lead by example :) In the last months I haven't seen any TODO (besides the ones in your MKLDNN PR) checked into MXNet.

We got rid of the script to automatically close issues. For now, GitHub issues are the right place to track these things.

Member

There are many decaffeinated brands out there that are just as tasty as the real thing.

Contributor

Whatever you guys are talking about here... I think it's a pretty basic thing that immutable means immutable and that a getter should never modify the underlying data in any way. I have no clue about C++, but I'm quite certain this constraint is clearly understood across the industry. Could we please get back to a productive discussion?
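As a small illustration of the point (generic names, nothing MXNet-specific): if the read accessor is declared const, the compiler itself rejects an accessor that modifies the object, and any layout conversion has to be an explicit, separately named mutation.

```cpp
#include <vector>

struct Array {
  std::vector<float> storage;
  bool is_default_layout = false;

  // Read-only accessor: declared const, so converting the layout in place
  // here would be a compile-time error.
  const float* data() const { return storage.data(); }

  // Mutation is an explicit, separately named operation instead.
  void ConvertLayoutInPlace() {
    // ... reorder `storage` into the default layout ...
    is_default_layout = true;
  }
};
```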

Contributor Author

@larroy I agree with you that the current implementation of how NDArray handles storage is messy. What you describe requires a rewrite of NDArray. @piiswrong and I talked about this, but I don't think it should be in this PR. We need to be very careful when rewriting NDArray, considering that NDArray is one of the core components in MXNet.

Contributor

@zheng-da I didn't mean it to be included in this PR. I was just asking. Anyway thanks for your responses. I didn't get the quote about the brands.

Contributor

Open TODO

@with_seed()
def test_inference():
    all_models = ['resnet50_v1', 'vgg19_bn', 'alexnet', #'inceptionv3',
                  'densenet201', 'squeezenet1.0', 'mobilenet0.25']
    mx.random.seed(2)
Contributor

please don't use mx.random.seed due to the introduction of the @with_seed() decorator

Contributor Author

Does with_seed() set the same seed for each run?

Member

Why would you want to set it the same for each run?

@@ -105,6 +107,7 @@ def test_training():
    # TODO(zhengda) mobilenet can't pass this test even without MKLDNN.
    all_models = ['resnet18_v1', 'densenet121']

    mx.random.seed(1)
Contributor

please don't use mx.random.seed due to the introduction of the @with_seed() decorator

@zheng-da (Contributor Author)

@marcoabreu Previously, I disabled the inference test. Now I have enabled all tests. I also added some prints to make it clear whether the CPU or the GPU generates wrong results.

  in_blobs[i] = inputs[i].data();
} else {
  in_bufs.emplace_back(inputs[i].shape(), inputs[i].ctx(),
Member

nit: it would be nice to call in_bufs.reserve(in_blobs.size()) on this first pass (i.e. if in_bufs.empty()), if there is often more than one element to be inserted into in_bufs.

Contributor Author

I've seen reserve() being used in many places in MXNet. I've always wondered how much benefit it brings. Or is this a convention?

Contributor

It prevents reallocations, heap fragmentation, and contention. If you know the size, it's definitely good to call reserve() in advance.

Member

When you add items to a vector and it runs out of space (which can happen as early as the second item), it allocates memory for 2x the current number of items, copies all of the existing items into the new space (copy constructor or move, depending on a number of things), and then frees the old space. This can happen over and over again, and the whole reallocate-and-copy can be very expensive; these reallocations even tend to show up as hotspots in VTune. So if you know how many items a vector may have, it's always a good idea to call reserve(), because otherwise you are guaranteed a lot of extra reallocating and copying.

Member

implementations vary, but this gives a general idea:

http://www.drdobbs.com/c-made-easier-how-vectors-grow/184401375
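A minimal sketch of the suggested pattern (hypothetical element type and data; the real in_bufs loop in this PR is more involved): reserving once up front means emplace_back never has to reallocate and copy earlier elements.

```cpp
#include <string>
#include <vector>

int main() {
  std::vector<std::string> inputs = {"a", "b", "c", "d"};  // stand-in for the real inputs
  std::vector<std::string> in_bufs;

  // Reserve once, before the loop, so no reallocation happens during the inserts.
  in_bufs.reserve(inputs.size());
  for (const auto& in : inputs) {
    in_bufs.emplace_back(in);  // appends without reallocating or moving earlier elements
  }
  return 0;
}
```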

} else {
  in_bufs.emplace_back(inputs[i].shape(), inputs[i].ctx(),
                       false, inputs[i].dtype());
  auto mem = inputs[i].GetMKLDNNData();
Member

nit: auto should be used for obvious types (i.e. auto foo = static_cast<Bar *>(bar)) or for things like iterators whose types are super-long. This is just a simple var. For readability here, it'd be awesome to see what this type actually is.

@with_seed()
def test_inference():
    all_models = ['resnet50_v1', 'vgg19_bn', 'alexnet', #'inceptionv3',
                  'densenet201', 'squeezenet1.0', 'mobilenet0.25']
    mx.random.seed(2)
Member

Why would you want to set it the same for each run?

@marcoabreu (Contributor)

But would this re-enabled test have caught this error? Please explain how we're preventing this problem from recurring.

@zheng-da (Contributor Author)

@cjolivier01 The seed is set so that we know what the expected result is. It makes it easier to tell whether the CPU or the GPU computes wrong results.

@zheng-da (Contributor Author)

@marcoabreu Enabling the tests makes it easier to catch the error.

@marcoabreu (Contributor)

We don't need fixed seeds since #9791 has been merged. Just make sure that the with_seed decorator is properly applied.

@marcoabreu (Contributor) left a comment

Remove custom seed

@@ -37,8 +37,7 @@ def download_data():
    return mx.test_utils.download(
        'http://data.mxnet.io/data/val-5k-256.rec', VAL_DATA)

@unittest.skip("test fails intermittently. temporarily disabled.")
@with_seed()
@with_seed(1)
Contributor

Please remove custom seed

@@ -99,7 +100,7 @@ def get_nn_model(name):
# Seed 1521019752 produced a failure on the Py2 MKLDNN-GPU CI runner
# on 2/16/2018 that was not reproducible. Problem could be timing related or
# based on non-deterministic algo selection.
@with_seed()
@with_seed(1)
Contributor

Please remove custom seed

@marcoabreu (Contributor)

I'm afraid I still don't really understand. I see that you have re-enabled this test, but unfortunately I'm lacking deep technical knowledge of the underlying operations, so would you mind explaining to me how this re-enabled test would have caught this failure?

@zheng-da (Contributor Author)

The reason I disabled the inference tests is that I previously thought the failure was related to numeric errors and that these were invalid tests. Now it's clear to me that the failures are caused by the race condition. It's very hard to reproduce the failure, as you have probably seen. Adding the tests back increases the chance of reproducing the failure if the race condition still exists.

@marcoabreu (Contributor) commented Feb 23, 2018

Would it be possible to create a test that introduces a delay or uses a for-loop in order to force this race condition?

@zheng-da (Contributor Author)

It's very difficult to reproduce a race condition in a deterministic way, if it's possible at all.

@cjolivier01 (Member)

If you run the test in a loop (from Python) 1000 times, does it fail?

@cjolivier01 (Member) left a comment

.

@zheng-da (Contributor Author)

It seems the current modification still can't get rid of all race conditions in the code. The reason is that we want to reorder the data of the weight arrays in place, so that we can avoid reordering them again during inference. Right now, if I run the tests hundreds of times, I might see an inconsistency once. @piiswrong and I have discussed a solution to get rid of the remaining race conditions; it should work.
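For illustration only (hypothetical types; the PR itself handles weight reordering asynchronously through the execution engine, per the commit "Reorder weight arrays in convolution async"), here is why an unguarded in-place reorder of a shared weight buffer is racy, while reordering into a fresh buffer and publishing it in one step is not:

```cpp
#include <atomic>
#include <memory>
#include <vector>

using Buffer = std::vector<float>;

// Stand-in for a layout reorder: returns a new buffer in the target layout.
Buffer Reorder(const Buffer& src) { return Buffer(src.rbegin(), src.rend()); }

struct Weights {
  std::shared_ptr<Buffer> data = std::make_shared<Buffer>(Buffer{1.f, 2.f, 3.f, 4.f});

  // Racy: rewrites the very buffer other threads may be reading right now.
  void ReorderInPlace() { *data = Reorder(*data); }

  // Safer: build the reordered copy first, then publish the pointer in one step;
  // readers keep using whichever complete buffer they already hold.
  void ReorderAndPublish() {
    auto cur = std::atomic_load(&data);
    auto fresh = std::make_shared<Buffer>(Reorder(*cur));
    std::atomic_store(&data, fresh);
  }

  // Readers take a snapshot of the current buffer pointer.
  std::shared_ptr<Buffer> Snapshot() const { return std::atomic_load(&data); }
};
```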

@cjolivier01 (Member)

One thing I've done in the past to try to make race conditions happen more often is to change the process affinity mask, for example forcing the process to use only one or two CPUs. This is separate from OMP.

@larroy (Contributor) commented Feb 23, 2018

Is this need to reorder data documented somewhere?

@zheng-da (Contributor Author) commented Feb 23, 2018

@cjolivier01 why does the race condition happen more frequently when the threads run on a smaller number of CPU cores? It seems to me that it would happen more often with more CPU cores, because we need two threads to access the same data simultaneously.

@zheng-da (Contributor Author) commented Feb 23, 2018

@cjolivier01 (Member)

@zheng-da Well, you can vary the cores up or down just to change the timing, and try to push it to a point where the race condition is especially aggravated and happens more often.

I have found, however, that reducing the number of cores that actually run the program (not the number of threads) can cause more overlap between threads. By this I mean that each thread gets a smaller timeslice of actual execution, so things that might "overlap" tend to overlap more frequently. Think of two trains passing each other in opposite directions (the threads). If the trains are going fast, they are only next to each other for a short period of time. But if the two trains are going slowly, then they are next to each other for a much longer period of time.

@cjolivier01 (Member)

That's not to say the same trick (fewer cores) works for every race condition. But varying the processor affinity in either direction can help to flush out race conditions in general.
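A minimal Linux-only sketch of the affinity trick (the chosen core and the empty worker bodies are placeholders): pinning the whole process to a single CPU before starting the threads changes the interleaving and can make a latent race fire more often.

```cpp
// Compile with: g++ -pthread affinity.cc
// (g++ on glibc defines _GNU_SOURCE by default; otherwise define it before <sched.h>.)
#include <sched.h>

#include <cstdio>
#include <thread>

int main() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(0, &mask);  // restrict execution to CPU 0 only
  // pid 0 == calling thread; threads spawned afterwards inherit the mask.
  if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
    std::perror("sched_setaffinity");
    return 1;
  }

  // Replace the lambdas with the real racy workload (e.g. the test run in a loop).
  std::thread t1([] { /* worker A */ });
  std::thread t2([] { /* worker B */ });
  t1.join();
  t2.join();
  return 0;
}
```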

@TaoLv (Member) commented Feb 27, 2018

@zheng-da, if needed, please also update the design document for mkldnn integration. Thanks.

@zheng-da (Contributor Author)

@TaoLv I have updated the design doc to explain why we need data layout conversion.


CHECK(shandle.size >= pd.get_size());
CheckAndAlloc(pd.get_size());
// TODO(zhengda) We need to avoid memory copy here.
Contributor

Open TODO

Contributor Author

This is from the previous PR. I just moved code here.

@marcoabreu (Contributor) left a comment

There are still two open TODOs. They sound like they could harm performance due to unnecessary memory reads/copies. If they're not on the critical path or have no big impact on performance, I'm fine with merging; otherwise I'd like to see them resolved.

@zheng-da (Contributor Author) commented Feb 28, 2018

@marcoabreu Reorder2Default and MKLDNNDataReorder shouldn't be called frequently; they are not on the critical path. The whole point of this PR is to remove even more invocations of these two methods.

Creating temporary arrays isn't on the critical path either. It's used in a very special case: copying MKLDNN data to GPU memory.

@pengzhao-intel (Contributor) commented Feb 28, 2018

@zheng-da @marcoabreu @cjolivier01 because there are lots of changes to MKL-DNN, please kindly wait a moment before merging.

@juliusshufan will test the performance and coverage of these changes.

@zheng-da (Contributor Author)

@cjolivier01 do you have more comments?
@piiswrong do you want to review the code?

The PR should have fixed the bug in #9820.
I ran tests/python/gpu/test_gluon_model_zoo_gpu.py again, 1000 times, and didn't see any failure. I also ran tests/python/gpu/test_operator_gpu.py many times (~40 times so far) and saw two failures, but neither of them is caused by this race condition. The failures in test_operator_gpu.py are caused by floating-point precision. I'll report them in another issue.

@piiswrong (Contributor)

LGTM.
@pengzhao-intel have you finished testing?

@cjolivier01 (Member)

LGTM

@juliusshufan (Contributor) commented Mar 1, 2018

@zheng-da @cjolivier01 @piiswrong @pengzhao-intel sorry for the late response.
The test result is positive, with 300 epochs, resnet50 + cifar10:
top-1 validation accuracy is 0.923478
top-1 training accuracy is 0.987500

Thanks.

@pengzhao-intel (Contributor)

The testing is done. There is NO coverage or performance regression issue from our tests.
I think the code is qualified to merge, @piiswrong :)

@juliusshufan will update the performance data here soon.

Thanks to @zheng-da for the great work.

@marcoabreu merged commit f9c2689 into apache:master on Mar 1, 2018
@marcoabreu (Contributor)

Thanks a lot everybody!

@juliusshufan (Contributor) commented Mar 1, 2018

Updating the inference performance comparison: a slight drop on the ResNet network, no notable changes on the other networks.

[image: inference performance comparison]

@zheng-da deleted the fix_mkldnn_bugs branch March 5, 2018 16:02
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
* Fix a race condition in converting data layouts.

* Avoid calling data() in elemwise sum.

* Fix a compilation error.

* Address comments.

* avoid data layout conversion inside ndarray.

* Fix a compilation error.

* address comments.

* Reorder weight arrays in convolution async.

* Fix async data reordering in NDArray.

* Fix race condition in deconv.

* Update ndarray.cc

* Check more in NDArray.

* Fix a bug in MKLDNNDataReorder.

* Fix a bug in NDArray.

* Simplify weight reorder in (de-)conv.
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da added a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018