
Fix a race condition in converting data layouts in MKLDNN. #9862

Merged
merged 17 commits into apache:master on Mar 1, 2018

Conversation

@zheng-da (Contributor) commented Feb 22, 2018

Description

There is a race condition in data layout conversion in the MKLDNN implementation.
Currently, when NDArray::data() is called, it converts the data format (from the MKLDNN format to the default format) inside the NDArray. In the threaded execution engine, while an NDArray is being converted in one thread, the array can also be read by another thread; in that case, the other thread can read wrong data. A similar race condition may also exist in the MKLML integration, which sometimes generates garbage results.

Please see the details of the error in #9820
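To illustrate the failure mode, here is a minimal, self-contained sketch (hypothetical names, not MXNet's actual NDArray code) of a getter that converts the underlying storage in place and therefore races with a concurrent reader:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical array type; the real NDArray/MKLDNN code is far more involved.
struct Array {
  std::vector<float> storage{1.f, 2.f, 3.f, 4.f};  // imagine an MKLDNN-specific layout
  bool is_default_layout = false;

  // Problematic pattern: a "getter" that rewrites shared state in place.
  const float* data() {
    if (!is_default_layout) {
      // Stand-in for the layout conversion; a concurrent reader can observe
      // a half-converted buffer while this runs.
      std::reverse(storage.begin(), storage.end());
      is_default_layout = true;
    }
    return storage.data();
  }
};

int main() {
  Array a;
  std::thread converter([&] { a.data(); });  // thread 1: converts in place
  std::thread reader([&] {                   // thread 2: reads concurrently
    volatile float x = a.storage[0];
    (void)x;
  });
  converter.join();
  reader.join();
  return 0;
}
```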

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@zheng-da zheng-da changed the title Fix a race condition in converting data layouts. Fix a race condition in converting data layouts in MKLDNN. Feb 22, 2018
@marcoabreu (Contributor) left a comment

See my comments.

I have seen that you edited a test. Does this test cover the root of this problem or was it just a general improvement?

@@ -1017,6 +1017,7 @@ inline void CopyFromToDnsImpl(const NDArray& from, const NDArray& to, RunContext
  // with Copy().
  NDArray tmp_from = from;
  if (tmp_from.IsMKLDNNData()) {
    // TODO(zhengda) tmp_from should be cached.
Contributor

Please create an issue to track this

Contributor Author

Create a GitHub issue for this?

Contributor

Yes. Ideally there should be no TODOs at all, but in cases like this it's acceptable to have them; they must be tracked, though, as otherwise we lose the overview.

Contributor Author

The reason I ask is that I saw TODOs in many places in the code. Are all TODOs tracked in GitHub issues? Or is this the new practice we want to push? And is a GitHub issue the best place for us to keep track of these TODOs? I remember Sheng has a script that closes issues if they are inactive for 30 days or so.

Contributor

I think it's best practice and that we should lead by example :) In the last months I haven't seen any TODO (besides the ones in your MKLDNN PR) checked into MXNet.

We got rid of the script to automatically close issues. For now, GitHub issues are the right place to track these things.

Member

There are many decaffeinated brands out there that are just as tasty as the real thing.

Contributor

Whatever you guys are talking about here... I think it's a pretty basic thing that immutable means immutable and that a getter should never modify the underlying data in any way. I have no clue about C++, but I'm quite certain this constraint is clearly understood across the industry. Could we please get back to a productive discussion?
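As a small illustration of the point (generic names, nothing MXNet-specific): if the read accessor is declared const, the compiler itself rejects an accessor that modifies the object, and any layout conversion has to be an explicit, separately named mutation.

```cpp
#include <vector>

struct Array {
  std::vector<float> storage;
  bool is_default_layout = false;

  // Read-only accessor: declared const, so converting the layout in place
  // here would be a compile-time error.
  const float* data() const { return storage.data(); }

  // Mutation is an explicit, separately named operation instead.
  void ConvertLayoutInPlace() {
    // ... reorder `storage` into the default layout ...
    is_default_layout = true;
  }
};
```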

Contributor Author

@larroy I agree with you that the current implementation of how NDArray handles storage is messy. What you describe requires a rewrite of NDArray. @piiswrong and I talked about this, but I don't think it should be in this PR. We need to be very careful when rewriting NDArray, considering that NDArray is one of the core components in MXNet.

Contributor

@zheng-da I didn't mean it to be included in this PR. I was just asking. Anyway thanks for your responses. I didn't get the quote about the brands.

Contributor

Open TODO

@with_seed()
def test_inference():
    all_models = ['resnet50_v1', 'vgg19_bn', 'alexnet', #'inceptionv3',
                  'densenet201', 'squeezenet1.0', 'mobilenet0.25']
    mx.random.seed(2)
Contributor

please don't use mx.random.seed due to the introduction of the @with_seed() decorator

Contributor Author

Does with_seed() set the same seed for each run?

Member

Why would you want to set it the same for each run?

@@ -105,6 +107,7 @@ def test_training():
    # TODO(zhengda) mobilenet can't pass this test even without MKLDNN.
    all_models = ['resnet18_v1', 'densenet121']

    mx.random.seed(1)
Contributor

please don't use mx.random.seed due to the introduction of the @with_seed() decorator

@zheng-da (Contributor Author)

@marcoabreu Previously, I disabled the inference test. Now I have enabled all tests. I also added some prints to make it clear whether the CPU or the GPU generates wrong results.

  in_blobs[i] = inputs[i].data();
} else {
  in_bufs.emplace_back(inputs[i].shape(), inputs[i].ctx(),
Member

nit: it would be nice to call in_bufs.reserve(in_blobs.size()) on this first pass (i.e. if in_bufs.empty()), if there is often more than one element to be inserted into in_bufs.

Contributor Author

I've seen reserve() being used in many places in MXNet. I've always wondered how much benefit it brings. Or is this a convention?

Contributor

It prevents reallocations, heap fragmentation, and contention. If you know the size, it's definitely good to call reserve() in advance.

Member

When you add items to a vector and it runs out of space (which can happen as early as the second item), it allocates memory for 2x the current number of items, copies all of the existing items into the new space (copy constructor or move, depending on a number of things), and then frees the old space. This can happen over and over again, and the whole reallocate-and-copy can be very expensive; these reallocations even tend to show up as hotspots in VTune. So if you know how many items a vector may have, it's always a good idea to call reserve(), because otherwise you are guaranteed a lot of extra reallocating and copying.

Member

implementations vary, but this gives a general idea:

http://www.drdobbs.com/c-made-easier-how-vectors-grow/184401375
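A minimal sketch of the suggested pattern (hypothetical element type and data; the real in_bufs loop in this PR is more involved): reserving once up front means emplace_back never has to reallocate and copy earlier elements.

```cpp
#include <string>
#include <vector>

int main() {
  std::vector<std::string> inputs = {"a", "b", "c", "d"};  // stand-in for the real inputs
  std::vector<std::string> in_bufs;

  // Reserve once, before the loop, so no reallocation happens during the inserts.
  in_bufs.reserve(inputs.size());
  for (const auto& in : inputs) {
    in_bufs.emplace_back(in);  // appends without reallocating or moving earlier elements
  }
  return 0;
}
```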

} else {
  in_bufs.emplace_back(inputs[i].shape(), inputs[i].ctx(),
                       false, inputs[i].dtype());
  auto mem = inputs[i].GetMKLDNNData();
Member

nit: auto should be used for obvious types (i.e. auto foo = static_cast<Bar *>(bar)) or for things like iterators whose types are super-long. This is just a simple var. For readability here, it'd be awesome to see what this type actually is.

@with_seed()
def test_inference():
    all_models = ['resnet50_v1', 'vgg19_bn', 'alexnet', #'inceptionv3',
                  'densenet201', 'squeezenet1.0', 'mobilenet0.25']
    mx.random.seed(2)
Member

Why would you want to set it the same for each run?

@marcoabreu (Contributor)

But would this re-enabled test have caught this error? Please explain how we're preventing this problem from recurring.

@zheng-da (Contributor Author)

@cjolivier01 The seed is set so that we know what the expected result is. It makes it easier to tell whether the CPU or the GPU computes wrong results.

@zheng-da (Contributor Author)

@marcoabreu Enabling the tests makes it easier to catch the error.

@marcoabreu (Contributor)

We don't need fixed seeds since #9791 has been merged. Just make sure that the with_seed decorator is properly applied.

@marcoabreu (Contributor) left a comment

Remove custom seed

@@ -37,8 +37,7 @@ def download_data():
    return mx.test_utils.download(
        'http://data.mxnet.io/data/val-5k-256.rec', VAL_DATA)

@unittest.skip("test fails intermittently. temporarily disabled.")
@with_seed()
@with_seed(1)
Contributor

Please remove custom seed

@@ -99,7 +100,7 @@ def get_nn_model(name):
# Seed 1521019752 produced a failure on the Py2 MKLDNN-GPU CI runner
# on 2/16/2018 that was not reproducible. Problem could be timing related or
# based on non-deterministic algo selection.
@with_seed()
@with_seed(1)
Contributor

Please remove custom seed

@marcoabreu (Contributor)

I'm afraid I still don't really understand. I see that you have re-enabled this test, but unfortunately I'm lacking deep technical knowledge of the underlying operations, so would you mind explaining to me how this re-enabled test would have caught this failure?

@zheng-da (Contributor Author)

The reason I disabled the inference tests is that I previously thought the failure was related to numeric errors and that these were invalid tests. Now it's clear to me that the failures are caused by the race condition. It's very hard to reproduce the failure, as you have probably seen. Adding the tests back increases the chance of reproducing the failure if the race condition still exists.

@marcoabreu (Contributor) commented Feb 23, 2018

Would it be possible to create a test that introduces a delay or uses a for-loop in order to force this race condition?

@zheng-da (Contributor Author)

It's very difficult to reproduce a race condition in a deterministic way, if it's possible at all.

@cjolivier01 (Member)

If you run the test in a loop (from Python) 1000 times, does it fail?

@cjolivier01 (Member) left a comment

.

@zheng-da (Contributor Author)

It seems the current modification still can't get rid of all race conditions in the code. The reason is that we want to reorder the data of the weight arrays in place, so that we can avoid reordering them again during inference. Right now, if I run the tests hundreds of times, I might see an inconsistency once. @piiswrong and I have discussed a solution to get rid of the remaining race conditions; it should work.
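For illustration only (hypothetical types; the PR itself handles weight reordering asynchronously through the execution engine, per the commit "Reorder weight arrays in convolution async"), here is why an unguarded in-place reorder of a shared weight buffer is racy, while reordering into a fresh buffer and publishing it in one step is not:

```cpp
#include <atomic>
#include <memory>
#include <vector>

using Buffer = std::vector<float>;

// Stand-in for a layout reorder: returns a new buffer in the target layout.
Buffer Reorder(const Buffer& src) { return Buffer(src.rbegin(), src.rend()); }

struct Weights {
  std::shared_ptr<Buffer> data = std::make_shared<Buffer>(Buffer{1.f, 2.f, 3.f, 4.f});

  // Racy: rewrites the very buffer other threads may be reading right now.
  void ReorderInPlace() { *data = Reorder(*data); }

  // Safer: build the reordered copy first, then publish the pointer in one step;
  // readers keep using whichever complete buffer they already hold.
  void ReorderAndPublish() {
    auto cur = std::atomic_load(&data);
    auto fresh = std::make_shared<Buffer>(Reorder(*cur));
    std::atomic_store(&data, fresh);
  }

  // Readers take a snapshot of the current buffer pointer.
  std::shared_ptr<Buffer> Snapshot() const { return std::atomic_load(&data); }
};
```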

@cjolivier01 (Member)

One thing I've done in the past to try to make race conditions happen more often is to change the process affinity mask, for example forcing the process to use only one or two CPUs. This is separate from OMP.

@larroy (Contributor) commented Feb 23, 2018

Is this need to reorder data documented somewhere?

@zheng-da (Contributor Author) commented Feb 23, 2018

@cjolivier01 why does the race condition happen more frequently when the threads run on a smaller number of CPU cores? It seems to me that it would happen more often with more CPU cores, because we need two threads to access the same data simultaneously.

@zheng-da (Contributor Author) commented Feb 23, 2018

@cjolivier01 (Member)

@zheng-da Well, you can vary the cores up or down just to change the timing, and try to push it to a point where the race condition is especially aggravated and happens more often.

I have found, however, that reducing the number of cores that actually run the program (not the number of threads) can cause more overlap between threads. By this I mean that each thread gets a smaller timeslice of actual execution, so things that might "overlap" tend to overlap more frequently. Think of two trains passing each other in opposite directions (the threads). If the trains are going fast, they are only next to each other for a short period of time. But if the two trains are going slowly, then they are next to each other for a much longer period of time.

@cjolivier01 (Member)

That's not to say the same trick (fewer cores) works for every race condition. But varying the processor affinity in either direction can help to flush out race conditions in general.
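A minimal Linux-only sketch of the affinity trick (the chosen core and the empty worker bodies are placeholders): pinning the whole process to a single CPU before starting the threads changes the interleaving and can make a latent race fire more often.

```cpp
// Compile with: g++ -pthread affinity.cc
// (g++ on glibc defines _GNU_SOURCE by default; otherwise define it before <sched.h>.)
#include <sched.h>

#include <cstdio>
#include <thread>

int main() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(0, &mask);  // restrict execution to CPU 0 only
  // pid 0 == calling thread; threads spawned afterwards inherit the mask.
  if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
    std::perror("sched_setaffinity");
    return 1;
  }

  // Replace the lambdas with the real racy workload (e.g. the test run in a loop).
  std::thread t1([] { /* worker A */ });
  std::thread t2([] { /* worker B */ });
  t1.join();
  t2.join();
  return 0;
}
```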

@TaoLv (Member) commented Feb 27, 2018

@zheng-da, if needed, please also update the design document for mkldnn integration. Thanks.

@zheng-da (Contributor Author)

@TaoLv I have updated the design doc to explain why we need data layout conversion.


CHECK(shandle.size >= pd.get_size());
CheckAndAlloc(pd.get_size());
// TODO(zhengda) We need to avoid memory copy here.
Contributor

Open TODO

Contributor Author

This is from the previous PR. I just moved code here.

@marcoabreu (Contributor) left a comment

There are still two open TODOs. They sound like they could harm performance due to unnecessary memory reads/copies. If they're not on the critical path or have no big impact on performance, I'm fine with merging; otherwise I'd like to see them resolved.

@zheng-da (Contributor Author) commented Feb 28, 2018

@marcoabreu Reorder2Default and MKLDNNDataReorder shouldn't be called frequently; they are not on the critical path. The whole point of this PR is to remove even more invocations of these two methods.

Creating temporary arrays isn't on the critical path either. It's used in a very special case: copying MKLDNN data to GPU memory.

@pengzhao-intel (Contributor) commented Feb 28, 2018

@zheng-da @marcoabreu @cjolivier01 because there are lots of changes to MKL-DNN, please kindly wait a moment before merging.

@juliusshufan will test the performance and coverage of these changes.

@zheng-da (Contributor Author)

@cjolivier01 do you have more comments?
@piiswrong do you want to review the code?

The PR should have fixed the bug in #9820.
I ran tests/python/gpu/test_gluon_model_zoo_gpu.py again, 1000 times, and didn't see any failure. I also ran tests/python/gpu/test_operator_gpu.py many times (~40 times so far) and saw two failures, but neither of them is caused by this race condition. The failures in test_operator_gpu.py are caused by floating-point precision. I'll report them in another issue.

@piiswrong (Contributor)

LGTM.
@pengzhao-intel have you finished testing?

@cjolivier01 (Member)

LGTM

@juliusshufan (Contributor) commented Mar 1, 2018

@zheng-da @cjolivier01 @piiswrong @pengzhao-intel sorry for the late response.
The test result is positive, with 300 epochs, resnet50 + cifar10:
top-1 validation accuracy is 0.923478
top-1 training accuracy is 0.987500

Thanks.

@pengzhao-intel (Contributor)

The testing is done. There is NO coverage or performance regression issue from our tests.
I think the code is qualified to merge, @piiswrong :)

@juliusshufan will update the performance data here soon.

Thanks to @zheng-da for the great work.

@marcoabreu merged commit f9c2689 into apache:master on Mar 1, 2018
@marcoabreu (Contributor)

Thanks a lot everybody!

@juliusshufan (Contributor) commented Mar 1, 2018

Updating the inference performance comparison: a slight drop on the ResNet network, no notable changes on the other networks.

[image: inference performance comparison]

@zheng-da deleted the fix_mkldnn_bugs branch March 5, 2018 16:02
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
* Fix a race condition in converting data layouts.

* Avoid calling data() in elemwise sum.

* Fix a compilation error.

* Address comments.

* avoid data layout conversion inside ndarray.

* Fix a compilation error.

* address comments.

* Reorder weight arrays in convolution async.

* Fix async data reordering in NDArray.

* Fix race condition in deconv.

* Update ndarray.cc

* Check more in NDArray.

* Fix a bug in MKLDNNDataReorder.

* Fix a bug in NDArray.

* Simplify weight reorder in (de-)conv.
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da added a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018