Fix a race condition in converting data layouts in MKLDNN. #9862
Conversation
See my comments.
I have seen that you edited a test. Does this test cover the root of this problem or was it just a general improvement?
@@ -1017,6 +1017,7 @@ inline void CopyFromToDnsImpl(const NDArray& from, const NDArray& to, RunContext
  // with Copy().
  NDArray tmp_from = from;
  if (tmp_from.IsMKLDNNData()) {
    // TODO(zhengda) tmp_from should be cached.
Please create an issue to track this
Create a github issue for this?
Yes. Ideally there would be no TODOs at all, but in cases like this it's acceptable to have them as long as they are tracked; otherwise we lose the overview.
The reason I ask is that I've seen TODOs in many places in the code. Are all TODOs tracked in GitHub issues, or is this a new practice we want to push? And are GitHub issues the best place for us to keep track of these TODOs? I remember Sheng has a script that closes issues if they are inactive for 30 days or so.
I think it's best practice and that we should lead by example :) In the last few months I haven't seen any TODOs (besides the ones in your MKLDNN PRs) checked into MXNet.
We got rid of the script to automatically close issues. For now, GitHub issues are the right place to track these things.
There are many decaffeinated brands out there that are just as tasty as the real thing.
Whatever you guys are talking about here... I think it's a pretty basic thing that immutable means immutable and that a getter should never modify the underlying data in any way. I have no clue about C++, but I'm quite certain that this constraint is clearly defined across the industry. Could we please get back to a productive discussion?
@larroy I agree with you that the current implementation of storage handling in NDArray is messy. What you describe requires a rewrite of NDArray. @piiswrong and I talked about this, but I don't think it should be in this PR. We need to be very careful when rewriting NDArray, considering NDArray is one of the core components in MXNet.
@zheng-da I didn't mean it to be included in this PR. I was just asking. Anyway thanks for your responses. I didn't get the quote about the brands.
Open TODO
@with_seed()
def test_inference():
    all_models = ['resnet50_v1', 'vgg19_bn', 'alexnet', #'inceptionv3',
                  'densenet201', 'squeezenet1.0', 'mobilenet0.25']
    mx.random.seed(2)
please don't use mx.random.seed due to the introduction of the @with_seed() decorator
Does with_seed() set the same seed for each run?
Why would you want to set it the same for each run?
@@ -105,6 +107,7 @@ def test_training():
    # TODO(zhengda) mobilenet can't pass this test even without MKLDNN.
    all_models = ['resnet18_v1', 'densenet121']

    mx.random.seed(1)
please don't use mx.random.seed due to the introduction of the @with_seed() decorator
@marcoabreu previously, I disabled the inference test. Now I enabled all tests. I also added some prints to clearly see whether CPU or GPU generates wrong results.
        in_blobs[i] = inputs[i].data();
      } else {
        in_bufs.emplace_back(inputs[i].shape(), inputs[i].ctx(),
nit: it would be nice to call in_bufs.reserve(in_blobs.size()) on this first pass (i.e. if in_bufs.empty()), if there is often more than one in_bufs entry to be inserted.
I saw reserve() being used in many places in MXNet. I'm always wondering how much benefit it brings. Or is this a convention?
It prevents reallocations, heap fragmentation, and allocator contention. If you know the size, it's definitely good to call reserve() in advance.
When you add items to a vector and it runs out of space (which can happen as early as the second item), it allocates memory for roughly 2x the current number of items, copies all of the items into the new space (copy constructor or move, depending on a number of things), and then frees the old space. This can happen over and over again, and the whole reallocate-and-copy can be very expensive; these reallocations even tend to show up as hotspots in VTune. So if you know how many items a vector may hold, it's always a good idea to call reserve(), because otherwise you are guaranteed a lot of extra reallocating and copying.
implementations vary, but this gives a general idea:
http://www.drdobbs.com/c-made-easier-how-vectors-grow/184401375
} else {
  in_bufs.emplace_back(inputs[i].shape(), inputs[i].ctx(),
                       false, inputs[i].dtype());
  auto mem = inputs[i].GetMKLDNNData();
nit: auto should be used for obvious types (i.e. auto foo = static_cast<Bar *>(bar)) or stuff like iterators which are super-long. This is just a simple var. For readability here, it'd be awesome to see what this type actually is.
@with_seed()
def test_inference():
    all_models = ['resnet50_v1', 'vgg19_bn', 'alexnet', #'inceptionv3',
                  'densenet201', 'squeezenet1.0', 'mobilenet0.25']
    mx.random.seed(2)
Why would you want to set it the same for each run?
But would this re-enabled test have caught this error? Please explain how we're preventing this problem from reoccurring.
@cjolivier01 The seed is set so we know what the expected result is. It makes it easier to tell whether the CPU or the GPU computes wrong results.
@marcoabreu enabling the tests can catch the error more easily.
We don't need fixed seeds since #9791 has been merged. Just make sure that the with_seed decorator is properly applied.
Remove custom seed
@@ -37,8 +37,7 @@ def download_data():
    return mx.test_utils.download(
        'http://data.mxnet.io/data/val-5k-256.rec', VAL_DATA)

@unittest.skip("test fails intermittently. temporarily disabled.")
@with_seed()
@with_seed(1)
Please remove custom seed
@@ -99,7 +100,7 @@ def get_nn_model(name):
# Seed 1521019752 produced a failure on the Py2 MKLDNN-GPU CI runner
# on 2/16/2018 that was not reproducible. Problem could be timing related or
# based on non-deterministic algo selection.
@with_seed()
@with_seed(1)
Please remove custom seed
I'm afraid I still don't really understand. I see that you have re-enabled this test, but unfortunately I'm lacking deep technical knowledge of the underlying operations, so would you mind explaining to me how this re-enabled test would have caught this failure?
The reason I disabled the inference tests is that I previously thought the failures were related to numeric errors and that the tests were invalid. Now it's clear to me that the failures are caused by a race condition. It's very hard to reproduce the failure, as you have probably seen. Adding the tests back increases the chance of reproducing the failure if the race condition still exists.
Would it be possible to create a test that introduces a delay or uses a for-loop in order to force this race condition?
It's very difficult to reproduce a race condition in a deterministic way, if it's possible at all.
If you run the test in a loop (from Python) 1000 times, does it fail?
It seems the current modification still can't get rid of all race conditions in the code. The reason is that we want to reorder the data of the weight arrays in place so that we can avoid reordering them again during inference. Right now, if I run the tests hundreds of times, I might see an inconsistency once. @piiswrong and I have discussed a solution to get rid of the remaining race conditions. It should work.
One thing I've done in the past to try to make race conditions happen more often is to change the process affinity mask, for example forcing the process to use only one or two CPUs. This is separate from OMP.
Is this need to reorder data documented somewhere?
@cjolivier01 why does the race condition happen more frequently when threads run on a smaller number of CPU cores? It seems to me that it would happen more often with more CPU cores, because we need two threads to access the same data simultaneously.
@larroy this is the design doc of mkldnn: https://cwiki.apache.org/confluence/display/MXNET/The+design+of+MKLDNN+integration
@zheng-da Well, you can vary the cores up or down just to change the timing and try to make the race condition happen more often by changing the timing to a point where it's especially aggravated. I have found, however, that reducing the number of cores that actually run the program (not the number of threads) can cause more overlap between threads. By this, I mean, each thread gets a smaller timeslice of actual execution, and so things that might "overlap" tend to overlap with more frequency. Think of two trains passing each other in opposite directions (threads). If the trains are going fast, they are only next to each other for a short period of time. But if the two trains are going slowly, then they are next to each other for a much longer period of time.
That's not to say the same trick works for any race condition (fewer cores). But varying the processor affinity in either direction can help to flush out race conditions in general.
@zheng-da, if needed, please also update the design document for mkldnn integration. Thanks.
@TaoLv I have updated the design doc to explain why we need data layout conversion.
CHECK(shandle.size >= pd.get_size());
CheckAndAlloc(pd.get_size());
// TODO(zhengda) We need to avoid memory copy here.
Open TODO
This is from the previous PR. I just moved code here.
There are still two open TODOs. They sound like they could harm performance due to unnecessary memread/copy. If they're not on the critical path or have no big impact on performance I'm fine with merging, otherwise I'd like to see them resolved.
@marcoabreu Reorder2Default and MKLDNNDataReorder shouldn't be called frequently. They are not in the critical path. The whole point of this PR is to further remove the invocation of these two methods. Creating temporary arrays isn't in the critical path either. It's used in a very special case: copying MKLDNN data to GPU memory.
@zheng-da @marcoabreu @cjolivier01 because there are lots of changes to MKL-DNN, please kindly wait a moment before merging. @juliusshufan will test the performance and coverage of these changes.
@cjolivier01 do you have more comments? The PR should have fixed the bug in #9820.
LGTM.
LGTM
@zheng-da @cjolivier01 @piiswrong @pengzhao-intel sorry for the late response. Thanks.
The testing is done. There are NO coverage or performance regressions in our tests. @juliusshufan will post the performance data here soon. Thanks to @zheng-da for the great work.
Thanks a lot everybody!
* Fix a race condition in converting data layouts.
* Avoid calling data() in elemwise sum.
* Fix a compilation error.
* Address comments.
* avoid data layout conversion inside ndarray.
* Fix a compilation error.
* address comments.
* Reorder weight arrays in convolution async.
* Fix async data reordering in NDArray.
* Fix race condition in deconv.
* Update ndarray.cc
* Check more in NDArray.
* Fix a bug in MKLDNNDataReorder.
* Fix a bug in NDArray.
* Simplify weight reorder in (de-)conv.
Description
There is a race condition in data layout conversion in the MKLDNN implementation.
Currently, when NDArray::data() is called, it converts the data format (from the MKLDNN format to the default format) in place inside the NDArray. In the threaded execution engine, while an NDArray is being converted in one thread, the array can also be read by another thread; in that case, the other thread can read wrong data. A similar race condition may also exist in the MKLML integration, which sometimes generates garbage results.
Please see the details of the error in #9820