This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

A potential race condition in the executor or engine. #10865

Closed
zheng-da opened this issue May 9, 2018 · 6 comments

Comments

@zheng-da (Contributor) commented May 9, 2018

Previously, we encountered a memory error. It was caused by a race condition in which the MKLDNN memory in an output NDArray was removed while an MKLDNN operator was trying to read the MKLDNN memory from its input arrays. The error was temporarily fixed in #10651.

This error can be reproduced with the following commands:

export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
export MXNET_TEST_SEED=11
export MXNET_MODULE_SEED=812478194
export MXNET_TEST_COUNT=10000
nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape

However, the race condition shouldn't happen. The execution engine schedules computation based on data dependencies: while an operator is scheduled to write data to an output NDArray, no operator that reads data from that NDArray should be scheduled for execution. But we actually observe that the input array of an operator is modified while the operator is running, which suggests that the race condition can corrupt data in an input NDArray even without MKLDNN; it is just harder to notice.
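
For reference, here is a minimal sketch of the read-after-write ordering the engine is expected to enforce; the arrays and operations are illustrative and not taken from the failing test, and it assumes a working mxnet installation:

import mxnet as mx

a = mx.nd.ones((2, 2))   # an asynchronous operation that writes a
b = a * 2                # reads a, writes b: may only run after the write to a completes
c = mx.nd.dot(b, b)      # reads b, writes c: may only run after the write to b completes
mx.nd.waitall()          # block until every operation pushed to the engine has finished
print(c.asnumpy())       # if the ordering is respected, neither a nor b was modified while being read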

@eric-haibin-lin (Member)

Which operator is this? Any op registered with nnvm::FMutateInputs can modify its inputs.

@zheng-da (Contributor, Author)

The problem I observed is in convolution. Its input shouldn't be modified.

Even if the inputs are mutable, modifying the output array of an operator shouldn't affect the inputs of another operator.
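
A minimal sketch of that check (run a convolution, then verify its input data is untouched), assuming a standard mxnet build; the shapes and names are illustrative:

import mxnet as mx
import numpy as np

x = mx.nd.random.uniform(shape=(1, 3, 8, 8))
w = mx.nd.random.uniform(shape=(4, 3, 3, 3))
b = mx.nd.zeros((4,))

before = x.asnumpy().copy()                 # snapshot the input data
y = mx.nd.Convolution(data=x, weight=w, bias=b, kernel=(3, 3), num_filter=4)
mx.nd.waitall()                             # wait for the convolution to finish
assert np.array_equal(before, x.asnumpy())  # the input of Convolution must not have been modified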

@azai91 (Contributor) commented Jul 9, 2018

@zheng-da I ran the above code on the latest master and was not able to reproduce the issue. Can you check whether this is still a problem?

@zheng-da (Contributor, Author)

@azai91 the problem is only partly fixed. The fix just solves the problem in this test; there still exists a race condition somewhere, and I haven't been able to figure out where.

@al-rigazzi

@zheng-da sometimes, when using very large batches, I observe NaN values at the first training iteration. The phenomenon is much more frequent when I use more OMP threads and the network is large. For example, if I use more than 20 OMP threads with VGG-16 and 1024 samples per batch (on a single node), I get NaNs about 10% of the time.

I think this could be due to a race condition when allocating/copying MKLDNN memory. Do you think that makes sense? Do you know which functions I should try to monitor to find the root of the problem?

Thanks,
Al
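
As context for the report above, here is a rough sketch of the kind of check that exposes the symptom; the model, batch size, and thread count are only illustrative (a real run would use the full training setup and a much larger batch), and a forward pass stands in for the first training iteration:

import os
os.environ["OMP_NUM_THREADS"] = "24"   # set before importing mxnet; more OMP threads make the symptom more frequent

import mxnet as mx
import numpy as np
from mxnet.gluon.model_zoo import vision

net = vision.vgg16(pretrained=False)
net.initialize()
data = mx.nd.random.uniform(shape=(32, 3, 224, 224))  # the original report used 1024 samples per batch
out = net(data)
mx.nd.waitall()
assert not np.isnan(out.asnumpy()).any()  # fails if the forward pass produced NaNs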

@bgawrych (Contributor) commented Jul 6, 2021

@szha I can't reproduce this issue; I think we can close it as outdated and not reproducible.

@leezu closed this as completed Jul 6, 2021