This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

A potential race condition in the executor or engine. #10865

Closed
zheng-da opened this issue May 9, 2018 · 6 comments

Comments

@zheng-da (Contributor) commented May 9, 2018

Previously, we encountered a memory error. It was caused by a race condition in which the MKLDNN memory in an output NDArray was removed while an MKLDNN operator was trying to read the MKLDNN memory from its input arrays. The error was temporarily fixed in #10651.

This error can be reproduced with the following commands:

export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
export MXNET_TEST_SEED=11
export MXNET_MODULE_SEED=812478194
export MXNET_TEST_COUNT=10000
nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape

However, the race condition shouldn't happen. The execution engine schedules computation based on data dependencies: while an operator is scheduled to write data to an output NDArray, no operator that reads data from that NDArray should be scheduled for execution. But we actually observe that the input array of an operator is modified while the operator is running, which suggests that the race condition can corrupt data in an input NDArray even without MKLDNN; it is just harder to notice.
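
For reference, here is a minimal sketch of the read-after-write ordering the engine is expected to enforce; the arrays and operations are illustrative and not taken from the failing test, and it assumes a working mxnet installation:

import mxnet as mx

a = mx.nd.ones((2, 2))   # an asynchronous operation that writes a
b = a * 2                # reads a, writes b: may only run after the write to a completes
c = mx.nd.dot(b, b)      # reads b, writes c: may only run after the write to b completes
mx.nd.waitall()          # block until every operation pushed to the engine has finished
print(c.asnumpy())       # if the ordering is respected, neither a nor b was modified while being read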

@eric-haibin-lin (Member)

Which operator is this? Any op registered with nnvm::FMutateInputs can modify its inputs.

@zheng-da (Contributor, Author)

The problem I observed is in convolution. Its input shouldn't be modified.

Even if the inputs are mutable, modifying the output array of an operator shouldn't affect the inputs of another operator.
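
A minimal sketch of that check (run a convolution, then verify its input data is untouched), assuming a standard mxnet build; the shapes and names are illustrative:

import mxnet as mx
import numpy as np

x = mx.nd.random.uniform(shape=(1, 3, 8, 8))
w = mx.nd.random.uniform(shape=(4, 3, 3, 3))
b = mx.nd.zeros((4,))

before = x.asnumpy().copy()                 # snapshot the input data
y = mx.nd.Convolution(data=x, weight=w, bias=b, kernel=(3, 3), num_filter=4)
mx.nd.waitall()                             # wait for the convolution to finish
assert np.array_equal(before, x.asnumpy())  # the input of Convolution must not have been modified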

@azai91 (Contributor) commented Jul 9, 2018

@zheng-da I ran the above code on the latest master and was not able to reproduce the issue. Can you check whether this is still a problem?

@zheng-da (Contributor, Author)

@azai91 the problem is only partly fixed. The fix just solves the problem in this test; there still exists a race condition somewhere, and I haven't been able to figure out where.

@al-rigazzi

@zheng-da sometimes, when using very large batches, I observe NaN values at the first training iteration. The phenomenon is much more frequent when I use more OMP threads and the network is large. For example, if I use more than 20 OMP threads with VGG-16 and 1024 samples per batch (on a single node), I get NaNs about 10% of the time.

I think this could be due to a race condition when allocating/copying MKLDNN memory. Do you think that makes sense? Do you know which functions I should try to monitor to find the root of the problem?

Thanks,
Al
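
As context for the report above, here is a rough sketch of the kind of check that exposes the symptom; the model, batch size, and thread count are only illustrative (a real run would use the full training setup and a much larger batch), and a forward pass stands in for the first training iteration:

import os
os.environ["OMP_NUM_THREADS"] = "24"   # set before importing mxnet; more OMP threads make the symptom more frequent

import mxnet as mx
import numpy as np
from mxnet.gluon.model_zoo import vision

net = vision.vgg16(pretrained=False)
net.initialize()
data = mx.nd.random.uniform(shape=(32, 3, 224, 224))  # the original report used 1024 samples per batch
out = net(data)
mx.nd.waitall()
assert not np.isnan(out.asnumpy()).any()  # fails if the forward pass produced NaNs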

@bgawrych (Contributor) commented Jul 6, 2021

@szha I can't reproduce this issue; I think we can close it as outdated and not reproducible.

@leezu closed this as completed Jul 6, 2021