A potential race condition in the executor or engine. #10865
Comments
which operator is this? Any op registered with
The problem I observed is in convolution. Its input shouldn't be modified. Even if the inputs are mutable, modifying the output array of an operator shouldn't affect the inputs of another operator.
@zheng-da I ran the above code on the latest master and was not able to reproduce the issue. Can you check whether this is still an issue?
@azai91 The problem is kind of fixed, but the fix only solves the problem in this test. A race condition still exists somewhere; I still can't figure out where.
@zheng-da Sometimes, when using very large batches, I observe NaN values at the first training iteration. This happens much more often when I use more OMP threads and the network is large. For example, if I use more than 20 OMP threads with VGG 16 and 1024 samples per batch (on a single node), I get NaNs 10% of the time. I think this could be due to a race condition when allocating/copying MKLDNN memory. Do you think that makes sense? Do you know which functions I should try to monitor to find the root of the problem? Thanks,
@szha I can't reproduce this issue - I think we can close this one as outdated and not reproducible
Previously, we encountered a memory error. It was caused by a race condition in which the MKLDNN memory in an output NDArray was removed while an MKLDNN operator was trying to read the MKLDNN memory from its input arrays. The error was temporarily fixed in #10651.
This error can be reproduced with the following command:
However, the race condition shouldn't happen. The execution engine schedules computation based on data dependencies. When an operator is scheduled to write data to an output NDArray, no operator that reads data from that NDArray should be scheduled for execution at the same time. But we actually observe that the input array of an operator is modified while the operator is running, which suggests that the race condition can corrupt data in the input NDArray even without MKLDNN; it is just harder to notice.
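To make the expected scheduling rule concrete, here is a minimal sketch of dependency-based scheduling in plain Python. It is a toy model, not the actual engine: the `ToyEngine` class, its `push`/`schedule` methods, and the operator and variable names are all hypothetical and chosen only for illustration. It assumes each operator declares the arrays it reads and writes, and that two operators may run concurrently only if neither writes an array the other touches.

```python
class ToyEngine:
    """Hypothetical toy model of a dependency engine (not MXNet's implementation)."""

    def __init__(self):
        # Operators in push order, each as (name, read_vars, write_vars).
        self.ops = []

    def push(self, name, reads=(), writes=()):
        self.ops.append((name, frozenset(reads), frozenset(writes)))

    @staticmethod
    def conflicts(earlier, later):
        # Two ops conflict if one writes a variable the other reads or writes.
        _, reads_a, writes_a = earlier
        _, reads_b, writes_b = later
        return bool(writes_a & (reads_b | writes_b)) or bool(writes_b & reads_a)

    def schedule(self):
        """Group ops into waves; ops in the same wave may run concurrently."""
        waves = []
        level = {}
        for i, op in enumerate(self.ops):
            lvl = 0
            # An op must run after every earlier op it conflicts with.
            for j in range(i):
                if self.conflicts(self.ops[j], op):
                    lvl = max(lvl, level[j] + 1)
            level[i] = lvl
            while len(waves) <= lvl:
                waves.append([])
            waves[lvl].append(op[0])
        return waves


engine = ToyEngine()
engine.push("conv1", reads={"x", "w1"}, writes={"y"})
engine.push("conv2", reads={"y", "w2"}, writes={"z"})
engine.push("overwrite_y", writes={"y"})          # must wait for conv2 to finish reading y
engine.push("relu_x", reads={"x"}, writes={"r"})  # independent of y, can run early

waves = engine.schedule()
for n, wave in enumerate(waves):
    print(n, wave)
```

In this model `overwrite_y` can never share a wave with `conv2`, because `conv2` reads `y`. The behaviour reported in this issue corresponds to that guarantee being violated somewhere: a write to an array lands while another operator is still reading it.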