[Flaky] flaky test in test_operator_gpu.test_convolution_multiple_streams #14329
Comments
Hey, this is the MXNet Label Bot.
@mxnet-label-bot Add [Test, Flaky]
Looking at this now. The problem occurs during shutdown of the process when MXNET_ENGINE_TYPE=ThreadedEngine (i.e., the ThreadedEnginePooled). I'll first try to determine whether it's related to the dual-stream work itself, or whether the dual-stream PR, by expanding the meager testing of the ThreadedEnginePooled, simply uncovered an existing issue with that engine.
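For readers trying this outside CI: the engine is chosen once per process from the MXNET_ENGINE_TYPE environment variable. The sketch below is illustrative only and is not from the thread; the environment variable name and values are real MXNet settings, while the convolution workload is just a stand-in for the actual multi-stream test.

```python
import os

# MXNET_ENGINE_TYPE is read once when libmxnet initializes, so it must be set
# before the first MXNet import in this process. 'ThreadedEngine' selects the
# pooled threaded engine (ThreadedEnginePooled).
os.environ['MXNET_ENGINE_TYPE'] = 'ThreadedEngine'

import mxnet as mx

# Stand-in GPU workload; the real test runs convolutions with 1 and 2 streams.
data = mx.nd.random.uniform(shape=(8, 3, 32, 32), ctx=mx.gpu(0))
weight = mx.nd.random.uniform(shape=(16, 3, 3, 3), ctx=mx.gpu(0))
out = mx.nd.Convolution(data=data, weight=weight, no_bias=True,
                        kernel=(3, 3), num_filter=16)
out.wait_to_read()
# The reported crash shows up at interpreter shutdown, after the work is done.
```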
I believe I've isolated the problem. I started by checking out the commit that introduced the dual-stream facility and test (5f32f32); I could run the test 20 times without failure. I then examined behavior under the other 19 commits made to master since then. The problem behavior first appears with commit 053ffc7.
For speed of testing, I ran test_convolution_multiple_streams with the engine list pruned down to just 'ThreadedEngine' (i.e., ThreadedEnginePooled); a sketch of that pruning follows this comment. At 053ffc7 the test failed with the same error, while with that commit's changes backed out I saw no failures after 40 repetitions of the full test_convolution_multiple_streams test. I even kept the mshadow and dmlc-core submodules at the versions advanced by 053ffc7. Inspecting the change introduced by 053ffc7, I see modifications to threaded_engine.cc as well as many added free/delete operations. From this I conclude that 053ffc7 is the likely culprit and is not compatible with the ThreadedEnginePooled. @szha @eric-haibin-lin @ptrendx
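The real test_convolution_multiple_streams differs in detail, but the pruning mentioned above might look roughly like the sketch below. It assumes the test loops over an engine list and spawns a fresh process per configuration so that init-time environment variables take effect; the helper name _conv_in_fresh_process and the loop structure are assumptions, not the actual code, and MXNET_GPU_WORKER_NSTREAMS is assumed to be the dual-stream setting.

```python
import multiprocessing as mp
import os

def _conv_in_fresh_process(engine, num_streams):
    # Hypothetical per-engine body: these env vars only take effect at MXNet
    # init, hence a fresh process for every (engine, num_streams) combination.
    os.environ['MXNET_ENGINE_TYPE'] = engine
    os.environ['MXNET_GPU_WORKER_NSTREAMS'] = str(num_streams)
    import mxnet as mx
    x = mx.nd.random.uniform(shape=(8, 3, 32, 32), ctx=mx.gpu(0))
    w = mx.nd.random.uniform(shape=(16, 3, 3, 3), ctx=mx.gpu(0))
    mx.nd.Convolution(data=x, weight=w, no_bias=True,
                      kernel=(3, 3), num_filter=16).wait_to_read()

def test_convolution_multiple_streams():
    # Full sweep would be ['NaiveEngine', 'ThreadedEngine', 'ThreadedEnginePerDevice'];
    # pruned to the failing engine here to speed up the commit bisection.
    for engine in ['ThreadedEngine']:
        for num_streams in [1, 2]:
            p = mp.get_context('spawn').Process(target=_conv_in_fresh_process,
                                                args=(engine, num_streams))
            p.start()
            p.join()
            assert p.exitcode == 0, '%s failed with %d stream(s)' % (engine, num_streams)
```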
Thanks, @DickJC123. Does it reveal an actual problem or is it specific to the CI environment? I'd prefer to see this fixed instead of reverting ASAN enforcement as that change fixed a number of other problems. cc @arcadiaphy
To debug this further, I checked out the ASAN PR commit, reverted my dual-stream PR, and then reinserted only the test_operator_gpu.py version from that PR. Running test_convolution_multiple_streams multiple times then generated segmentation violations and Python 'double free' aborts, but only with the ThreadedEngine; the test passes solidly when the ThreadedEngine is omitted. I conclude that the dual-stream facility itself is not part of the issue, only that the PR came with a ThreadedEngine test that is perhaps more stressful than others in the CI. @szha, if your preference is not to revert the ASAN commit, I could PR a change to test_convolution_multiple_streams so that it tests only the NaiveEngine and ThreadedEnginePerDevice and prints that the ThreadedEngine is skipped (sketched after this comment). I could then file an issue, for @arcadiaphy to respond to, to correct the ASAN PR for the ThreadedEngine. The ThreadedEngine should be considered 'broken' until that is done. OK?
After pruning the engine list to only include ThreadedEngine, the run again ended in the segmentation violation / 'double free' abort.
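As an illustration of the proposed stop-gap (not DickJC123's actual patch): drop the pooled engine from the sweep and make the skip visible in the test log. This reuses the hypothetical _conv_in_fresh_process helper and imports from the sketch above.

```python
# Assumes the imports and the hypothetical _conv_in_fresh_process helper from
# the previous sketch.
def test_convolution_multiple_streams():
    # 'ThreadedEngine' (ThreadedEnginePooled) deliberately omitted until the
    # ASAN-related shutdown crash is fixed; keep the skip visible in the log.
    print('Skipping ThreadedEngine in test_convolution_multiple_streams (known shutdown crash).')
    for engine in ['NaiveEngine', 'ThreadedEnginePerDevice']:
        for num_streams in [1, 2]:
            p = mp.get_context('spawn').Process(target=_conv_in_fresh_process,
                                                args=(engine, num_streams))
            p.start()
            p.join()
            assert p.exitcode == 0, '%s failed with %d stream(s)' % (engine, num_streams)
```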
@DickJC123 This plan sounds good. Thanks.
Rather than open a new issue to track any follow-up PR from @arcadiaphy, I thought it best to continue this one to keep the history.
...to prove that the fixes he comes up with cure the present segfaults. Also, to ensure that the previous engines still work, I recommend:
@DickJC123 OK. Let me find out the problem behind the threaded engine crash.
I need to point out that a recently merged "op bulking" PR saw a segfault in either the NaiveEngine or the ThreadedEnginePerDevice, which I could not reproduce on my internal machines after hundreds of trials. I'm hoping that what @arcadiaphy learns in his follow-up work may have relevance beyond the ThreadedEngine problem. See: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14200/9/pipeline
@kshitij12345 Good find. Since it is a duplicate (and this one was opened first), should we close the other one, @szha, or let both remain?
Happened again:
Unable to recreate the failure locally with
Can anyone try to reproduce it locally? @ChaiBapchya could you also try it once?
@kshitij12345
I don't think you will be able to repro this failure on its own, unfortunately. I believe the problem is that this test is affected by the other tests: they use a lot of GPU memory, which stays cached in the caching memory allocator. This test (to exercise env variables that are read only during MXNet init) then launches a new MXNet process, which tries to allocate memory and hits an out-of-memory condition. This is not a problem for the other tests, because they can simply free memory held inside the caching allocator, but the spawned process does not have that option, so it just fails, which is then reported as a test failure. I believe the proper way of fixing this is to move this test to its own command during testing, so that no memory has been allocated by other tests when it runs.
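An illustrative (and assumed) reconstruction of the failure mode described above, not code from the thread: earlier tests leave GPU memory parked in the parent's caching allocator, and a freshly spawned MXNet process, which starts with an empty pool, can then fail its very first large allocation. The tensor sizes are arbitrary and depend on the GPU.

```python
import multiprocessing as mp
import mxnet as mx

def child():
    # Fresh process, fresh MXNet, empty caching allocator: every byte must come
    # from the CUDA driver. If the parent (and earlier tests) already hold most
    # of the GPU in their memory pool, this allocation can fail with an OoM
    # error that the parent would never see, because the parent could satisfy
    # the same request from its own cache.
    big = mx.nd.zeros((4096, 4096, 64), ctx=mx.gpu(0))   # ~4 GB, adjust per GPU
    big.wait_to_read()

if __name__ == '__main__':
    # Stand-in for "all the other tests": fill the parent's caching allocator.
    hoard = [mx.nd.zeros((4096, 4096, 16), ctx=mx.gpu(0)) for _ in range(8)]  # ~8 GB
    mx.nd.waitall()
    del hoard          # memory returns to MXNet's pool, not to the CUDA driver
    p = mp.get_context('spawn').Process(target=child)
    p.start()
    p.join()
    print('child exit code:', p.exitcode)   # non-zero if the child hit OoM
```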
@ChaiBapchya Thanks for confirming. Do you think it is a good idea to move both of them into a new test file?
They should be fine, yes, but I don't think a new test file is enough - it actually needs to be its own command.
@kshitij12345 Since the fix is identified, do you mind leading it? I can take it up otherwise. :)
@ChaiBapchya I'll be busy in the coming weeks, so it'd be great if you could fix this one.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14324/3/pipeline#step-682-log-1293
@DickJC123