This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Flaky] flaky test in test_operator_gpu.test_convolution_multiple_streams #14329

Open
szha opened this issue Mar 5, 2019 · 35 comments

@szha
Member

szha commented Mar 5, 2019

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test, Flaky

@piyushghai
Contributor

@mxnet-label-bot Add [Test, Flaky]

@DickJC123
Contributor

Looking at this now. The problem occurs during shutdown when MXNET_ENGINE_TYPE=ThreadedEngine (i.e. the ThreadedEnginePooled). I'll first try to determine whether it's related to the dual-stream work itself, or whether the dual-stream PR, by expanding the meager testing of the ThreadedEnginePooled, merely uncovered a pre-existing issue with it.

@DickJC123
Contributor

I believe I've isolated the problem. I started by checking out the commit that introduced the dual stream facility and test (5f32f32). I could run the test 20 times without failure. I then examined behavior under the other 19 commits to master that were made since then. The problem behavior first appears with:

053ffc7f8  2019-03-03  Wang Jiajun            fix memory-related issues to enable ASAN tests (#14223)

I ran the test_convolution_multiple_streams test with the engine list pruned down to just 'ThreadedEngine' (i.e. ThreadedEnginePooled) to speed up testing. It failed at the 6th iteration with a /usr/bin/python3 'double free or corruption (out)' abort. I then backed up my testing to the previous commit:

0e23a18da  2019-03-02  Yuxi Hu                add symbolic link to mkldnn header files in include (#14300)

...and saw no failures after 40 repetitions of the full test_convolution_multiple_streams test. For that run I even kept the mshadow and dmlc-core submodules at the newer versions that the 053ffc7 commit advances them to, so the submodule updates can be ruled out.

Inspecting the change introduced by 053ffc7, I see modifications to threaded_engine.cc as well as many added free/delete operations. From this I conclude that 053ffc7 is the likely culprit and is not compatible with the ThreadedEnginePooled. @szha @eric-haibin-lin @ptrendx

@junrushao
Member

junrushao commented Mar 5, 2019

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-gpu/branches/PR-14192/runs/5/nodes/273/log/?start=0

======================================================================
FAIL: test_operator_gpu.test_convolution_multiple_streams
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 570, in test_convolution_multiple_streams
    {'MXNET_GPU_WORKER_NSTREAMS' : num_streams, 'MXNET_ENGINE_TYPE' : engine})
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 540, in _test_in_separate_process
    assert p.exitcode == 0, "Non-zero exit code %d from %s()." % (p.exitcode, func.__name__)
AssertionError: Non-zero exit code 255 from _conv_with_num_streams().

-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=997087021 to reproduce.
--------------------- >> end captured logging << ---------------------

@szha
Member Author

szha commented Mar 5, 2019

Thanks, @DickJC123. Does it reveal an actual problem, or is it specific to the CI environment? I'd prefer to see this fixed rather than reverting the ASAN enforcement, since that change fixed a number of other problems.

cc @arcadiaphy

@DickJC123
Contributor

To debug this further, I checked out the ASAN PR commit, reverted my dual-stream PR, and then reinserted the test_operator_gpu.py version from that PR. Running the test_convolution_multiple_streams test multiple times then produced segmentation violations and python 'double free' aborts (see below) with the ThreadedEngine only; the test passes solidly when the ThreadedEngine is omitted. I conclude that the dual-stream facility is not part of the issue; rather, the PR added a ThreadedEngine test that is perhaps more stressful than others in the CI.

@szha, if your preference is not to revert the ASAN commit, I could PR a change to test_convolution_multiple_streams so that it tests only the NaiveEngine and ThreadedEnginePerDevice and prints a notice that the ThreadedEngine is skipped. I could then file an issue, to be addressed by @arcadiaphy, to fix the ASAN PR's interaction with the ThreadedEngine. The ThreadedEngine should be considered 'broken' until that is done.

OK?

[INFO] 3 of 10: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=87934937 to reproduce.
FAIL

======================================================================
FAIL: test_operator_gpu.test_convolution_multiple_streams
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "/opt/mxnet/tests/python/gpu/test_operator_gpu.py", line 570, in test_convolution_multiple_streams
    {'MXNET_GPU_WORKER_NSTREAMS' : num_streams, 'MXNET_ENGINE_TYPE' : engine})
  File "/opt/mxnet/tests/python/gpu/test_operator_gpu.py", line 540, in _test_in_separate_process
    assert p.exitcode == 0, "Non-zero exit code %d from %s()." % (p.exitcode, func.__name__)
AssertionError: Non-zero exit code -11 from _conv_with_num_streams().
-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=124093393 to reproduce.
common: INFO: 3 of 10: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=87934937 to reproduce.
--------------------- >> end captured logging << ---------------------

After pruning the engine list to only include ThreadedEngine:

MXNET_TEST_COUNT=10 nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams           
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=549472273 to reproduce.
test_operator_gpu.test_convolution_multiple_streams ... [22:27:47] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:27:50] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:27:52] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:27:55] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:27:56] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:27:59] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:28:01] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:28:04] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:28:05] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:28:08] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:28:10] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:28:13] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
*** Error in `/usr/bin/python': double free or corruption (out): 0x00007f80d04e77d0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f817c0897e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f817c09237a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f817c09653c]
/opt/mxnet/lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager10ReleaseAllEv+0x2f1)[0x7f816b78cef1]
/opt/mxnet/lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManagerD0Ev+0x1a)[0x7f816b79507a]
/opt/mxnet/lib/libmxnet.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x47)[0x7f81684d0f47]
/opt/mxnet/lib/libmxnet.so(_ZN5mxnet11StorageImpl10DirectFreeENS_7Storage6HandleE+0x84)[0x7f816b790ab4]
/opt/mxnet/lib/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextEEZNS0_8resource19ResourceManagerImpl17ResourceTempSpaceILNS0_15ResourceRequest4TypeE1EED4EvEUlS1_E_E9_M_invokeERKSt9_Any_dataOS1_+0x10f)[0x7f816b8ca51f]
/opt/mxnet/lib/libmxnet.so(+0x40da8ba)[0x7f816b7768ba]
/opt/mxnet/lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x956)[0x7f816b770c06]
/opt/mxnet/lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN4dmlc11ManualEventEEEZN5mxnet6engine20ThreadedEnginePooled5StartEvEUlS3_E_E9_M_invokeERKSt9_Any_dataOS3_+0x17d)[0x7f816b77173d]
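
For reference, a minimal, self-contained sketch of the test pattern under discussion (illustrative only: the helper names _test_in_separate_process and _conv_with_num_streams and the env variables come from the tracebacks above, but the convolution workload, shapes, and the use of a 'spawn' context here are placeholder assumptions). It runs each engine/stream combination in a fresh process so the engine-selecting env variables are read at MXNet init, asserts on the child's exit code, and skips 'ThreadedEngine' as proposed above:

# Illustrative sketch only -- not the actual test code.
import multiprocessing as mp
import os

def _conv_with_num_streams(env):
    os.environ.update(env)            # must be set before mxnet is imported
    import mxnet as mx
    data = mx.nd.random.uniform(shape=(1, 3, 8, 8), ctx=mx.gpu(0))
    weight = mx.nd.random.uniform(shape=(4, 3, 3, 3), ctx=mx.gpu(0))
    out = mx.nd.Convolution(data=data, weight=weight, no_bias=True,
                            kernel=(3, 3), num_filter=4)
    out.wait_to_read()

def _test_in_separate_process(func, env):
    # Fresh interpreter so MXNet initializes with the requested engine.
    p = mp.get_context('spawn').Process(target=func, args=(env,))
    p.start()
    p.join()
    assert p.exitcode == 0, "Non-zero exit code %d from %s()." % (p.exitcode, func.__name__)

if __name__ == '__main__':
    # 'ThreadedEngine' (ThreadedEnginePooled) is left out pending a fix.
    for engine in ['NaiveEngine', 'ThreadedEnginePerDevice']:
        for num_streams in ['1', '2']:
            _test_in_separate_process(_conv_with_num_streams,
                                      {'MXNET_GPU_WORKER_NSTREAMS': num_streams,
                                       'MXNET_ENGINE_TYPE': engine})
    print('Skipped engine ThreadedEngine pending a fix (this issue).')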

@szha
Member Author

szha commented Mar 5, 2019

@DickJC123 this plan sounds good. Thanks.

@DickJC123
Contributor

Rather than opening a new issue to track any follow-up PR from @arcadiaphy, I thought it best to continue with this one to keep the history in one place.
My PR #14338 removes the ThreadedEngine from the engine list of test_convolution_multiple_streams, but also makes it easy to test that problem case. After the PR is merged, I would recommend that @arcadiaphy test with:

MXNET_TEST_COUNT=50 MXNET_ENGINE_TYPE=ThreadedEngine nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams

...to verify that the fixes he comes up with cure the present segfaults. Also, to ensure that the remaining engines still work, I recommend:

MXNET_TEST_COUNT=50 nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams

@arcadiaphy
Member

@DickJC123 OK. Let me find out the problem behind the threaded engine crash.

@DickJC123
Contributor

I need to point out that a recently merged "op bulking" PR saw a segfault in either the NaiveEngine or the ThreadedEnginePerDevice. I could not reproduce this on my internal machines after hundreds of trials. I'm hoping that what @arcadiaphy learns in his follow-up work may be relevant beyond the ThreadedEngine problem. See: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14200/9/pipeline

@aaronmarkham
Contributor

@nickguletskii
Contributor

@kshitij12345
Contributor

#16238

@ChaiBapchya
Contributor

@kshitij12345 Good find. Since it's a duplicate (and this one was opened first), should we close the other one, @szha? Or let both remain?

@TaoLv
Member

TaoLv commented Jan 27, 2020

Happened again:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17313/6/pipeline

======================================================================
FAIL: test_operator_gpu.test_convolution_multiple_streams
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 215, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 770, in test_convolution_multiple_streams
    {'MXNET_GPU_WORKER_NSTREAMS' : num_streams, 'MXNET_ENGINE_TYPE' : engine})
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 372, in run_in_spawned_process
    assert p.exitcode == 0, "Non-zero exit code %d from %s()." % (p.exitcode, func.__name__)
AssertionError: Non-zero exit code 1 from _conv_with_num_streams().
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1131332049 to reproduce.
--------------------- >> end captured logging << ---------------------

@ChaiBapchya
Contributor

Another one
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17407/5/pipeline
#17407

@kshitij12345
Contributor

Unable to recreate the failure locally with

MXNET_TEST_COUNT=10 nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams

Can anyone try to see if they can reproduce it locally?

@ChaiBapchya Could you also try it once?

@ChaiBapchya
Contributor

ChaiBapchya commented Feb 3, 2020


@kshitij12345
Ran the above command. Test passed

Finished engine ThreadedEnginePerDevice with 2 streams.
ok

----------------------------------------------------------------------
Ran 1 test in 958.201s

@ptrendx
Member

ptrendx commented Feb 3, 2020

I don't think you will be able to repro this failure on its own, unfortunately - I believe the problem with this test is that it is affected by other tests. What I believe happens is that the other tests use a lot of GPU memory, which ends up cached in the caching memory allocator. This test then launches a new MXNet process (in order to test the effect of env variables that are read only during MXNet init). That process tries to allocate memory and hits an OoM condition. This is not a problem for the other tests, because they can simply free the memory stored inside the caching allocator, but the spawned process does not have that option, so it just fails, which is then reported as a test failure.

I believe the proper way of fixing this would be to move this test to its own command during testing (so that there is no memory allocated by other tests when it runs).
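
As a rough illustration of the mechanism described above (a hypothetical sketch, not the actual tests: the shapes and buffer counts are arbitrary, and whether the spawned child really hits OoM depends on the GPU's total memory): the parent allocates and frees GPU memory, the freed memory stays in MXNet's pooled GPU allocator rather than returning to the driver, and a freshly spawned child process may then fail its own allocation.

# Hypothetical illustration of the failure mode, not a fix.
import multiprocessing as mp

def child():
    import mxnet as mx
    # May fail with an out-of-memory error if the parent process is still
    # holding most of the GPU inside its caching allocator.
    x = mx.nd.zeros((1024, 1024, 64), ctx=mx.gpu(0))
    x.wait_to_read()

if __name__ == '__main__':
    import mxnet as mx
    # Parent allocates and frees a lot of GPU memory (~1 GiB per buffer;
    # scale to the GPU). The freed memory stays cached in MXNet's pooled
    # storage manager instead of being returned to the driver.
    bufs = [mx.nd.zeros((1024, 1024, 256), ctx=mx.gpu(0)) for _ in range(8)]
    mx.nd.waitall()
    del bufs
    p = mp.get_context('spawn').Process(target=child)
    p.start()
    p.join()
    print('child exit code:', p.exitcode)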

@kshitij12345
Contributor

@ChaiBapchya Thanks for confirming.
@ptrendx That makes sense. On digging a bit deeper, I found that test_bulking had similar failures. #14970

Both test_bulking and test_convolution_multiple_streams spawn a new process for the test.

Do you think it is a good idea to move both of them into a new test file?

@ptrendx
Member

ptrendx commented Feb 3, 2020

They should be fine together, yes, but I don't think a new test file alone is enough - it needs to be its own nosetests instance, so that these tests are the only thing launched from that process.
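
For example, a dedicated runner along these lines (illustrative only, not the eventual CI change) would ensure nothing else has filled the parent process's GPU cache before these two tests spawn their child processes:

# Illustrative only: launch the two process-spawning tests in their own
# nosetests instance, separate from the main test_operator_gpu.py run.
import os
import subprocess

subprocess.check_call(
    ['nosetests', '--verbose', '-s',
     'tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams',
     'tests/python/gpu/test_operator_gpu.py:test_bulking'],
    env=dict(os.environ, MXNET_TEST_COUNT='1'))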

@ChaiBapchya
Contributor

@kshitij12345 Since the fix is identified, do you mind leading it? Otherwise, I can take it up. :)

@kshitij12345
Contributor

@ChaiBapchya I'll be busy in the coming weeks, so it would be great if you could take this one.
