[Flaky] flaky test in test_operator_gpu.test_convolution_multiple_streams #14329
Comments
Hey, this is the MXNet Label Bot.
@mxnet-label-bot Add [Test, Flaky]
Looking at this now. The problem occurs during shutdown of the process when MXNET_ENGINE_TYPE=ThreadedEngine (i.e., the ThreadedEnginePooled). I'll first try to determine whether it's related to the dual-stream work itself, or whether the dual-stream PR, by expanding the meager testing of the ThreadedEnginePooled, simply uncovered an existing issue with that engine.
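For readers trying this outside CI: the engine is chosen once per process from the MXNET_ENGINE_TYPE environment variable. The sketch below is illustrative only and is not from the thread; the environment variable name and values are real MXNet settings, while the convolution workload is just a stand-in for the actual multi-stream test.

```python
import os

# MXNET_ENGINE_TYPE is read once when libmxnet initializes, so it must be set
# before the first MXNet import in this process. 'ThreadedEngine' selects the
# pooled threaded engine (ThreadedEnginePooled).
os.environ['MXNET_ENGINE_TYPE'] = 'ThreadedEngine'

import mxnet as mx

# Stand-in GPU workload; the real test runs convolutions with 1 and 2 streams.
data = mx.nd.random.uniform(shape=(8, 3, 32, 32), ctx=mx.gpu(0))
weight = mx.nd.random.uniform(shape=(16, 3, 3, 3), ctx=mx.gpu(0))
out = mx.nd.Convolution(data=data, weight=weight, no_bias=True,
                        kernel=(3, 3), num_filter=16)
out.wait_to_read()
# The reported crash shows up at interpreter shutdown, after the work is done.
```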
I believe I've isolated the problem. I started by checking out the commit that introduced the dual-stream facility and test (5f32f32); I could run the test 20 times without failure. I then examined behavior under the other 19 commits made to master since then. The problem behavior first appears with commit 053ffc7.
For speed of testing, I ran test_convolution_multiple_streams with the engine list pruned down to just 'ThreadedEngine' (i.e., ThreadedEnginePooled); a sketch of that pruning follows this comment. At 053ffc7 the test failed with the same error, while with that commit's changes backed out I saw no failures after 40 repetitions of the full test_convolution_multiple_streams test. I even kept the mshadow and dmlc-core submodules at the versions advanced by 053ffc7. Inspecting the change introduced by 053ffc7, I see modifications to threaded_engine.cc as well as many added free/delete operations. From this I conclude that 053ffc7 is the likely culprit and is not compatible with the ThreadedEnginePooled. @szha @eric-haibin-lin @ptrendx
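The real test_convolution_multiple_streams differs in detail, but the pruning mentioned above might look roughly like the sketch below. It assumes the test loops over an engine list and spawns a fresh process per configuration so that init-time environment variables take effect; the helper name _conv_in_fresh_process and the loop structure are assumptions, not the actual code, and MXNET_GPU_WORKER_NSTREAMS is assumed to be the dual-stream setting.

```python
import multiprocessing as mp
import os

def _conv_in_fresh_process(engine, num_streams):
    # Hypothetical per-engine body: these env vars only take effect at MXNet
    # init, hence a fresh process for every (engine, num_streams) combination.
    os.environ['MXNET_ENGINE_TYPE'] = engine
    os.environ['MXNET_GPU_WORKER_NSTREAMS'] = str(num_streams)
    import mxnet as mx
    x = mx.nd.random.uniform(shape=(8, 3, 32, 32), ctx=mx.gpu(0))
    w = mx.nd.random.uniform(shape=(16, 3, 3, 3), ctx=mx.gpu(0))
    mx.nd.Convolution(data=x, weight=w, no_bias=True,
                      kernel=(3, 3), num_filter=16).wait_to_read()

def test_convolution_multiple_streams():
    # Full sweep would be ['NaiveEngine', 'ThreadedEngine', 'ThreadedEnginePerDevice'];
    # pruned to the failing engine here to speed up the commit bisection.
    for engine in ['ThreadedEngine']:
        for num_streams in [1, 2]:
            p = mp.get_context('spawn').Process(target=_conv_in_fresh_process,
                                                args=(engine, num_streams))
            p.start()
            p.join()
            assert p.exitcode == 0, '%s failed with %d stream(s)' % (engine, num_streams)
```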
Thanks, @DickJC123. Does it reveal an actual problem or is it specific to the CI environment? I'd prefer to see this fixed instead of reverting ASAN enforcement as that change fixed a number of other problems. cc @arcadiaphy
To debug this further, I checked out the ASAN PR commit, reverted my dual-stream PR, and then reinserted only the test_operator_gpu.py version from that PR. Running test_convolution_multiple_streams multiple times then generated segmentation violations and Python 'double free' aborts, but only with the ThreadedEngine; the test passes solidly when the ThreadedEngine is omitted. I conclude that the dual-stream facility itself is not part of the issue, only that the PR came with a ThreadedEngine test that is perhaps more stressful than others in the CI. @szha, if your preference is not to revert the ASAN commit, I could PR a change to test_convolution_multiple_streams so that it tests only the NaiveEngine and ThreadedEnginePerDevice and prints that the ThreadedEngine is skipped (sketched after this comment). I could then file an issue, for @arcadiaphy to respond to, to correct the ASAN PR for the ThreadedEngine. The ThreadedEngine should be considered 'broken' until that is done. OK?
After pruning the engine list to only include ThreadedEngine, the run again ended in the segmentation violation / 'double free' abort.
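As an illustration of the proposed stop-gap (not DickJC123's actual patch): drop the pooled engine from the sweep and make the skip visible in the test log. This reuses the hypothetical _conv_in_fresh_process helper and imports from the sketch above.

```python
# Assumes the imports and the hypothetical _conv_in_fresh_process helper from
# the previous sketch.
def test_convolution_multiple_streams():
    # 'ThreadedEngine' (ThreadedEnginePooled) deliberately omitted until the
    # ASAN-related shutdown crash is fixed; keep the skip visible in the log.
    print('Skipping ThreadedEngine in test_convolution_multiple_streams (known shutdown crash).')
    for engine in ['NaiveEngine', 'ThreadedEnginePerDevice']:
        for num_streams in [1, 2]:
            p = mp.get_context('spawn').Process(target=_conv_in_fresh_process,
                                                args=(engine, num_streams))
            p.start()
            p.join()
            assert p.exitcode == 0, '%s failed with %d stream(s)' % (engine, num_streams)
```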
@DickJC123 This plan sounds good. Thanks.
Rather than open a new issue to track any follow-up PR from @arcadiaphy, I thought it best to continue this one to keep the history.
...to prove that the fixes he comes up with cure the present segfaults. Also, to ensure that the previous engines still work, I recommend:
@DickJC123 OK. Let me find out the problem behind the threaded engine crash.
I need to point out that a recently merged "op bulking" PR saw a segfault in either the NaiveEngine or the ThreadedEnginePerDevice, which I could not reproduce on my internal machines after hundreds of trials. I'm hoping that what @arcadiaphy learns in his follow-up work may have relevance beyond the ThreadedEngine problem. See: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14200/9/pipeline
@kshitij12345 Good find. Since it is a duplicate (and this one was opened first), should we close the other one, @szha, or let both remain?
Happened again:
Unable to recreate the failure locally with
Can anyone try to reproduce it locally? @ChaiBapchya could you also try it once?
@kshitij12345
I don't think you will be able to repro this failure on its own, unfortunately. I believe the problem is that this test is affected by the other tests: they use a lot of GPU memory, which stays cached in the caching memory allocator. This test (to exercise env variables that are read only during MXNet init) then launches a new MXNet process, which tries to allocate memory and hits an out-of-memory condition. This is not a problem for the other tests, because they can simply free memory held inside the caching allocator, but the spawned process does not have that option, so it just fails, which is then reported as a test failure. I believe the proper way of fixing this is to move this test to its own command during testing, so that no memory has been allocated by other tests when it runs.
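An illustrative (and assumed) reconstruction of the failure mode described above, not code from the thread: earlier tests leave GPU memory parked in the parent's caching allocator, and a freshly spawned MXNet process, which starts with an empty pool, can then fail its very first large allocation. The tensor sizes are arbitrary and depend on the GPU.

```python
import multiprocessing as mp
import mxnet as mx

def child():
    # Fresh process, fresh MXNet, empty caching allocator: every byte must come
    # from the CUDA driver. If the parent (and earlier tests) already hold most
    # of the GPU in their memory pool, this allocation can fail with an OoM
    # error that the parent would never see, because the parent could satisfy
    # the same request from its own cache.
    big = mx.nd.zeros((4096, 4096, 64), ctx=mx.gpu(0))   # ~4 GB, adjust per GPU
    big.wait_to_read()

if __name__ == '__main__':
    # Stand-in for "all the other tests": fill the parent's caching allocator.
    hoard = [mx.nd.zeros((4096, 4096, 16), ctx=mx.gpu(0)) for _ in range(8)]  # ~8 GB
    mx.nd.waitall()
    del hoard          # memory returns to MXNet's pool, not to the CUDA driver
    p = mp.get_context('spawn').Process(target=child)
    p.start()
    p.join()
    print('child exit code:', p.exitcode)   # non-zero if the child hit OoM
```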
@ChaiBapchya Thanks for confirming. Do you think it is a good idea to move both of them into a new test file?
They should be fine, yes, but I don't think a new test file is enough - it actually needs to be its own command.
@kshitij12345 Since the fix is identified, do you mind leading it? I can take it up otherwise. :)
@ChaiBapchya I'll be busy in the coming weeks, so it'd be great if you could fix this one.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14324/3/pipeline#step-682-log-1293
@DickJC123