This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Flaky test: test_operator_gpu.test_countsketch #10988

Closed
eric-haibin-lin opened this issue May 17, 2018 · 11 comments

@eric-haibin-lin
Member

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10983/1/pipeline/

test_operator_gpu.test_exc_multiple_waits ... ok (0.0126s)

**test_operator_gpu.test_countsketch ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=0 to reproduce.**

**FAIL**

test_operator_gpu.test_cached ... SKIP: test fails intermittently. temporarily disabled till it gets fixed. tracked at https://github.com/apache/incubator-mxnet/issues/8049

test_operator_gpu.test_sparse_nd_setitem ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=584830846 to reproduce.

ERROR

test_operator_gpu.test_residual ... ERROR

test_operator_gpu.test_lstm_dropout ... ok (0.0739s)

test_operator_gpu.test_parameter_sharing ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1302250019 to reproduce.

ERROR

test_operator_gpu.test_exc_post_fail ... ok (0.0255s)

test_operator_gpu.test_bce_equal_ce2 ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2007597932 to reproduce.

ERROR

test_operator_gpu.test_exc_mutable_var_fail ... ok (0.0051s)

test_operator_gpu.test_sparse_nd_slice ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=28702949 to reproduce.

ERROR

test_operator_gpu.test_logistic_loss_equal_bce ... ERROR

test_operator_gpu.test_ndarray_elementwise ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=0 to reproduce.

ERROR

test_operator_gpu.test_residual_bidirectional ... ERROR

test_operator_gpu.test_kl_loss ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1234 to reproduce.

ERROR

test_operator_gpu.test_parameter_str ... ok (0.0005s)

test_operator_gpu.test_ndarray_elementwisesum ... [06:12:50] src/operator/tensor/./.././../common/../operator/mxnet_op.h:576: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered

/work/runtime_functions.sh: line 507:     7 Aborted                 (core dumped) python3.6 -m "nose" --with-timer --verbose tests/python/gpu
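The `[INFO]` lines in the log above print the seed each test ran with, so a failing run can be replayed by exporting `MXNET_TEST_SEED`. A minimal sketch of that pattern follows. `pick_seed` and `run_with_seed` are hypothetical helpers written for illustration, not MXNet's actual test harness, and the sketch seeds only Python's `random` module:

```python
import os
import random


def pick_seed(env_var="MXNET_TEST_SEED"):
    """Return the seed from the environment if set, else a fresh random one.

    Hypothetical helper for illustration; not MXNet's real test harness.
    """
    value = os.environ.get(env_var)
    if value is not None:
        return int(value)
    return random.randint(0, 2**31 - 1)


def run_with_seed(test_fn, env_var="MXNET_TEST_SEED"):
    """Seed Python's RNG, log the seed, then run the test."""
    seed = pick_seed(env_var)
    print(f"[INFO] Setting test seed, use {env_var}={seed} to reproduce.")
    random.seed(seed)
    return test_fn()


def sample_test():
    # Stand-in for a randomized test body.
    return [random.random() for _ in range(3)]


if __name__ == "__main__":
    # Seed copied from one of the log lines above.
    os.environ["MXNET_TEST_SEED"] = "584830846"
    first = run_with_seed(sample_test)
    second = run_with_seed(sample_test)
    assert first == second  # same seed, same random inputs
```

Pinning the seed only reproduces failures caused by the random inputs. A failure that depends on timing may still come and go under a fixed seed.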


@srochel
Contributor

srochel commented Jun 7, 2018

We should make a focussed effort to resolve all flaky tests in Q3 2018.

@aaronmarkham
Contributor

I think I hit this same issue.

test check_format for sparse ndarray ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1329389823 to reproduce.

FAIL

test_operator_gpu.test_trunc ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=367168103 to reproduce.

ERROR

test_operator_gpu.test_output ... [22:41:32] src/operator/tensor/./.././../common/../operator/mxnet_op.h:576: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11157/5/pipeline

@haojin2
Contributor

haojin2 commented Jun 26, 2018

I was able to reproduce the issue with a different seed. The problem is that the absolute tolerance is very low (1e-12); I've bumped it up to 1e-5 to see whether it can pass 1000 runs.
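To illustrate why such a small absolute tolerance is fragile, here is a minimal sketch of a combined check of the form |a - b| <= atol + rtol * |b|. This is the convention NumPy's `allclose` uses, and it is assumed here to match the test helper. The input values are made up for illustration:

```python
def within_tolerance(a, b, rtol, atol):
    """True when |a - b| <= atol + rtol * |b| (NumPy allclose-style check)."""
    return abs(a - b) <= atol + rtol * abs(b)


# Illustrative values: an expected gradient that is exactly zero, and a
# computed one carrying a small amount of rounding noise.
expected = 0.0
computed = 3e-8

# With b == 0 the relative term vanishes, so only atol can absorb the noise.
print(within_tolerance(computed, expected, rtol=1e-3, atol=1e-12))  # False
print(within_tolerance(computed, expected, rtol=1e-3, atol=1e-5))   # True
```

Entries that should be zero are the weak spot: the relative term contributes nothing there, so the whole budget comes from `atol`.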

@anirudh2290
Member

Assigned to Haibin; @haojin2 is working on this.

@haojin2
Contributor

haojin2 commented Jun 29, 2018

From the reproduced error we can see that only part of the grad ndarray is filled:

======================================================================
FAIL: test_operator_gpu.test_countsketch
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/5-mxnet/tests/python/gpu/../unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/home/ubuntu/5-mxnet/tests/python/gpu/test_operator_gpu.py", line 103, in test_countsketch
    check_countsketch(in_dim, out_dim, n)
  File "/home/ubuntu/5-mxnet/tests/python/gpu/test_operator_gpu.py", line 88, in check_countsketch
    assert_almost_equal(a,arr_grad[0].asnumpy(),rtol=1e-3, atol=1e-5)
  File "/home/ubuntu/6-mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 159853.341227 exceeds tolerance rtol=0.001000, atol=0.000010.  Location of maximum error:(5, 9), a=6.016212, b=-0.027810
 a: array([[-0.08690866,  0.        , -0.        , ...,  0.        ,
         0.        , -0.        ],
       [ 0.        ,  0.        ,  0.        , ..., -0.        ,...
 b: array([[-0.08690866, -3.90360618, -1.36067092, ..., -0.4085128 ,
        -2.49076152,  0.51365918],
       [ 9.16543007, -3.62473965, -6.45960188, ..., -1.51162243,...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=954783568 to reproduce.
common: INFO: 1 of 10000: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1220294681 to reproduce.
common: INFO: 2 of 10000: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1220294681 to reproduce.
common: INFO: 3 of 10000: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1220294681 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 5.961s
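The reported error value looks like a violation ratio, |a - b| / (atol + rtol * |b|), rather than a raw difference. That reading is an assumption inferred from the numbers in the message; the log itself does not state the formula. A quick check with the rounded values printed for location (5, 9):

```python
def violation_ratio(a, b, rtol, atol):
    """How many times |a - b| exceeds the allowed tolerance atol + rtol * |b|."""
    return abs(a - b) / (atol + rtol * abs(b))


# Rounded values copied from the assertion message above.
ratio = violation_ratio(a=6.016212, b=-0.027810, rtol=1e-3, atol=1e-5)
print(f"{ratio:.1f}")
```

The result is about 1.6e5, within one unit of the reported 159853.341227; the small gap is consistent with `a` and `b` being rounded in the printout. A ratio this large is far beyond rounding noise, which fits the observation that part of the gradient array holds different data altogether.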

@haojin2
Contributor

haojin2 commented Jul 18, 2018

Fix in #11780.

@haojin2
Contributor

haojin2 commented Jul 26, 2018

@larroy Please take a look at my comments on how the test could fail without a sync.
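As a rough analogy for the missing-sync failure mode, the sketch below uses plain Python threads, not MXNet's engine or CUDA streams. A writer fills a buffer in two stages, and a reader that copies the buffer before waiting for the writer sees only part of it filled. The two `threading.Event` objects exist only to force that interleaving deterministically; all names are made up for illustration:

```python
import threading


def fill_async(buffer, half_done, may_finish):
    """Fill `buffer` in two stages, like work queued on an asynchronous device."""
    mid = len(buffer) // 2
    for i in range(mid):
        buffer[i] = 1.0
    half_done.set()      # first half written
    may_finish.wait()    # pause until the reader has taken its early snapshot
    for i in range(mid, len(buffer)):
        buffer[i] = 1.0


grad = [0.0] * 8
half_done = threading.Event()
may_finish = threading.Event()
worker = threading.Thread(target=fill_async, args=(grad, half_done, may_finish))
worker.start()

half_done.wait()
early = list(grad)       # read without waiting for the writer to finish
may_finish.set()
worker.join()            # the "sync": block until all writes are done
late = list(grad)

print(early)  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(late)   # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

In a real asynchronous run the interleaving is not forced, so the early read may or may not catch a half-filled buffer, which is the kind of behaviour that makes a test flaky rather than consistently failing.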

@anirudh2290
Member

@larroy
Contributor

larroy commented Jun 20, 2019

But all of them are failing, so this might not be specific to this test; it looks like memory corruption or a hardware issue.

@leezu
Contributor

leezu commented Jul 23, 2020

[2020-07-23T03:30:14.821Z] tests/python/gpu/test_numpy_fallback.py::test_np_fallback_decorator PASSED [ 19%]
[2020-07-23T03:30:16.176Z] tests/python/gpu/test_operator_gpu.py::test_countsketch FAILED           [ 20%]
[2020-07-23T03:30:16.733Z] tests/python/gpu/test_operator_gpu.py::test_multi_sum_sq FAILED          [ 20%]
[2020-07-23T03:30:17.657Z] tests/python/gpu/test_operator_gpu.py::test_fast_lars FAILED             [ 21%]
[2020-07-23T03:30:18.214Z] tests/python/gpu/test_operator_gpu.py::test_batchnorm_with_type FAILED   [ 21%]
[2020-07-23T03:30:18.770Z] tests/python/gpu/test_operator_gpu.py::test_batchnorm_versions FAILED    [ 22%]
[2020-07-23T03:30:19.327Z] tests/python/gpu/test_operator_gpu.py::test_convolution_with_type FAILED [ 22%]
[2020-07-23T03:30:19.327Z] tests/python/gpu/test_operator_gpu.py::test_convolution_options SKIPPED  [ 23%]
[2020-07-23T03:30:20.248Z] tests/python/gpu/test_operator_gpu.py::test_conv_deconv_guards FAILED    [ 23%]
[2020-07-23T03:30:20.501Z] tests/python/gpu/test_operator_gpu.py::test_convolution_large_c FAILED   [ 24%]
[2020-07-23T03:30:21.058Z] tests/python/gpu/test_operator_gpu.py::test_deconvolution_large_c FAILED [ 24%]
[2020-07-23T03:30:21.615Z] tests/python/gpu/test_operator_gpu.py::test_convolution_versions FAILED  [ 25%]
[2020-07-23T03:30:21.869Z] tests/python/gpu/test_operator_gpu.py::test_pooling_nhwc_with_convention FAILED [ 25%]
[2020-07-23T03:30:22.425Z] tests/python/gpu/test_operator_gpu.py::test_pooling_with_type FAILED     [ 26%]
[2020-07-23T03:30:22.679Z] tests/python/gpu/test_operator_gpu.py::test_deconvolution_with_type FAILED [ 26%]
[2020-07-23T03:30:23.236Z] tests/python/gpu/test_operator_gpu.py::test_deconvolution_options FAILED [ 27%]
[2020-07-23T03:30:23.793Z] tests/python/gpu/test_operator_gpu.py::test_pooling_versions FAILED      [ 27%]
[2020-07-23T03:30:24.350Z] tests/python/gpu/test_operator_gpu.py::test_flatten_slice_after_conv FAILED [ 28%]
[2020-07-23T03:30:24.604Z] tests/python/gpu/test_operator_gpu.py::test_global_pooling FAILED        [ 28%]
[2020-07-23T03:30:25.160Z] tests/python/gpu/test_operator_gpu.py::test_psroipooling_with_type FAILED [ 29%]
[2020-07-23T03:30:25.717Z] tests/python/gpu/test_operator_gpu.py::test_deformable_psroipooling_with_type FAILED [ 29%]
[2020-07-23T03:30:26.274Z] tests/python/gpu/test_operator_gpu.py::test_deformable_convolution_with_type FAILED [ 30%]
[2020-07-23T03:30:27.195Z] tests/python/gpu/test_operator_gpu.py::test_sequence_reverse FAILED      [ 30%]
[2020-07-23T03:30:27.195Z] tests/python/gpu/test_operator_gpu.py::test_autograd_save_memory FAILED  [ 31%]
[2020-07-23T03:30:27.752Z] tests/python/gpu/test_operator_gpu.py::test_cuda_rtc FAILED              [ 31%]
[2020-07-23T03:30:28.005Z] tests/python/gpu/test_operator_gpu.py::test_cross_device_autograd FAILED [ 32%]
[2020-07-23T03:30:28.562Z] tests/python/gpu/test_operator_gpu.py::test_multi_proposal_op FAILED     [ 32%]
[2020-07-23T03:30:28.562Z] Fatal Python error: Aborted
[2020-07-23T03:30:28.562Z] 
[2020-07-23T03:30:28.562Z] Current thread 0x00007f7922445740 (most recent call first):
[2020-07-23T03:30:28.562Z]   File "/usr/lib/python3.6/multiprocessing/util.py", line 417 in spawnv_passfds
[2020-07-23T03:30:28.562Z]   File "/usr/lib/python3.6/multiprocessing/semaphore_tracker.py", line 71 in ensure_running
[2020-07-23T03:30:28.562Z]   File "/usr/lib/python3.6/multiprocessing/semaphore_tracker.py", line 35 in getfd
[2020-07-23T03:30:28.562Z]   File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 40 in _launch
[2020-07-23T03:30:28.562Z]   File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19 in __init__
[2020-07-23T03:30:28.562Z]   File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32 in __init__
[2020-07-23T03:30:28.562Z]   File "/usr/lib/python3.6/multiprocessing/context.py", line 284 in _Popen
[2020-07-23T03:30:28.562Z]   File "/usr/lib/python3.6/multiprocessing/process.py", line 105 in start
[2020-07-23T03:30:28.562Z]   File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 2042 in test_kernel_error_checking
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 167 in pytest_pyfunc_call
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 1445 in runtest
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 134 in pytest_runtest_call
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in <lambda>
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 237 in from_call
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in call_runtest_hook
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/flaky/flaky_pytest_plugin.py", line 129 in call_and_report
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 99 in runtestprotocol
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 84 in pytest_runtest_protocol
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/flaky/flaky_pytest_plugin.py", line 92 in pytest_runtest_protocol
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 271 in pytest_runtestloop
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 247 in _main
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 197 in wrap_session
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 240 in pytest_cmdline_main
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
[2020-07-23T03:30:28.562Z]   File "/usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py", line 93 in main
[2020-07-23T03:30:28.562Z]   File "/usr/local/bin/pytest", line 11 in <module>
[2020-07-23T03:30:29.483Z] /work/runtime_functions.sh: line 886:  8947 Aborted                 (core dumped) pytest -m 'serial' --durations=50 --cov-report xml:tests_gpu.xml --cov-append --verbose tests/python/gpu

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-gpu/branches/PR-18771/runs/2/nodes/310/steps/340/log/?start=0

@DickJC123
Contributor

Fixed in a 'tack on' commit to #20876.
