This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Gluon probability flaky tests #18809

Open
ptrendx opened this issue Jul 28, 2020 · 7 comments · Fixed by #18817

@ptrendx
Member

ptrendx commented Jul 28, 2020

Description

test_gluon_probability_v2.py::test_gluon_dirichlet fails with some seeds

Occurrences

http://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/windows-gpu/job/PR-18622/23/display/redirect

What have you tried to solve it?

  1. Tried the nightly build (to eliminate the impact of the PR which showed the failure in CI) with the same seed:
MXNET_TEST_SEED=553606704 pytest -v -s test_gluon_probability_v2.py::test_gluon_dirichlet

Results in error:

E       AssertionError: 
E       Items are not equal:
E       Error nan exceeds tolerance rtol=1.000000e-03, atol=1.000000e-04.
E       
E        ACTUAL: array(nan, dtype=float32)
E        DESIRED: 142.17020331099576
@ptrendx
Member Author

ptrendx commented Jul 28, 2020

@xidulu FYI

@xidulu
Contributor

xidulu commented Jul 29, 2020

@ptrendx
Thanks for pointing this out; looking into it.

@xidulu xidulu mentioned this issue Jul 29, 2020
@pengzhao-intel
Contributor

Is this an MKL build?

@xidulu
Contributor

xidulu commented Jul 29, 2020

@pengzhao-intel
The flakiness is not related to MKLDNN; however, the patch fails on the MKL build:
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18817/1/pipeline/284

@pengzhao-intel
Contributor

> @pengzhao-intel
> The flakiness is not related to MKLDNN; however, the patch fails on the MKL build:
> https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18817/1/pipeline/284

Maybe a re-run will resolve the issue; I haven't seen a related problem until now.

@ptrendx
Member Author

ptrendx commented Aug 3, 2020

@xidulu It seems this is not the only test in the gluon_probability_v1 suite whose results show this kind of huge variance. Here is another failure, this time in test_gluon_negative_binomial_v1: https://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/centos-cpu/branches/PR-18622/runs/23/nodes/175/steps/250/log/?start=0
The error:

[2020-08-03T23:01:27.335Z] >       raise AssertionError(msg)
[2020-08-03T23:01:27.335Z] E       AssertionError: 
[2020-08-03T23:01:27.335Z] E       Items are not equal:
[2020-08-03T23:01:27.335Z] E       Error 1.295010 exceeds tolerance rtol=1.000000e-03, atol=1.000000e-04 (mismatch 16.666667%).
[2020-08-03T23:01:27.335Z] E       Location of maximum error: (1, 2), a=708363968.00000000, b=709282496.00000000
[2020-08-03T23:01:27.335Z] E        ACTUAL: array([[4.0565004e+00, 9.8091322e-01, 3.6568069e+01],
[2020-08-03T23:01:27.335Z] E              [1.4324176e+01, 8.6093795e-01, 7.0836397e+08]], dtype=float32)
[2020-08-03T23:01:27.335Z] E        DESIRED: array([[4.05650234e+00, 9.80913043e-01, 3.65680428e+01],
[2020-08-03T23:01:27.335Z] E              [1.43241749e+01, 8.60937893e-01, 7.09282496e+08]])
[2020-08-03T23:01:27.335Z] 
[2020-08-03T23:01:27.335Z] python/mxnet/test_utils.py:735: AssertionError
[2020-08-03T23:01:27.335Z] ---------------------------- Captured stderr setup -----------------------------
[2020-08-03T23:01:27.335Z] DEBUG:root:np/mx/python random seeds are set to 72758289, use MXNET_TEST_SEED=72758289 to reproduce.
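
For reference, the figures in the log above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the reported "Error" is the ratio |a - b| / (atol + rtol * |b|), which is inferred from the logged numbers rather than taken from test_utils.py, and the helper name `violation` is made up for illustration. The last element of each array uses the exact a and b values from the "Location of maximum error" line.

```python
def violation(a, b, rtol=1e-3, atol=1e-4):
    """Absolute difference divided by the allowed tolerance; > 1 means a failure.

    Assumed formula, inferred from the logged numbers.
    """
    return abs(a - b) / (atol + rtol * abs(b))

# Values copied from the ACTUAL / DESIRED arrays in the log.
actual = [[4.0565004e+00, 9.8091322e-01, 3.6568069e+01],
          [1.4324176e+01, 8.6093795e-01, 708363968.0]]
desired = [[4.05650234e+00, 9.80913043e-01, 3.65680428e+01],
           [1.43241749e+01, 8.60937893e-01, 709282496.0]]

ratios = [violation(a, b)
          for row_a, row_b in zip(actual, desired)
          for a, b in zip(row_a, row_b)]

worst = max(ratios)
mismatch_pct = 100.0 * sum(r > 1.0 for r in ratios) / len(ratios)

print(f"worst violation: {worst:.6f}")   # 1.295010
print(f"mismatch: {mismatch_pct:.6f}%")  # 16.666667%
```

Under that assumption, only one of the six elements (the one near 7.09e+08) is out of tolerance, and it misses by roughly 30% of the allowed margin.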

@xidulu
Contributor

xidulu commented Aug 4, 2020

@ptrendx

Thanks for the comment. I believe the cause lies in the computation of NB's variance, p * r / (1 - p) ^ 2, which can become extremely large as p -> 1. I will create a patch to restrict the range of p.
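
A rough sketch of why this blows up, using only the formula quoted above. The value of r, the sample values of p, and the clamp bounds `P_MIN` / `P_MAX` are hypothetical and chosen purely for illustration; they are not taken from the test or from any patch. The sensitivity term is the derivative of log(variance) with respect to p, i.e. 1/p + 2/(1 - p), which says how strongly a small error in p is amplified in the result.

```python
def nb_variance(r, p):
    """Negative binomial variance in the p * r / (1 - p)^2 form."""
    return p * r / (1.0 - p) ** 2

def relative_sensitivity(p):
    """d(log variance)/dp = 1/p + 2/(1 - p)."""
    return 1.0 / p + 2.0 / (1.0 - p)

r = 5.0  # hypothetical value, for illustration only
for p in (0.5, 0.9, 0.99, 0.999, 0.9999):
    print(f"p={p:<7} variance={nb_variance(r, p):.6g} "
          f"sensitivity={relative_sensitivity(p):.6g}")

# Hypothetical bounds showing what "restricting the range of p" could look like.
P_MIN, P_MAX = 0.1, 0.9

def clamp_p(p, lo=P_MIN, hi=P_MAX):
    """Clip p into [lo, hi] so that 1 - p stays well away from zero."""
    return min(max(p, lo), hi)

print(nb_variance(r, clamp_p(0.9999)))  # variance stays moderate once p is clamped
```

As p approaches 1, both the variance and its sensitivity to rounding in p grow without bound, so a float32 versus float64 discrepancy in p can plausibly exceed rtol=1e-3 for such samples.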

@xidulu xidulu reopened this Aug 4, 2020
@xidulu xidulu changed the title test_gluon_dirichlet intermittently fails on CPU Gluon probability flaky tests. Aug 4, 2020
@xidulu xidulu changed the title Gluon probability flaky tests. Gluon probability flaky tests Aug 4, 2020