This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix flaky test test_operator_gpu.test_spatial_transformer_with_type (#7645) #11444

Merged
szha merged 1 commit into apache:master from ddavydenko:flaky-fix-7645 on Jun 28, 2018

Conversation

ddavydenko
Contributor


Description

Modified the test_operator_gpu.test_spatial_transformer_with_type test to remove its flakiness.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • removed the predefined seed from the test
  • updated the precision of the data type used in the comparison from float32 to float64 (a sketch of this kind of change is shown below this list)
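
For context, here is a minimal sketch of the kind of change described above. It is not the actual test code from this PR; it assumes the mxnet.test_utils.check_consistency pattern commonly used in tests/python/gpu/test_operator_gpu.py, and the network shapes and names are illustrative only.

```python
import numpy as np
import mxnet as mx
from mxnet.test_utils import check_consistency

def spatial_transformer_consistency_sketch():
    # Build a SpatialTransformer whose localisation network predicts the
    # 6 parameters of an affine transform.
    data = mx.sym.Variable('data')
    loc = mx.sym.FullyConnected(data=mx.sym.Flatten(data=data), num_hidden=6)
    sym = mx.sym.SpatialTransformer(data=data, loc=loc,
                                    target_shape=(10, 10),
                                    transform_type='affine',
                                    sampler_type='bilinear')

    # No mx.random.seed(...) call here: fresh random inputs are drawn on
    # every run instead of relying on one fixed seed.
    ctx_list = [
        # Compare GPU vs CPU outputs in float64 so that rounding error
        # cannot accumulate past the comparison tolerance.
        {'ctx': mx.gpu(0), 'data': (1, 5, 10, 10),
         'type_dict': {'data': np.float64}},
        {'ctx': mx.cpu(0), 'data': (1, 5, 10, 10),
         'type_dict': {'data': np.float64}},
    ]
    check_consistency(sym, ctx_list)
```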

Comments

I confirmed that the original test code caused the test to fail consistently within 100 iterations, with the error specified in the original issue. After the code modification I ran 100,000 iterations and did not observe a single failure.
PYTHONPATH=./python/ MXNET_TEST_COUNT=100000 nosetests --verbose tests/python/gpu/test_operator_gpu.py:test_spatial_transformer_with_type was used to run the test.

@marcoabreu
Contributor

Hi, thanks a lot for fixing this test!

I'm a bit concerned that changing the data type leads to the code taking a different path and that the issue persists with float32. Could another reviewer please help out here? @szha @eric-haibin-lin

@szha
Member

szha commented Jun 28, 2018

@marcoabreu this operator has a type template in it, so it's the same code with a different type (and thus different numeric precision)

@szha szha merged commit f7a0025 into apache:master Jun 28, 2018
@ddavydenko ddavydenko deleted the flaky-fix-7645 branch June 29, 2018 02:41
@anirudh2290
Member

What is the reason for failure with float32?

@ddavydenko
Contributor Author

Probably CO, but anyway... My guess is that there are some cumulative arithmetic operations in which, due to rounding, the precision of the result slowly deteriorates. With the increased precision of the data type, this loss cannot accumulate fast enough to push the difference past the error threshold used when comparing the results against the test data.
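
A standalone numpy illustration (not MXNet or test code) of the effect described above: accumulating many float32 values sequentially lets rounding error grow, while a float64 accumulation of the same data stays essentially exact.

```python
import numpy as np

# Sum the same random data with a sequential float32 accumulator and with
# float64, then compare.  The gap shows how rounding error builds up in the
# lower-precision path.
rng = np.random.default_rng(0)
values = rng.random(300_000)

ref = values.astype(np.float64).sum()        # effectively exact reference

acc32 = np.float32(0.0)
for v in values.astype(np.float32):          # naive sequential accumulation
    acc32 += v

rel_err = abs(float(acc32) - ref) / ref
print(f"relative error of the float32 accumulation: {rel_err:.2e}")
# Typically on the order of 1e-5 here, already larger than a strict float32
# comparison threshold, while a float64 accumulation stays around 1e-16.
```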

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request on Aug 29, 2018
4 participants