Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-564] Fix the Flaky test on arange #11377

Merged
merged 7 commits into from
Jun 29, 2018
Merged

Conversation

lanking520
Copy link
Member

@lanking520 lanking520 commented Jun 23, 2018

Description

@nswamy @yzhliu @andrewfayres @anirudh2290
Please see the fix for the issue: #10387

it may help this as well: #8383

After discussion with @andrewfayres , we found the problems was in the Scala section, the last value of the arange is larger than the stop value multiple numbers in the Array are different. We think this should be a bug of Scala as the function is depreciated in the log. As @reminisce recommended, we can increase the minimum value of the step since there is nobody use step in a E-05 level. In summary, there are several ways we can solve this problem:

  • Change the precision level of arange into E-03 and remove the last value of the array. Doesn't help, problem persist

  • Change the way that Scala works into While loops and set precision level to E-06 Doesn't help, problem still there...

  • Change the measurement from reid_diff into almost_equal which mostly used in Python with rtol = true and precision level E-04. Doesn't help, problem still there

  • Use BigDecimal and while loop to test, Passed!

  • Run multiple times with a fixed number in the start, stop and step since the target in here is to check arange is doing well.

We cannot reproduce the same issue with python after 1M tests done by @haojin2 . Python use numpy which takes Float as Double in their calculation with high accuracy. This PR represent as the ticked solution shown above. Currently, I keep 100000 runs in the code to make sure CI will pass as well. Will consider remove that when we merge into master.

Generally speaking, this issue is really rare to come out. I tried 100k and the issue is 70% reproducible. With the amended solution, I ran 10 times 100k and all of them passed. Since we already have a concrete Python test for this operator, consider remove this in the nearer future.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@lanking520 lanking520 requested a review from yzhliu as a code owner June 23, 2018 22:20
@lanking520
Copy link
Member Author

Source to the changes on the number:

  • We use toString on the number that passed to the backend, there should be minimum losses.
  • The number return type are NDArray, the toArray method brings this: DType.Float32 => units.map(wrapBytes(_).getFloat), this should have minimum impact to the result
  • Scala addition method already changed to while loop, but it doesn't help
  • Not only the last number in the Scala Array influence the accuracy, there are more.

Guys, do you think this problem is more like a JVM and C difference question?

@marcoabreu
Copy link
Contributor

Please also note that test_arange is failing in Python: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1053/pipeline/

Could indicate a problem with the underlying implementation.

@lanking520
Copy link
Member Author

Hi @marcoabreu I think this issue is more related to CUDA, please see the message shown below:

Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:unspecified launch failure

@marcoabreu
Copy link
Contributor

Oops, my bad! Thanks for pointing it out. Then it's already tracked at #11395.

@@ -30,4 +30,13 @@ object CheckUtils {
val norm: Float = a.reduce(Math.abs(_) + Math.abs(_))
diff / norm
}

def almost_equal(a: Array[Float], b: Array[Float],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used anywhere, can we bring it when necessary

@@ -233,8 +233,13 @@ class OperatorSuite extends FunSuite with BeforeAndAfterAll
val start = scala.util.Random.nextFloat() * 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good find @lanking520 @andrewfayres, the precision is lost on the decimal part when it overflows with float. Declaring start, stop, step as doubles would just solve the problem. I would use just a double here instead of bigdecimal since its slow and unnecessary and this code pattern would soon start trickling to other tests. See the example that Andrew showed me with floats that doesn't work and max is very very close to upper bound with doubles.

scala> ( 3.969419d until 96.70541d by .00005197525d).flatMap(x=>Array.fill[Double](1)(x)).last
res3: Double = 96.70537523623696

@lanking520
Copy link
Member Author

@nswamy Thanks Naveen for your feedback, I have made the changes accordingly. Tested on local with 100k runs.

@szha szha removed their request for review June 28, 2018 23:33
@nswamy nswamy merged commit c84a2f6 into apache:master Jun 29, 2018
@lanking520 lanking520 deleted the arange-bugfix branch June 29, 2018 21:28
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* Fix Flaky Test - Change float to double to increase the precision of numbers generated in scala
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants