[SPARK-31676][ML] QuantileDiscretizer raises error "parameter splits given invalid value" (splits array includes -0.0 and 0.0) #28498
Conversation
Test build #122499 has finished for PR 28498 at commit
This came up recently, but I couldn't figure out the cause.
@srowen So if the array includes both 0.0 and -0.0, the array.distinct result will be non-deterministic. I also tried the Java behavior (also wrong):
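A sketch of the kind of check being referred to (the original snippet is not preserved on this page; this is a reconstruction exercising the Java boxed-`Double` semantics from Scala):

```scala
// Primitive comparison says the two zeros are equal...
println(0.0 == -0.0)  // true

// ...but the boxed java.lang.Double values disagree on both equals() and hashCode().
println(java.lang.Double.valueOf(0.0).equals(java.lang.Double.valueOf(-0.0)))  // false
println(java.lang.Double.valueOf(0.0).hashCode)   // 0
println(java.lang.Double.valueOf(-0.0).hashCode)  // -2147483648 (Integer.MIN_VALUE)
```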
I can't reproduce that, though; that's what I'm saying. I made a large array of random doubles plus a 0.0 and a -0.0, and distinct always finds they're the same, returning an array containing 0.0 and not -0.0. I wonder if it's a Scala version thing. The behavior you show is correct, but I'm not sure it's relevant. How do you reproduce the array distinct behavior you mention? I accept something is going on, so this is probably OK as a change; I'm just trying to understand where it happens.
Review comment on mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala (outdated; resolved)
@srowen You can try the `distinct` test code shown in the description below; you will see output like the listing there. Tested on Scala 2.12.
I also filed a bug report against Scala:
Test build #122528 has finished for PR 28498 at commit
OK, I buy it, thanks. My guess is that the behavior is different in Scala 2.13, which changed how it handles floating-point ordering (IEEE vs. total ordering), maybe. I don't think it is necessarily a bug; they are different doubles, but they are equal, and there are two rational outcomes there, which Scala 2.13 now lets you choose between. I think the fix is fine.
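For reference, a sketch of the Scala 2.13 distinction being alluded to (assuming 2.13's `scala.math.Ordering.Double`, which is not part of this PR):

```scala
// Scala 2.13 makes the two rational choices explicit: an IEEE ordering,
// under which -0.0 and 0.0 are equivalent, and a total ordering, under
// which they are not.
import scala.math.Ordering.Double.{IeeeOrdering, TotalOrdering}

println(IeeeOrdering.equiv(-0.0, 0.0))   // true: IEEE == semantics
println(TotalOrdering.equiv(-0.0, 0.0))  // false: the total order distinguishes the zeros
```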
Review comment on mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala (outdated; resolved)
Test build #122581 has finished for PR 28498 at commit
Test build #122588 has finished for PR 28498 at commit
[SPARK-31676][ML] QuantileDiscretizer raises error "parameter splits given invalid value" (splits array includes -0.0 and 0.0)
### What changes were proposed in this pull request?
In `QuantileDiscretizer.getDistinctSplits`, before invoking `distinct`, normalize every -0.0 in the splits array to 0.0:
```scala
// Normalize -0.0 to 0.0 before deduplicating. Note that the condition uses
// primitive ==, under which 0.0 == -0.0 is true, so existing 0.0 entries are
// (harmlessly) reassigned as well.
for (i <- 0 until splits.length) {
  if (splits(i) == -0.0) {
    splits(i) = 0.0
  }
}
```
### Why are the changes needed?
Fixes a bug: `QuantileDiscretizer` fails when the computed splits contain both -0.0 and 0.0.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
#### Manual test:
~~~scala
import scala.util.Random
import org.apache.spark.ml.feature.QuantileDiscretizer
import spark.implicits._

val rng = new Random(3)
// 200 random doubles in (-1, 1), plus 20 copies each of 0.0 and -0.0.
val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++
  Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
val df1 = sc.parallelize(a1, 2).toDF("id")
val qd = new QuantileDiscretizer()
  .setInputCol("id")
  .setOutputCol("out")
  .setNumBuckets(200)
  .setRelativeError(0.0)
val model = qd.fit(df1) // raises an error on Spark master before this fix
~~~
### Explanation
In Scala, `0.0 == -0.0` is true, but `0.0.hashCode == (-0.0).hashCode` is false. This breaks the contract between equals() and hashCode(): if two objects are equal, they must have the same hash code. `array.distinct` relies on the elements' hash codes, so this mismatch leads to the error.
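A minimal illustration (results as reported in this thread; whether both zeros survive `distinct` depends on the Scala version):

```scala
// On Scala 2.12, both zeros can survive distinct, because the dedup step
// compares boxed values, which are not equal under equals().
val zeros = Array(0.0, -0.0)
println(zeros.distinct.toList)  // List(0.0, -0.0) on Scala 2.12; a single zero where they dedupe
```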
Test code for `distinct`:
```scala
import scala.util.Random

val rng = new Random(3)
val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++
  Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
a1.distinct.sorted.foreach(x => println(x))
```
Then you will see output like:
```
...
-0.009292684662246975
-0.0033280686465135823
-0.0
0.0
0.0022219556032221366
0.02217419561977274
...
```
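For context on why the duplicate zeros cause the reported failure: the resulting splits are validated as strictly increasing (this is the `splits` parameter check that raises the error), and under primitive comparison the two zeros are equal, so no strict ordering exists between them:

```scala
// -0.0 and 0.0 compare as equal, so a splits array containing both
// can never be strictly increasing, and validation fails.
println(-0.0 < 0.0)   // false
println(-0.0 == 0.0)  // true
```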
Closes #28498 from WeichenXu123/SPARK-31676.
Authored-by: Weichen Xu <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit b2300fc)
Signed-off-by: Sean Owen <[email protected]>
Merged to master/3.0/2.4