
Conversation

@WeichenXu123 (Contributor) commented May 11, 2020

What changes were proposed in this pull request?

In `QuantileDiscretizer.getDistinctSplits`, normalize every -0.0 in the splits array to 0.0 before invoking `distinct`:

```scala
for (i <- 0 until splits.length) {
  if (splits(i) == -0.0) {
    splits(i) = 0.0
  }
}
```
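A standalone JVM sketch of the same normalization, in Java since the underlying behavior is a JVM-level property (the `NormalizeZeros` class and `normalize` method are illustrative names, not Spark code):

```java
import java.util.Arrays;

public class NormalizeZeros {
    // Rewrite every -0.0 to 0.0 so that hash-based deduplication
    // cannot split the two zeros into separate entries.
    static double[] normalize(double[] splits) {
        for (int i = 0; i < splits.length; i++) {
            if (splits[i] == -0.0) {   // also matches +0.0; both map to 0.0
                splits[i] = 0.0;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        double[] splits = {-1.0, -0.0, 0.0, 1.0};
        System.out.println(Arrays.toString(normalize(splits)));
        // [-1.0, 0.0, 0.0, 1.0]
    }
}
```

Note that `splits[i] == -0.0` is also true when `splits[i]` is +0.0, which is harmless here since the assignment is then a no-op at the value level.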

Why are the changes needed?

It fixes a bug: `QuantileDiscretizer.fit` can fail when the computed splits array contains both -0.0 and 0.0.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

Manually test:

```scala
import scala.util.Random
val rng = new Random(3)

val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)

import spark.implicits._
val df1 = sc.parallelize(a1, 2).toDF("id")

import org.apache.spark.ml.feature.QuantileDiscretizer
val qd = new QuantileDiscretizer().setInputCol("id").setOutputCol("out").setNumBuckets(200).setRelativeError(0.0)

val model = qd.fit(df1) // raises an error on current Spark master
```

Explanation

In Scala, `0.0 == -0.0` is true but `0.0.hashCode == (-0.0).hashCode` is false. This breaks the contract between `equals()` and `hashCode()`: if two objects are equal, they must have the same hash code.

`array.distinct` relies on each element's `hashCode`, which leads to this error.
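The mismatch is easy to see at the JVM level. A minimal Java sketch (class name illustrative) of the equal-values/different-hashes behavior that hash-based deduplication trips over:

```java
public class ZeroHashDemo {
    public static void main(String[] args) {
        double pos = 0.0;
        double neg = -0.0;
        // Primitive comparison treats the two zeros as equal...
        System.out.println(pos == neg);            // true
        // ...but their hash codes are derived from the raw bit patterns,
        // which differ in the sign bit.
        System.out.println(Double.hashCode(pos));  // 0
        System.out.println(Double.hashCode(neg));  // -2147483648
        // Boxed equality is stricter still: Double.equals compares raw bits.
        System.out.println(Double.valueOf(pos).equals(Double.valueOf(neg))); // false
    }
}
```

So any structure that first buckets elements by hash code and only then checks equality can end up treating 0.0 and -0.0 as two distinct values.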

Test code on `distinct`:

```scala
import scala.util.Random
val rng = new Random(3)

val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
a1.distinct.sorted.foreach(x => print(x.toString + "\n"))
```

Then you will see output like:

```
...
-0.009292684662246975
-0.0033280686465135823
-0.0
0.0
0.0022219556032221366
0.02217419561977274
...
```
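For comparison, the analogous deduplication through a plain JVM hash set keeps both zeros deterministically, because autoboxing makes it use `Double.equals`, which compares raw bits (a minimal Java sketch, not the Scala reproduction above):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class DistinctZeros {
    public static void main(String[] args) {
        Set<Double> seen = new LinkedHashSet<>();
        for (double d : new double[]{0.5, 0.0, -0.0, 0.0}) {
            seen.add(d); // autoboxes; Double.equals/hashCode treat -0.0 and 0.0 as distinct
        }
        System.out.println(seen); // [0.5, 0.0, -0.0]
    }
}
```

Scala's `distinct` mixes primitive `==` semantics with boxed hash codes, which is why its result depends on iteration order and hash collisions rather than being consistently one zero or two.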

@SparkQA commented May 11, 2020

Test build #122499 has finished for PR 28498 at commit 13eb249.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented May 11, 2020

This came up recently, but I couldn't figure out the cause. Array(0.0, -0.0).distinct has one value. Where is the different hash code used?

@WeichenXu123 (Contributor, Author) commented May 12, 2020

@srowen
This is because in Scala `0.0 == -0.0` is true but `0.0.hashCode == (-0.0).hashCode` is false. This breaks the contract between `equals()` and `hashCode()`: if two objects are equal, they must have the same hash code.

So if an array includes both 0.0 and -0.0, the `array.distinct` result will be non-deterministic.

I also tried the boxed `java.lang.Double` behavior (also wrong):

```scala
new java.lang.Double(0.0) == new java.lang.Double(-0.0)                   // true
new java.lang.Double(0.0).hashCode == new java.lang.Double(-0.0).hashCode // false
```

@srowen (Member) commented May 12, 2020

I can't reproduce that though - that's what I'm saying. I made a large array of random doubles and a 0.0 and -0.0 and it always finds they're the same, returning an array containing 0.0 and not -0.0. I wonder if it's a Scala version thing.

The behavior you show is correct, but I am not sure it's relevant. How do you reproduce the array distinct behavior you mention? I accept something is going on so this is probably OK as a change, just trying to understand where it happens.

@WeichenXu123 (Contributor, Author) commented May 12, 2020

@srowen You can try this:

```scala
import scala.util.Random
val rng = new Random(3)

val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
a1.distinct.sorted.foreach(x => print(x.toString + "\n"))
```

Then you will see output like:

```
...
-0.009292684662246975
-0.0033280686465135823
-0.0
0.0
0.0022219556032221366
0.02217419561977274
...
```

Tested on Scala 2.12.

@WeichenXu123 (Contributor, Author) commented May 12, 2020

I also filed a bug report against Scala:
scala/bug#11995

@SparkQA commented May 12, 2020

Test build #122528 has finished for PR 28498 at commit 00959e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented May 12, 2020

OK, I buy it, thanks. My guess is that the behavior is different in Scala 2.13, which changed how it handles floating-point ordering (IEEE vs. total ordering). I don't think it is necessarily a bug; they are different doubles, but they are equal, and there are two rational outcomes there, which Scala 2.13 now lets you choose between.

I think the fix is fine.

@SparkQA commented May 13, 2020

Test build #122581 has finished for PR 28498 at commit 13df5aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 13, 2020

Test build #122588 has finished for PR 28498 at commit 00b9b7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen closed this in b2300fc May 14, 2020
srowen pushed a commit that referenced this pull request May 14, 2020
…ven invalid value (splits array includes -0.0 and 0.0)

### What changes were proposed in this pull request?

In `QuantileDiscretizer.getDistinctSplits`, normalize every -0.0 in the splits array to 0.0 before invoking `distinct`:
```
    for (i <- 0 until splits.length) {
      if (splits(i) == -0.0) {
        splits(i) = 0.0
      }
    }
```
### Why are the changes needed?
Fix bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

#### Manually test:

~~~scala
import scala.util.Random
val rng = new Random(3)

val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)

import spark.implicits._
val df1 = sc.parallelize(a1, 2).toDF("id")

import org.apache.spark.ml.feature.QuantileDiscretizer
val qd = new QuantileDiscretizer().setInputCol("id").setOutputCol("out").setNumBuckets(200).setRelativeError(0.0)

val model = qd.fit(df1) // will raise error in spark master.
~~~

### Explain
In Scala, `0.0 == -0.0` is true but `0.0.hashCode == (-0.0).hashCode` is false. This breaks the contract between `equals()` and `hashCode()`: if two objects are equal, they must have the same hash code.

And `array.distinct` relies on each element's `hashCode`, which leads to this error.

Test code on distinct
```
import scala.util.Random
val rng = new Random(3)

val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
a1.distinct.sorted.foreach(x => print(x.toString + "\n"))
```

Then you will see output like:
```
...
-0.009292684662246975
-0.0033280686465135823
-0.0
0.0
0.0022219556032221366
0.02217419561977274
...
```

Closes #28498 from WeichenXu123/SPARK-31676.

Authored-by: Weichen Xu <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit b2300fc)
Signed-off-by: Sean Owen <[email protected]>
srowen pushed a commit that referenced this pull request May 14, 2020
…ven invalid value (splits array includes -0.0 and 0.0)

@srowen (Member) commented May 14, 2020

Merged to master/3.0/2.4

