[SPARK-31676][ML] QuantileDiscretizer raises error "parameter splits given invalid value" (splits array includes -0.0 and 0.0) #28498
Conversation
Test build #122499 has finished for PR 28498 at commit
This came up recently, but I couldn't figure out the cause.
@srowen So if the array includes both 0.0 and -0.0, the array.distinct result will be non-deterministic. I also tried the Java behavior (also wrong):
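A sketch of the kind of check being referred to (the original snippet is not preserved on this page; this is a reconstruction exercising the Java boxed-`Double` semantics from Scala):

```scala
// Primitive comparison says the two zeros are equal...
println(0.0 == -0.0)  // true

// ...but the boxed java.lang.Double values disagree on both equals() and hashCode().
println(java.lang.Double.valueOf(0.0).equals(java.lang.Double.valueOf(-0.0)))  // false
println(java.lang.Double.valueOf(0.0).hashCode)   // 0
println(java.lang.Double.valueOf(-0.0).hashCode)  // -2147483648 (Integer.MIN_VALUE)
```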
I can't reproduce that, though; that's what I'm saying. I made a large array of random doubles plus a 0.0 and a -0.0, and distinct always finds they're the same, returning an array containing 0.0 and not -0.0. I wonder if it's a Scala version thing. The behavior you show is correct, but I'm not sure it's relevant. How do you reproduce the array distinct behavior you mention? I accept something is going on, so this is probably OK as a change; I'm just trying to understand where it happens.
Review comment on mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala (outdated; resolved)
@srowen You can try the `distinct` test code shown in the description below; you will see output like the listing there. Tested on Scala 2.12.
I also filed a bug report against Scala:
Test build #122528 has finished for PR 28498 at commit
OK, I buy it, thanks. My guess is that the behavior is different in Scala 2.13, which changed how it handles floating-point ordering (IEEE vs. total ordering), maybe. I don't think it is necessarily a bug; they are different doubles, but they are equal, and there are two rational outcomes there, which Scala 2.13 now lets you choose between. I think the fix is fine.
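For reference, a sketch of the Scala 2.13 distinction being alluded to (assuming 2.13's `scala.math.Ordering.Double`, which is not part of this PR):

```scala
// Scala 2.13 makes the two rational choices explicit: an IEEE ordering,
// under which -0.0 and 0.0 are equivalent, and a total ordering, under
// which they are not.
import scala.math.Ordering.Double.{IeeeOrdering, TotalOrdering}

println(IeeeOrdering.equiv(-0.0, 0.0))   // true: IEEE == semantics
println(TotalOrdering.equiv(-0.0, 0.0))  // false: the total order distinguishes the zeros
```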
Review comment on mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala (outdated; resolved)
Test build #122581 has finished for PR 28498 at commit
Test build #122588 has finished for PR 28498 at commit
[SPARK-31676][ML] QuantileDiscretizer raises error "parameter splits given invalid value" (splits array includes -0.0 and 0.0)
### What changes were proposed in this pull request?
In `QuantileDiscretizer.getDistinctSplits`, before invoking `distinct`, normalize every -0.0 in the splits array to 0.0:
```scala
// Normalize -0.0 to 0.0 before deduplicating. Note that the condition uses
// primitive ==, under which 0.0 == -0.0 is true, so existing 0.0 entries are
// (harmlessly) reassigned as well.
for (i <- 0 until splits.length) {
  if (splits(i) == -0.0) {
    splits(i) = 0.0
  }
}
```
### Why are the changes needed?
Fixes a bug: `QuantileDiscretizer` fails when the computed splits contain both -0.0 and 0.0.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
#### Manual test:
~~~scala
import scala.util.Random
import org.apache.spark.ml.feature.QuantileDiscretizer
import spark.implicits._

val rng = new Random(3)
// 200 random doubles in (-1, 1), plus 20 copies each of 0.0 and -0.0.
val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++
  Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
val df1 = sc.parallelize(a1, 2).toDF("id")
val qd = new QuantileDiscretizer()
  .setInputCol("id")
  .setOutputCol("out")
  .setNumBuckets(200)
  .setRelativeError(0.0)
val model = qd.fit(df1) // raises an error on Spark master before this fix
~~~
### Explanation
In Scala, `0.0 == -0.0` is true, but `0.0.hashCode == (-0.0).hashCode` is false. This breaks the contract between equals() and hashCode(): if two objects are equal, they must have the same hash code. `array.distinct` relies on the elements' hash codes, so this mismatch leads to the error.
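A minimal illustration (results as reported in this thread; whether both zeros survive `distinct` depends on the Scala version):

```scala
// On Scala 2.12, both zeros can survive distinct, because the dedup step
// compares boxed values, which are not equal under equals().
val zeros = Array(0.0, -0.0)
println(zeros.distinct.toList)  // List(0.0, -0.0) on Scala 2.12; a single zero where they dedupe
```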
Test code for `distinct`:
```scala
import scala.util.Random

val rng = new Random(3)
val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++
  Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
a1.distinct.sorted.foreach(x => println(x))
```
Then you will see output like:
```
...
-0.009292684662246975
-0.0033280686465135823
-0.0
0.0
0.0022219556032221366
0.02217419561977274
...
```
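For context on why the duplicate zeros cause the reported failure: the resulting splits are validated as strictly increasing (this is the `splits` parameter check that raises the error), and under primitive comparison the two zeros are equal, so no strict ordering exists between them:

```scala
// -0.0 and 0.0 compare as equal, so a splits array containing both
// can never be strictly increasing, and validation fails.
println(-0.0 < 0.0)   // false
println(-0.0 == 0.0)  // true
```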
Closes #28498 from WeichenXu123/SPARK-31676.
Authored-by: Weichen Xu <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit b2300fc)
Signed-off-by: Sean Owen <[email protected]>
Merged to master/3.0/2.4