
Commit b2300fc

WeichenXu123 authored and srowen committed
[SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0)
### What changes were proposed in this pull request?

In `QuantileDiscretizer.getDistinctSplits`, before invoking `distinct`, normalize all -0.0 values to 0.0:

```scala
for (i <- 0 until splits.length) {
  if (splits(i) == -0.0) {
    splits(i) = 0.0
  }
}
```

### Why are the changes needed?

Fixes a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

#### Manual test:

```scala
import scala.util.Random
val rng = new Random(3)
val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++
  Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)

import spark.implicits._
val df1 = sc.parallelize(a1, 2).toDF("id")

import org.apache.spark.ml.feature.QuantileDiscretizer
val qd = new QuantileDiscretizer()
  .setInputCol("id")
  .setOutputCol("out")
  .setNumBuckets(200)
  .setRelativeError(0.0)
val model = qd.fit(df1) // raises an error on current Spark master.
```

### Explanation

In Scala, `0.0 == -0.0` is true, but `0.0.hashCode == -0.0.hashCode()` is false. This breaks the contract between `equals()` and `hashCode()`: if two objects are equal, they must have the same hash code. `array.distinct` relies on the elements' `hashCode`, which leads to this error.

Test code for `distinct`:

```scala
import scala.util.Random
val rng = new Random(3)
val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++
  Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
a1.distinct.sorted.foreach(x => print(x.toString + "\n"))
```

The output then contains both zeros:

```
...
-0.009292684662246975
-0.0033280686465135823
-0.0
0.0
0.0022219556032221366
0.02217419561977274
...
```

Closes #28498 from WeichenXu123/SPARK-31676.

Authored-by: Weichen Xu <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
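The equals/hashCode mismatch above can be reproduced in a few lines. This is a minimal sketch (the object name `ZeroHashDemo` is hypothetical); it uses `java.lang.Double.hashCode`, which is what boxed Scala doubles delegate to:

```scala
// Sketch: 0.0 and -0.0 compare equal under IEEE 754, but their boxed
// hash codes differ because the sign bit survives doubleToLongBits.
object ZeroHashDemo {
  def main(args: Array[String]): Unit = {
    println(0.0 == -0.0)                      // true
    println(java.lang.Double.hashCode(0.0))   // 0
    println(java.lang.Double.hashCode(-0.0))  // Int.MinValue (only the sign bit is set)
    // An equal pair with unequal hash codes breaks hash-based deduplication,
    // which is what array.distinct relies on in Scala <= 2.12.
  }
}
```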
1 parent ddbce4e commit b2300fc

File tree

2 files changed: +30 −0 lines changed


mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

Lines changed: 12 additions & 0 deletions

```diff
@@ -236,6 +236,18 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui
   private def getDistinctSplits(splits: Array[Double]): Array[Double] = {
     splits(0) = Double.NegativeInfinity
     splits(splits.length - 1) = Double.PositiveInfinity
+
+    // 0.0 and -0.0 are distinct values, and array.distinct will preserve both of them,
+    // but 0.0 > -0.0 is false, which will break the parameter validation checking.
+    // Also, in Scala <= 2.12, there is a bug which causes array.distinct to generate
+    // non-deterministic results when the array contains both 0.0 and -0.0.
+    // So here we first normalize all -0.0 values to be 0.0.
+    // See https://github.com/scala/bug/issues/11995
+    for (i <- 0 until splits.length) {
+      if (splits(i) == -0.0) {
+        splits(i) = 0.0
+      }
+    }
     val distinctSplits = splits.distinct
     if (splits.length != distinctSplits.length) {
       log.warn(s"Some quantiles were identical. Bucketing to ${distinctSplits.length - 1}" +
```
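The normalization loop added above can be sketched standalone. The object and helper names below are hypothetical, chosen for illustration; the loop body mirrors the committed change:

```scala
// Sketch of the fix: rewrite every -0.0 to 0.0 before calling distinct,
// so deduplication no longer depends on the sign bit of zero.
object NormalizeSplitsDemo {
  // Hypothetical helper mirroring the loop added in this commit.
  def normalize(splits: Array[Double]): Array[Double] = {
    for (i <- splits.indices) {
      // Note: 0.0 == -0.0 is true, so this branch also matches 0.0,
      // where the assignment is simply a no-op.
      if (splits(i) == -0.0) splits(i) = 0.0
    }
    splits
  }

  def main(args: Array[String]): Unit = {
    val raw = Array(-1.0, -0.0, 0.0, 1.0)
    // Both zeros collapse into a single 0.0 after normalization.
    println(normalize(raw).distinct.mkString(", "))  // -1.0, 0.0, 1.0
  }
}
```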

mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala

Lines changed: 18 additions & 0 deletions

```diff
@@ -512,4 +512,22 @@ class QuantileDiscretizerSuite extends MLTest with DefaultReadWriteTest {
     assert(observedNumBuckets === numBuckets,
       "Observed number of buckets does not equal expected number of buckets.")
   }
+
+  test("[SPARK-31676] QuantileDiscretizer raise error parameter splits given invalid value") {
+    import scala.util.Random
+    val rng = new Random(3)
+
+    val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++
+      Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
+
+    val df1 = sc.parallelize(a1, 2).toDF("id")
+
+    val qd = new QuantileDiscretizer()
+      .setInputCol("id")
+      .setOutputCol("out")
+      .setNumBuckets(200)
+      .setRelativeError(0.0)
+
+    qd.fit(df1) // assert no exception raised here.
+  }
 }
```
