[SPARK-13568] [ML] Create feature transformer to impute missing values #11601
hhbyyh wants to merge 54 commits into apache:master
Conversation
|
Test build #52734 has finished for PR 11601 at commit
|
val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " +
  "If mean, then replace missing values using the mean along the axis. " +
  "If median, then replace missing values using the median along the axis. " +
  "If most, then replace missing values using the most frequent value along the axis.")
Could you add a param validation function since there are a limited number of valid strategies? You can add an attribute like val supportedMissingValueStrategies = Set("mean", "median", "most") to the Imputer companion object, as is done here.

I added the validation to validateParameter (which should be moved, since it's deprecated). Thanks for the suggestion; I'll add them.
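A minimal sketch of the suggested validation pattern, assuming Spark ML's ParamValidators API (the ImputerStrategies and HasImputeStrategy names are illustrative, not the PR's actual code):

```scala
import org.apache.spark.ml.param.{Param, Params, ParamValidators}

// Supported strategy names defined once, backing a validator that runs
// whenever the param is set.
object ImputerStrategies {
  val supported: Array[String] = Array("mean", "median", "most")
}

trait HasImputeStrategy extends Params {
  // ParamValidators.inArray rejects any value outside the supported set.
  val strategy: Param[String] = new Param[String](this, "strategy",
    s"strategy for imputation, one of: ${ImputerStrategies.supported.mkString(", ")}",
    ParamValidators.inArray[String](ImputerStrategies.supported))
}
```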
|
Looking at the JIRAs, it is unclear whether any concrete decisions were made regarding handling Vectors and how NaN values should be handled in colStats. Is there any update? |
|
I prefer to keep Statistics.colStats(rdd) unchanged for now. As the unit tests in this PR suggest, we can cover Double and Vector. |
|
Test build #52842 has finished for PR 11601 at commit
|
val colStatistics = $(strategy) match {
  case "mean" =>
    filteredDF.selectExpr(s"avg($colName)").first().getDouble(0)
  case "median" =>
I think we should favour using the new approxQuantile SQL stat function here rather than computing exactly.
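For reference, a sketch of that suggestion (df and the "value" column name are placeholders):

```scala
// approxQuantile(col, probabilities, relativeError) returns one Double per
// requested probability; 0.5 yields an approximate median.
val median: Double = df.stat.approxQuantile("value", Array(0.5), 0.001).head
```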
|
Test build #53923 has finished for PR 11601 at commit
|
|
Test build #53931 has finished for PR 11601 at commit
|
|
Test build #73268 has started for PR 11601 at commit |
|
Looks like CI was interrupted. |
/** @group getParam */
def getMissingValue: Double = $(missingValue)

/**
Fix comment indentation here.
 * All Null values in the input column are treated as missing, and so are also imputed.
 */
@Experimental
class Imputer @Since("2.1.0")(override val uid: String)
All @Since annotations -> 2.2.0
/**
 * Params for [[Imputer]] and [[ImputerModel]].
 */
private[feature] trait ImputerParams extends Params with HasInputCols with HasOutputCol {
We don't use HasOutputCol anymore, correct?

Sure, however I didn't get your first comment. Do you mean we should remove the import?
object Imputer extends DefaultParamsReadable[Imputer] {

  /** Set of strategy names that Imputer currently supports. */
  private[ml] val supportedStrategyNames = Set("mean", "median")
Could we factor out the mean and median names into private[ml] vals, to be used instead of the raw strings throughout?
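A sketch of that factoring; the companion object shown is an assumption based on the later diff, which references Imputer.mean and Imputer.median:

```scala
object Imputer extends DefaultParamsReadable[Imputer] {
  // Strategy strings defined once, so call sites can match on Imputer.mean
  // and Imputer.median instead of raw literals.
  private[ml] val mean = "mean"
  private[ml] val median = "median"

  /** Set of strategy names that Imputer currently supports. */
  private[ml] val supportedStrategyNames = Set(mean, median)
}
```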
| case "mean" => filtered.select(avg(inputCol)).first().getDouble(0) | ||
| case "median" => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001)(0) | ||
| } | ||
| surrogate.asInstanceOf[Double] |
is the asInstanceOf[Double] necessary here?

no, will remove it.
| test("ImputerModel read/write") { | ||
| val spark = this.spark | ||
| import spark.implicits._ | ||
| val surrogateDF = Seq(1.234).toDF("myInputCol") |
This should be the "surrogate" col name - though I see we don't actually use it in load or transform.

This happens to be the correct column name for now.

Ok - we should add a test here to check that the column names of instance and newInstance match up. (The below check is just for the actual values of the surrogate, correct?)
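A sketch of such a check, assuming the usual read/write test structure where instance is the saved model and newInstance the one loaded back:

```scala
// Verify the persisted surrogateDF keeps its column names across save/load,
// not just its surrogate values.
assert(instance.surrogateDF.columns.toSeq === newInstance.surrogateDF.columns.toSeq)
```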
var outputDF = dataset
val surrogates = surrogateDF.head().getSeq[Double](0)

$(inputCols).indices.foreach { i =>
You could do $(inputCols).zip($(outputCols)).zip(surrogates).map { case ((inputCol, outputCol), icSurrogate) => ...
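Expanded, that suggestion might read like this sketch (the when/otherwise replacement logic is illustrative, not necessarily the PR's exact code):

```scala
import org.apache.spark.sql.functions.{col, when}

// Pair each input column with its output column and surrogate, then thread
// the DataFrame through withColumn once per pair.
val outputDF = $(inputCols).zip($(outputCols)).zip(surrogates)
  .foldLeft(dataset.toDF()) { case (df, ((inputCol, outputCol), icSurrogate)) =>
    val ic = col(inputCol)
    df.withColumn(outputCol,
      when(ic.isNull || ic.isNaN || ic === $(missingValue), icSurrogate).otherwise(ic))
  }
```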
val localOutputCols = $(outputCols)
var outputSchema = schema

$(inputCols).indices.foreach { i =>
Can do $(inputCols).zip($(outputCols)).foreach { case (inputCol, outputCol) => ...
}
val surrogate = $(strategy) match {
  case "mean" => filtered.select(avg(inputCol)).first().getDouble(0)
  case "median" => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001)(0)
 * Model fitted by [[Imputer]].
 *
 * @param surrogateDF Value by which missing values in the input columns will be replaced. This
 *   is stored using DataFrame with input column names and the corresponding surrogates.
This is misleading - you're just storing the array of surrogates... did you mean something different? Otherwise the comment must be changed.

It sounds like you had the idea of storing the surrogates something like:
+------+---------+
|column|surrogate|
+------+---------+
|  col1|      1.2|
|  col2|      3.4|
|  col3|      5.4|
+------+---------+
?

I refactored it a little for better extensibility.
| inputCol1 | inputCol2 |
|---|---|
| surrogate1 | surrogate2 |
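For illustration, one way such a single-row, wide surrogate DataFrame could be built (column names and values here are placeholders):

```scala
import spark.implicits._

// One row; one column per input column, holding that column's surrogate.
val surrogateDF = Seq((1.2, 3.4)).toDF("inputCol1", "inputCol2")
```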
|
jenkins retest this please |
|
Test build #73753 has finished for PR 11601 at commit
|
|
Thanks a lot for making a pass @MLnick. The last update mainly focused on the interface and behavior change. I'll make a pass and also address your comments. |
|
Hi @MLnick I changed the surrogateDF format for better extensibility in the last update and added unit tests for multi-column support. Let me know if I missed anything.
|
|
Test build #73868 has finished for PR 11601 at commit
|
MLnick left a comment:
Made a pass. A few minor comments.
 * The imputation strategy.
 * If "mean", then replace missing values using the mean value of the feature.
 * If "median", then replace missing values using the approximate median value of the
 * feature (relative error less than 0.001).
I think we should remove the part "(relative error less than 0.001)" - this can be moved to the overall ScalaDoc for Imputer at L95.
/**
 * :: Experimental ::
 * Imputation estimator for completing missing values, either using the mean or the median
 * of the column in which the missing values are located. The input column should be of
As mentioned above at https://github.com/apache/spark/pull/11601/files#r104403880, you can add the note about relative error here. Something like "For computing median, approxQuantile is used with a relative error of X" (provide a ScalaDoc link to approxQuantile).

I didn't add the link as it may break Javadoc generation.

Ah right - perhaps just mention using approxQuantile?
@Since("2.2.0")
def setMissingValue(value: Double): this.type = set(missingValue, value)

import org.apache.spark.ml.feature.Imputer._
This import should probably be above with the others (or within fit)
}
val surrogate = $(strategy) match {
  case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
  case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
Not really sure about the relative error here - perhaps 0.01 is sufficient?

Later perhaps we can even expose it as an expert param (but not for now).

I tried it before. 0.01 and 0.001 actually take the same time, even for a large dataset. Agree we can make it a param later.
override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  var outputDF = dataset
  val surrogates = surrogateDF.select($(inputCols).head, $(inputCols).tail: _*).head().toSeq
Maybe this is slightly cleaner: surrogateDF.select($(inputCols).map(col): _*)
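In context, the suggested form would read roughly like this sketch:

```scala
import org.apache.spark.sql.functions.col

// Select the surrogate columns in inputCols order without head/tail splitting.
val surrogates = surrogateDF.select($(inputCols).map(col): _*).head().toSeq
```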
  .setInputCols(Array("value1", "value2"))
  .setOutputCols(Array("out1"))
  .setStrategy(strategy)
intercept[IllegalArgumentException] {
Also test for the thrown message here, using withClue.
| test("ImputerModel read/write") { | ||
| val spark = this.spark | ||
| import spark.implicits._ | ||
| val surrogateDF = Seq(1.234).toDF("myInputCol") |
There was a problem hiding this comment.
Ok - we should add a test here to check the column names of instance and newInstance match up? (The below check is just for the actual values of the surrogate, correct?
}

object ImputerSuite {
| Seq("mean", "median").foreach { strategy => | ||
| val imputer = new Imputer().setInputCols(Array("value")).setOutputCols(Array("out")) | ||
| .setStrategy(strategy) | ||
| intercept[SparkException] { |
)).toDF("id", "value1", "value2", "value3")
Seq("mean", "median").foreach { strategy =>
  // inputCols and outCols length different
  val imputer = new Imputer()
You can also perhaps use withClue to put a message on the subtest / exception assertion, e.g. withClue("Imputer should fail if inputCols and outputCols are different length").
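A sketch of that ScalaTest pattern (the message fragment asserted on is illustrative):

```scala
withClue("Imputer should fail if inputCols and outputCols are different length") {
  val e = intercept[IllegalArgumentException] {
    imputer.fit(df)
  }
  // Optionally pin down the thrown message as well.
  assert(e.getMessage.contains("inputCols"))
}
```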
|
Test build #74038 has finished for PR 11601 at commit
|
 * Note that the mean/median value is computed after filtering out missing values.
 * All Null values in the input column are treated as missing, and so are also imputed. For
 * computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.
Ah I see it is here - nevermind
val ic = col(inputCol)
val filtered = dataset.select(ic.cast(DoubleType))
  .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
if (filtered.rdd.isEmpty()) {
I think we can do filtered.take(1).size == 0 which should be more efficient
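That is, something like the following sketch, which avoids converting the Dataset to an RDD just to test emptiness:

```scala
// take(1) fetches at most one row, so emptiness is decided as soon as any
// surviving row is found.
if (filtered.take(1).isEmpty) {
  throw new SparkException(s"surrogate cannot be computed. " +
    s"All the values in $inputCol are Null, Nan or missingValue (${$(missingValue)})")
}
```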
  .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
if (filtered.rdd.isEmpty()) {
  throw new SparkException(s"surrogate cannot be computed. " +
    s"All the values in $inputCol are Null, Nan or missingValue ($missingValue)")
($missingValue) -> ${$(missingValue)}?
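The distinction, as a sketch: inside an s-interpolated string, $missingValue renders the Param object's toString, while ${$(missingValue)} renders its current value.

```scala
// Wrong: interpolates the Param object itself.
s"missingValue ($missingValue)"
// Right: interpolates the param's configured value, e.g. "missingValue (NaN)".
s"missingValue (${$(missingValue)})"
```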
|
Made a few last comments. LGTM. cc @sethah @jkbradley. I am going to merge this for 2.2. Let me know if you have any final comments. |
|
By the way, out of curiosity, I tested things out on a cluster (4x workers, 192 cores & 480GB RAM total) with 100 columns of 100 million doubles each and 1% missing values, both not cached and cached. |
|
Test build #74216 has finished for PR 11601 at commit
|
|
Thanks @MLnick for being the shepherd and providing consistent help on discussion and review. The performance test matches what I got from my local environment. |
|
jenkins retest this please |
|
Created SPARK-19969 to track docs and examples to be done for the 2.2 release. I can help with this if you're tied up. |
|
Test build #74651 has finished for PR 11601 at commit
|
|
Merged to master. Thanks @hhbyyh and also everyone for reviews. |
What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-13568
It is quite common to encounter missing values in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to e.g. Imputer in scikit-learn.
Initially, options for imputation could include mean, median and most frequent, but we could add various other approaches. Where possible, existing DataFrame code can be used (e.g. for approximate quantiles, etc.).
Currently this PR supports imputation for Double and Vector (null and NaN in Vector).
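A hedged end-to-end sketch of the multi-column API the PR converged on during review (data values and column names are illustrative):

```scala
import org.apache.spark.ml.feature.Imputer

val df = spark.createDataFrame(Seq(
  (1.0, Double.NaN),
  (2.0, 3.0),
  (Double.NaN, 5.0)
)).toDF("value1", "value2")

val model = new Imputer()
  .setInputCols(Array("value1", "value2"))
  .setOutputCols(Array("out1", "out2"))
  .setStrategy("median")
  .fit(df)

// Missing entries (NaN by default) are replaced by each column's median.
model.transform(df).show()
```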
How was this patch tested?
new unit tests and manual test