[SPARK-19969] [ML] Imputer doc and example by hhbyyh · Pull Request #17324 · apache/spark

hhbyyh · 2017-03-16T22:15:39Z

What changes were proposed in this pull request?

Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included. A Python example will be added after #17316.

How was this patch tested?

Local doc generation and example execution.

SparkQA · 2017-03-16T23:10:15Z

Test build #74689 has finished for PR 17324 at commit f2e7a69.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class JavaImputerExample

SparkQA · 2017-03-16T23:15:02Z

Test build #74690 has finished for PR 17324 at commit ac0683b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2017-03-21T17:12:40Z

Will take a look this week - also we may want to add the Python example here once I merge #17316

MLnick · 2017-03-21T23:13:34Z

+
+By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from
+other values in the corresponding columns. In our example, the surrogates for `a` and `b` are 3.0
+and 4.0 respectively. After transformation, the output columns will not contain missing value anymore.


Perhaps "After transformation, the missing values in the output columns will be replaced by the surrogate value for that column"?

MLnick · 2017-03-21T23:16:02Z

+import org.apache.spark.ml.feature.Imputer
+// $example off$
+import org.apache.spark.sql.SparkSession
+


Most examples have a small doc string that includes a "Run with:" part - see e.g. the recent MinHashLSHExample (this should also be added for the Java example)

MLnick · 2017-03-21T23:16:43Z

+      .getOrCreate()
+
+    // $example on$
+    val df = spark.createDataFrame( Seq(


Nit: Space in ( Seq( should be removed

MLnick · 2017-03-21T23:18:07Z


  /** Validates and transforms the input schema. */
  protected def validateAndTransformSchema(schema: StructType): StructType = {
+    require(get(inputCols).isDefined, "Input cols must be defined first.")


As I mentioned in #17316, is this really required? A non-set param for these will in any case throw an exception during `transformSchema` (or `fit`, or `transform`) with "no default value found".
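
To illustrate the point of this comment, here is a toy model in plain Python of a param lookup that already fails when a param is neither set nor given a default. The class `ToyParams`, its methods, and the exact error text are hypothetical and only mirror the behaviour described above; this is not Spark's implementation.

```python
class ToyParams:
    """Hypothetical stand-in for a params container (not Spark's API)."""

    def __init__(self, defaults=None):
        self._defaults = dict(defaults or {})  # params that have a default
        self._values = {}                      # params set explicitly by the user

    def set(self, name, value):
        self._values[name] = value
        return self

    def get_or_default(self, name):
        # An explicitly set value wins, then the default, otherwise fail loudly.
        if name in self._values:
            return self._values[name]
        if name in self._defaults:
            return self._defaults[name]
        raise LookupError(f"no default value found for param '{name}'")


params = ToyParams(defaults={"strategy": "mean"})

try:
    # inputCols was never set and has no default, so the lookup itself fails.
    params.get_or_default("inputCols")
except LookupError as err:
    print(err)
```

Under that assumption, an extra `require(get(inputCols).isDefined, ...)` before the lookup only changes the wording of the error, which is why the check looks redundant.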

MLnick · 2017-03-21T23:18:57Z

+
+## Imputer
+
+Imputation transformer for completing missing values in the dataset, either using the mean or the 


Maybe something like "The Imputer transformer completes missing values in ..."

MLnick · 2017-03-21T23:19:22Z

+
+Imputation transformer for completing missing values in the dataset, either using the mean or the 
+median of the columns in which the missing value are located. The input columns should be of
+DoubleType or FloatType. Currently Imputer does not support categorical features and possibly


Backticks for DoubleType and FloatType

MLnick · 2017-03-21T23:19:59Z

+Imputation transformer for completing missing values in the dataset, either using the mean or the 
+median of the columns in which the missing value are located. The input columns should be of
+DoubleType or FloatType. Currently Imputer does not support categorical features and possibly
+creates incorrect values for a categorical feature. All Null values in the input column are


Perhaps on a new line:

Note all null values in the input column ...

MLnick · 2017-03-21T23:22:09Z

+     5.0    |     5.0   
+~~~
+
+By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from


Perhaps "In this example, Imputer will replace all occurrences of Double.NaN (the default for the missing value) with the mean (the default imputation strategy) from the other values in the corresponding columns".

MLnick · 2017-03-21T23:22:26Z

+~~~
+
+By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from
+other values in the corresponding columns. In our example, the surrogates for `a` and `b` are 3.0


In this example, the surrogate values for columns a and b are ...
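
For readers following the wording discussion, the behaviour being documented can be sketched in plain Python without Spark. The helper `impute` and the sample columns are hypothetical: the thread only shows the last table row (`5.0 | 5.0`), so the other values are chosen here so that the surrogates come out as 3.0 and 4.0, as stated in the doc text. The sketch uses an exact median, which may differ from how Spark computes it.

```python
import math
import statistics


def impute(column, strategy="mean"):
    """Replace NaN entries with the mean or median of the observed values.

    Returns the surrogate value and the imputed column.
    """
    observed = [v for v in column if not math.isnan(v)]
    if strategy == "mean":
        surrogate = statistics.mean(observed)
    elif strategy == "median":
        surrogate = statistics.median(observed)
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    imputed = [surrogate if math.isnan(v) else v for v in column]
    return surrogate, imputed


nan = float("nan")
# Hypothetical sample data, NaN marks a missing value.
a = [1.0, 2.0, nan, 4.0, 5.0]
b = [nan, nan, 3.0, 4.0, 5.0]

surrogate_a, out_a = impute(a)  # mean of 1, 2, 4, 5
surrogate_b, out_b = impute(b)  # mean of 3, 4, 5
print(surrogate_a, out_a)
print(surrogate_b, out_b)
```

Each missing entry is replaced by the surrogate of its own column, so the output columns no longer contain `NaN`.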

MLnick · 2017-03-21T23:24:06Z

Generally looks fine - made a few small comments.

SparkQA · 2017-03-22T06:37:28Z

Test build #75031 has started for PR 17324 at commit 4bbe2f7.

MLnick · 2017-03-22T18:36:57Z

Jenkins retest this please

SparkQA · 2017-03-22T19:29:17Z

Test build #75058 has finished for PR 17324 at commit 4bbe2f7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-22T19:35:49Z

Test build #75059 has finished for PR 17324 at commit 8755dde.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2017-03-24T15:16:32Z

@hhbyyh #17316 is merged.

SparkQA · 2017-03-25T02:56:24Z

Test build #75197 has finished for PR 17324 at commit a2e24c0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2017-03-25T04:52:58Z

Updated with python example.

MLnick

A few mostly minor comments.

One missing thing is to include the Python example in user guide.

MLnick · 2017-03-27T12:28:25Z

+
+## Imputer
+
+The `Imputer` transformer completes missing values in the dataset, either using the mean or the 


"values in the dataset" -> "values in a dataset"

MLnick · 2017-03-27T12:28:52Z

+## Imputer
+
+The `Imputer` transformer completes missing values in the dataset, either using the mean or the 
+median of the columns in which the missing value are located. The input columns should be of


"value" -> "values"

MLnick · 2017-03-27T12:32:23Z

+The `Imputer` transformer completes missing values in the dataset, either using the mean or the 
+median of the columns in which the missing value are located. The input columns should be of
+`DoubleType` or `FloatType`. Currently `Imputer` does not support categorical features and possibly
+creates incorrect values for a categorical feature.


"... creates incorrect values for columns containing categorical features."

MLnick · 2017-03-27T12:34:11Z

+
+**Examples**
+
+Suppose that we have a DataFrame with the column `a` and `b`:


MLnick · 2017-03-27T12:34:37Z

+     5.0    |     5.0   
+~~~
+
+In this example, Imputer will replace all occurrences of Double.NaN (the default for the missing value)


backticks around Double.NaN

MLnick · 2017-03-27T12:40:07Z

+    Dataset<Row> df = spark.createDataFrame(data, schema);
+
+    Imputer imputerModel = new Imputer()
+      .setStrategy("mean")


Since we're using defaults we can remove the setStrategy call in all examples.

For the example code, can we keep it to introduce the primary API or important parameters?

It's not a big deal - still I think it's not necessary to illustrate setStrategy("mean") as we already mention in the user guide what the defaults are.

MLnick · 2017-03-27T12:41:42Z

+if __name__ == "__main__":
+    spark = SparkSession\
+        .builder\
+        .appName("imputer example")\


Let's use "PythonImputerExample" to be consistent for app name used in other examples

Sure. For consistency, how about just using "ImputerExample"?

MLnick · 2017-03-27T12:42:00Z

+        .getOrCreate()
+
+    # $example on$
+    dataFrame = spark.createDataFrame([


dataFrame -> df to be consistent with other examples

MLnick · 2017-03-27T12:44:19Z

+from pyspark.ml.feature import Imputer
+# $example off$
+from pyspark.sql import SparkSession
+


While I see that not all Python examples have it, let's add the comment here too:

"""
An example demonstrating Imputer.
Run with:
  bin/spark-submit examples/src/main/python/ml/imputer_example.py
"""

MLnick · 2017-03-27T12:44:47Z

@@ -0,0 +1,46 @@
+#


Prefer filename imputer_example.py to be consistent with other Python examples for ML

SparkQA · 2017-03-27T19:12:42Z

Test build #75271 has finished for PR 17324 at commit 7df70b7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2017-03-28T08:09:37Z

+    });
+    Dataset<Row> df = spark.createDataFrame(data, schema);
+
+    Imputer imputerModel = new Imputer()


Sorry just noticed this imputerModel here and model below. Let's call it imputer and model.

Thanks for finding this.

MLnick · 2017-03-28T08:09:58Z

+    ], ["a", "b"])
+
+    imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
+    imputerModel = imputer.fit(df)


just model

MLnick · 2017-03-28T08:11:32Z

+    imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
+    imputerModel = imputer.fit(df)
+
+    imputedData = imputerModel.transform(df)


In the other examples we just do model.transform(df).show() so let's be consistent.

MLnick

A few minor clean up points, then I think it should be ready.

SparkQA · 2017-03-29T06:42:35Z

Test build #75346 has started for PR 17324 at commit 48a1361.

hhbyyh · 2017-03-30T05:51:57Z

The test was interrupted and needs a retest.

MLnick · 2017-03-30T07:14:12Z

Jenkins retest this please

SparkQA · 2017-03-30T08:14:14Z

Test build #75383 has finished for PR 17324 at commit 48a1361.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2017-03-30T09:29:18Z

+    imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
+    model = imputer.fit(df)
+
+    model.transform(df).select("a", "b", "out_a", "out_b").show()


In previous comment I wasn't totally clear, sorry! I mean let's only have the transform(df).show() - so we can remove the select here as it's unnecessary.

MLnick

One last tweak to Python example.

LGTM pending that.

SparkQA · 2017-03-30T17:39:45Z

Test build #75396 has finished for PR 17324 at commit e17f997.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2017-04-03T09:41:47Z

Viewed generated docs and ran examples locally.

👍

Merged to master. Thanks!

YY-OnCall added 2 commits March 16, 2017 15:05

imputer doc and example

f2e7a69

add back extra line

ac0683b

MLnick reviewed Mar 21, 2017


YY-OnCall added 2 commits March 21, 2017 22:56

Merge remote-tracking branch 'upstream/master' into imputerdoc

30dbd1f

remove require

4bbe2f7

add example comments

8755dde

YY-OnCall added 2 commits March 24, 2017 18:37

Merge remote-tracking branch 'upstream/master' into imputerdoc

d3831a7

add python example

a2e24c0

MLnick suggested changes Mar 27, 2017


YY-OnCall added 2 commits March 27, 2017 10:23

Merge remote-tracking branch 'upstream/master' into imputerdoc

a0c348b

include python example

7df70b7

MLnick reviewed Mar 28, 2017


MLnick suggested changes Mar 28, 2017


YY-OnCall added 2 commits March 28, 2017 22:56

Merge remote-tracking branch 'upstream/master' into imputerdoc

125a4fc

variable rename

48a1361

MLnick reviewed Mar 30, 2017


remove select

e17f997

asfgit closed this in 4d28e84 Apr 3, 2017

