[SPARK-19969] [ML] Imputer doc and example #17324
hhbyyh wants to merge 12 commits into apache:master
|
Test build #74689 has finished for PR 17324 at commit
|
|
Test build #74690 has finished for PR 17324 at commit
|
|
Will take a look this week - also we may want to add the Python example here once I merge #17316 |
|
|
||
| By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from | ||
| other values in the corresponding columns. In our example, the surrogates for `a` and `b` are 3.0 | ||
| and 4.0 respectively. After transformation, the output columns will not contain missing value anymore. |
Perhaps "After transformation, the missing values in the output columns will be replaced by the surrogate value for that column"?
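To make the wording about surrogates concrete, here is a minimal plain-Python sketch of mean imputation. It is not the Spark `Imputer` API, only the arithmetic behind it. The input data is an assumption: the quoted diff shows only the last row (`5.0 | 5.0`), so the remaining rows were picked so that the surrogates come out to the 3.0 and 4.0 mentioned in the doc text. The helper names `mean_surrogate` and `impute_mean` are hypothetical.

```python
import math


def mean_surrogate(values):
    """Mean of the non-missing (non-NaN) entries of one column."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


def impute_mean(columns):
    """Return (surrogates, imputed columns); NaN entries are replaced per column."""
    surrogates = {name: mean_surrogate(vals) for name, vals in columns.items()}
    imputed = {
        name: [surrogates[name] if math.isnan(v) else v for v in vals]
        for name, vals in columns.items()
    }
    return surrogates, imputed


nan = float("nan")
# Assumed illustrative data; only the last row appears in the quoted diff.
data = {
    "a": [1.0, 2.0, nan, 4.0, 5.0],
    "b": [nan, nan, 3.0, 4.0, 5.0],
}

surrogates, imputed = impute_mean(data)
print(surrogates)    # {'a': 3.0, 'b': 4.0}
print(imputed["a"])  # [1.0, 2.0, 3.0, 4.0, 5.0]
print(imputed["b"])  # [4.0, 4.0, 3.0, 4.0, 5.0]
```

Under these assumptions the output columns keep every observed value and differ from the inputs only where a `NaN` was replaced by that column's surrogate, which is what the suggested sentence states.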
| import org.apache.spark.ml.feature.Imputer | ||
| // $example off$ | ||
| import org.apache.spark.sql.SparkSession | ||
|
|
Most examples have a small doc string that includes a "Run with:" part - see e.g. the recent MinHashLSHExample (this should also be added for the Java example)
| .getOrCreate() | ||
|
|
||
| // $example on$ | ||
| val df = spark.createDataFrame( Seq( |
Nit: Space in ( Seq( should be removed
|
|
||
| /** Validates and transforms the input schema. */ | ||
| protected def validateAndTransformSchema(schema: StructType): StructType = { | ||
| require(get(inputCols).isDefined, "Input cols must be defined first.") |
As I mentioned in #17316, is this really required? A non-set param for these will in any case throw an exception during transformSchema (or fit, or transform) with "no default value found".
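The redundancy argument can be sketched in plain Python. The `Params` class below is a hypothetical, heavily simplified stand-in for a params container, not Spark's actual API, and the error message is illustrative. It only shows the reasoning: if reading an unset param with no default already raises, a separate "is it defined?" check in front of the read adds nothing.

```python
class Params:
    """Hypothetical, simplified params container (not Spark's API)."""

    def __init__(self, defaults=None):
        self._defaults = dict(defaults or {})
        self._values = {}

    def set(self, name, value):
        self._values[name] = value
        return self

    def get_or_default(self, name):
        # Explicitly set values win over defaults.
        if name in self._values:
            return self._values[name]
        if name in self._defaults:
            return self._defaults[name]
        # Unset and no default: fail loudly, naming the param.
        raise KeyError(f"no default value found for param {name!r}")


def validate_schema(params, schema_columns):
    """Check that every input column exists in the schema."""
    # No explicit "inputCols is defined" check: the lookup already raises when unset.
    input_cols = params.get_or_default("inputCols")
    missing = [c for c in input_cols if c not in schema_columns]
    if missing:
        raise ValueError(f"input columns not in schema: {missing}")
    return input_cols


params = Params(defaults={"strategy": "mean"})
try:
    validate_schema(params, ["a", "b"])
except KeyError as err:
    print("unset param:", err)

params.set("inputCols", ["a", "b"])
print(validate_schema(params, ["a", "b", "c"]))
```

Whether the explicit `require` is still worth keeping for a friendlier message is a separate judgment call; the sketch only covers the "it would fail anyway" part.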
|
|
||
| ## Imputer | ||
|
|
||
| Imputation transformer for completing missing values in the dataset, either using the mean or the |
Maybe something like "The Imputer transformer completes missing values in ..."
|
|
||
| Imputation transformer for completing missing values in the dataset, either using the mean or the | ||
| median of the columns in which the missing value are located. The input columns should be of | ||
| DoubleType or FloatType. Currently Imputer does not support categorical features and possibly |
Backticks for DoubleType and FloatType
| Imputation transformer for completing missing values in the dataset, either using the mean or the | ||
| median of the columns in which the missing value are located. The input columns should be of | ||
| DoubleType or FloatType. Currently Imputer does not support categorical features and possibly | ||
| creates incorrect values for a categorical feature. All Null values in the input column are |
Perhaps on a new line:
Note all null values in the input column ...
| 5.0 | 5.0 | ||
| ~~~ | ||
|
|
||
| By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from |
Perhaps "In this example, Imputer will replace all occurrences of Double.NaN (the default for the missing value) with the mean (the default imputation strategy) from the other values in the corresponding columns".
| ~~~ | ||
|
|
||
| By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from | ||
| other values in the corresponding columns. In our example, the surrogates for `a` and `b` are 3.0 |
In this example, the surrogate values for columns a and b are ...
|
Generally looks fine - made a few small comments. |
|
Test build #75031 has started for PR 17324 at commit |
|
Jenkins retest this please |
|
Test build #75058 has finished for PR 17324 at commit
|
|
Test build #75059 has finished for PR 17324 at commit
|
|
Test build #75197 has finished for PR 17324 at commit
|
|
Updated with python example. |
MLnick
left a comment
A few mostly minor comments.
One missing thing is to include the Python example in user guide.
|
|
||
| ## Imputer | ||
|
|
||
| The `Imputer` transformer completes missing values in the dataset, either using the mean or the |
"values in the dataset" -> "values in a dataset"
| ## Imputer | ||
|
|
||
| The `Imputer` transformer completes missing values in the dataset, either using the mean or the | ||
| median of the columns in which the missing value are located. The input columns should be of |
| The `Imputer` transformer completes missing values in the dataset, either using the mean or the | ||
| median of the columns in which the missing value are located. The input columns should be of | ||
| `DoubleType` or `FloatType`. Currently `Imputer` does not support categorical features and possibly | ||
| creates incorrect values for a categorical feature. |
"... creates incorrect values for columns containing categorical features."
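A small plain-Python illustration of why the doc warns about categorical features. The label encoding and the data are made up for this sketch; the point is only that a mean over category indices can land on a number that is not a category at all.

```python
import math

# Hypothetical label encoding; the indices carry no numeric meaning.
CATEGORIES = {0.0: "red", 1.0: "green", 2.0: "blue"}

nan = float("nan")
# Assumed data: an index-encoded categorical column with one missing entry.
color_index = [0.0, 0.0, 2.0, nan, 2.0, 2.0]

# Mean imputation treats the indices as ordinary numbers.
present = [v for v in color_index if not math.isnan(v)]
surrogate = sum(present) / len(present)

print(surrogate)
print(surrogate in CATEGORIES)  # False: the surrogate is not a valid category index
```

Here the surrogate falls between "green" and "blue" and corresponds to neither, which is the kind of incorrect value the sentence is describing.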
|
|
||
| **Examples** | ||
|
|
||
| Suppose that we have a DataFrame with the column `a` and `b`: |
| 5.0 | 5.0 | ||
| ~~~ | ||
|
|
||
| In this example, Imputer will replace all occurrences of Double.NaN (the default for the missing value) |
backticks around Double.NaN
| Dataset<Row> df = spark.createDataFrame(data, schema); | ||
|
|
||
| Imputer imputerModel = new Imputer() | ||
| .setStrategy("mean") |
Since we're using defaults we can remove the setStrategy call in all examples.
For the example code, can we keep it to introduce the primary API or important parameters?
It's not a big deal - still I think it's not necessary to illustrate setStrategy("mean") as we already mention in the user guide what the defaults are.
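For readers following the `setStrategy("mean")` thread, this is a plain-Python sketch of the point being made: when a parameter has a default, passing that same value explicitly changes nothing. The function `impute` and its parameter names are hypothetical and only mirror the idea of a strategy and a missing-value placeholder; it is not the Spark API. The median here is the exact median from the standard library, and a distributed implementation may well approximate it instead.

```python
import math
import statistics


def impute(values, strategy="mean", missing_value=float("nan")):
    """Replace entries equal to missing_value with the column's mean or median."""

    def is_missing(v):
        # NaN never compares equal to itself, so it needs its own check.
        if math.isnan(missing_value):
            return math.isnan(v)
        return v == missing_value

    present = [v for v in values if not is_missing(v)]
    if strategy == "mean":
        surrogate = sum(present) / len(present)
    elif strategy == "median":
        surrogate = statistics.median(present)
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")
    return [surrogate if is_missing(v) else v for v in values]


nan = float("nan")
column = [1.0, 2.0, nan, 4.0, 10.0]  # assumed illustrative data

default_result = impute(column)
explicit_result = impute(column, strategy="mean")
median_result = impute(column, strategy="median")

print(default_result == explicit_result)  # True: explicit "mean" equals the default
print(default_result)                     # [1.0, 2.0, 4.25, 4.0, 10.0]
print(median_result)                      # [1.0, 2.0, 3.0, 4.0, 10.0]
```

Keeping the explicit call in an example is then purely a documentation choice: it shows the parameter exists, at the cost of one line that does not change the result.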
| if __name__ == "__main__": | ||
| spark = SparkSession\ | ||
| .builder\ | ||
| .appName("imputer example")\ |
Let's use "PythonImputerExample" to be consistent for app name used in other examples
Sure. For consistency, how about just using "ImputerExample"?
| .getOrCreate() | ||
|
|
||
| # $example on$ | ||
| dataFrame = spark.createDataFrame([ |
dataFrame -> df to be consistent with other examples
| from pyspark.ml.feature import Imputer | ||
| # $example off$ | ||
| from pyspark.sql import SparkSession | ||
|
|
While I see that not all Python examples have it, let's add the comment here too:
"""
An example demonstrating Imputer.
Run with:
bin/spark-submit examples/src/main/python/ml/imputer_example.py
"""
| @@ -0,0 +1,46 @@ | |||
| # | |||
Prefer filename imputer_example.py to be consistent with other Python examples for ML
|
Test build #75271 has finished for PR 17324 at commit
|
| }); | ||
| Dataset<Row> df = spark.createDataFrame(data, schema); | ||
|
|
||
| Imputer imputerModel = new Imputer() |
Sorry, I just noticed this imputerModel here and model below. Let's call them imputer and model.
Thanks for finding this.
| ], ["a", "b"]) | ||
|
|
||
| imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"]) | ||
| imputerModel = imputer.fit(df) |
| imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"]) | ||
| imputerModel = imputer.fit(df) | ||
|
|
||
| imputedData = imputerModel.transform(df) |
In the other examples we just do model.transform(df).show() so let's be consistent.
MLnick
left a comment
A few minor clean up points, then I think it should be ready.
|
Test build #75346 has started for PR 17324 at commit |
|
The test was interrupted and needs a retest. |
|
Jenkins retest this please |
|
Test build #75383 has finished for PR 17324 at commit
|
| imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"]) | ||
| model = imputer.fit(df) | ||
|
|
||
| model.transform(df).select("a", "b", "out_a", "out_b").show() |
In my previous comment I wasn't totally clear, sorry! I meant let's only have the transform(df).show(), so we can remove the select here as it's unnecessary.
MLnick
left a comment
One last tweak to Python example.
LGTM pending that.
|
Test build #75396 has finished for PR 17324 at commit
|
|
Viewed generated docs and ran examples locally. 👍 Merged to master. Thanks! |
What changes were proposed in this pull request?
Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included. A Python example will be added after #17316.
How was this patch tested?
Local doc generation and example execution.