55 changes: 55 additions & 0 deletions docs/ml-features.md
@@ -1284,6 +1284,61 @@ for more details on the API.

</div>


## Imputer

Imputation transformer for completing missing values in the dataset, either using the mean or the

Contributor

Maybe something like "The Imputer transformer completes missing values in ..."

median of the columns in which the missing value are located. The input columns should be of

Contributor

"value" -> "values"

DoubleType or FloatType. Currently Imputer does not support categorical features and possibly

Contributor

Backticks for DoubleType and FloatType

creates incorrect values for a categorical feature. All Null values in the input column are

@MLnick MLnick Mar 21, 2017

Contributor

Perhaps on a new line:

Note all null values in the input column ...

treated as missing, and so are also imputed.

**Examples**

Suppose that we have a DataFrame with the column `a` and `b`:

Contributor

columns


~~~
a           | b
------------|-----------
1.0         | Double.NaN
2.0         | Double.NaN
Double.NaN  | 3.0
4.0         | 4.0
5.0         | 5.0
~~~

By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from

Contributor

Perhaps "In this example, Imputer will replace all occurrences of Double.NaN (the default for the missing value) with the mean (the default imputation strategy) from the other values in the corresponding columns".

other values in the corresponding columns. In our example, the surrogates for `a` and `b` are 3.0

@MLnick MLnick Mar 21, 2017

Contributor

In this example, the surrogate values for columns a and b are ...

and 4.0 respectively. After transformation, the output columns will not contain missing value anymore.

@MLnick MLnick Mar 21, 2017

Contributor

Perhaps "After transformation, the missing values in the output columns will be replaced by the surrogate value for that column"?


~~~
a           | b          | out_a | out_b
------------|------------|-------|-------
1.0         | Double.NaN | 1.0   | 4.0
2.0         | Double.NaN | 2.0   | 4.0
Double.NaN  | 3.0        | 3.0   | 3.0
4.0         | 4.0        | 4.0   | 4.0
5.0         | 5.0        | 5.0   | 5.0
~~~
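
As a rough check of the arithmetic in the tables above, the sketch below reproduces the mean strategy in plain Java, without Spark. The class and method names (`MeanImputeSketch`, `surrogate`, `impute`) are hypothetical and exist only for this illustration; this is not how `Imputer` is implemented. It only assumes, as the text states, that `Double.NaN` marks a missing value and that the surrogate is the mean of the remaining values in the same column.

```java
import java.util.Arrays;

// Hypothetical stand-alone illustration; not part of the Spark API.
final class MeanImputeSketch {

  // Mean of the non-missing entries, where Double.NaN marks a missing value.
  // Returns NaN if every entry in the column is missing.
  static double surrogate(double[] column) {
    return Arrays.stream(column)
      .filter(v -> !Double.isNaN(v))
      .average()
      .orElse(Double.NaN);
  }

  // Returns a copy of the column with each missing entry replaced by the surrogate.
  static double[] impute(double[] column) {
    double s = surrogate(column);
    return Arrays.stream(column)
      .map(v -> Double.isNaN(v) ? s : v)
      .toArray();
  }

  public static void main(String[] args) {
    // Columns a and b from the example table.
    double[] a = {1.0, 2.0, Double.NaN, 4.0, 5.0};
    double[] b = {Double.NaN, Double.NaN, 3.0, 4.0, 5.0};

    System.out.println(Arrays.toString(impute(a))); // [1.0, 2.0, 3.0, 4.0, 5.0]
    System.out.println(Arrays.toString(impute(b))); // [4.0, 4.0, 3.0, 4.0, 5.0]
  }
}
```

The surrogate for `a` is (1.0 + 2.0 + 4.0 + 5.0) / 4 = 3.0 and for `b` is (3.0 + 4.0 + 5.0) / 3 = 4.0, which matches the `out_a` and `out_b` columns.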

<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [Imputer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Imputer)
for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/ImputerExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [Imputer Java docs](api/java/org/apache/spark/ml/feature/Imputer.html)
for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaImputerExample.java %}
</div>
</div>

Contributor

Need to include_example for the Python example here.


# Feature Selectors

## VectorSlicer
@@ -0,0 +1,67 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.examples.ml;

// $example on$
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Imputer;
import org.apache.spark.ml.feature.ImputerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
// $example off$

import static org.apache.spark.sql.types.DataTypes.*;

public class JavaImputerExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession
      .builder()
      .appName("JavaImputerExample")
      .getOrCreate();

    // $example on$
    List<Row> data = Arrays.asList(
      RowFactory.create(1.0, Double.NaN),
      RowFactory.create(2.0, Double.NaN),
      RowFactory.create(Double.NaN, 3.0),
      RowFactory.create(4.0, 4.0),
      RowFactory.create(5.0, 5.0)
    );
    StructType schema = new StructType(new StructField[]{
      createStructField("a", DoubleType, false),
      createStructField("b", DoubleType, false)
    });
    Dataset<Row> df = spark.createDataFrame(data, schema);

    Imputer imputerModel = new Imputer()

Contributor

Sorry just noticed this imputerModel here and model below. Let's call it imputer and model.

Contributor Author

Thanks for finding this.

      .setStrategy("mean")

Contributor

Since we're using defaults we can remove the setStrategy call in all examples.

@hhbyyh hhbyyh Mar 27, 2017

Contributor Author

For the example code, can we keep it to introduce the primary API or important parameters?

Contributor

It's not a big deal - still I think it's not necessary to illustrate setStrategy("mean") as we already mention in the user guide what the defaults are.

      .setInputCols(new String[]{"a", "b"})
      .setOutputCols(new String[]{"out_a", "out_b"});

    ImputerModel model = imputerModel.fit(df);
    model.transform(df).show();
    // $example off$

    spark.stop();
  }
}
@@ -0,0 +1,52 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.feature.Imputer
// $example off$
import org.apache.spark.sql.SparkSession

@MLnick MLnick Mar 21, 2017

Contributor

Most examples have a small doc string that includes a "Run with:" part - see e.g. the recent MinHashLSHExample (this should also be added for the Java example)

object ImputerExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ImputerExample")
      .getOrCreate()

    // $example on$
    val df = spark.createDataFrame( Seq(

Contributor

Nit: Space in ( Seq( should be removed

      (1.0, Double.NaN),
      (2.0, Double.NaN),
      (Double.NaN, 3.0),
      (4.0, 4.0),
      (5.0, 5.0)
    )).toDF("a", "b")

    val imputer = new Imputer()
      .setStrategy("mean")
      .setInputCols(Array("a", "b"))
      .setOutputCols(Array("out_a", "out_b"))

    val model = imputer.fit(df)
    model.transform(df).show()
    // $example off$

    spark.stop()
  }
}
@@ -35,7 +35,7 @@ import org.apache.spark.sql.types._
private[feature] trait ImputerParams extends Params with HasInputCols {

/**
- * The imputation strategy.
+ * The imputation strategy. Currently only "mean" and "median" are supported.
* If "mean", then replace missing values using the mean value of the feature.
* If "median", then replace missing values using the approximate median value of the feature.
* Default: mean
@@ -75,6 +75,8 @@ private[feature] trait ImputerParams extends Params with HasInputCols {

  /** Validates and transforms the input schema. */
  protected def validateAndTransformSchema(schema: StructType): StructType = {
    require(get(inputCols).isDefined, "Input cols must be defined first.")

Contributor

As I mentioned in #17316, is this really required? Since a non-set param for these will in any case throw an exception during transformSchema (or fit, or transform) with "no default value found"

    require(get(outputCols).isDefined, "Output cols must be defined first.")
    require($(inputCols).length == $(inputCols).distinct.length, s"inputCols contains" +
      s" duplicates: (${$(inputCols).mkString(", ")})")
    require($(outputCols).length == $(outputCols).distinct.length, s"outputCols contains" +