[SPARK-23048][ML] Add OneHotEncoderEstimator document and examples #20257

viirya · 2018-01-13T08:09:54Z

What changes were proposed in this pull request?

We now have OneHotEncoderEstimator, and OneHotEncoder is deprecated as of 2.3.0. We should add OneHotEncoderEstimator to the MLlib documentation.

We also need to provide corresponding examples for OneHotEncoderEstimator, which are used in the documentation too.

How was this patch tested?

Existing tests.

SparkQA · 2018-01-13T08:14:35Z

Test build #86087 has finished for PR 20257 at commit 4e8f856.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class JavaOneHotEncoderEstimatorExample

SparkQA · 2018-01-13T08:41:56Z

Test build #86089 has finished for PR 20257 at commit 05577df.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class JavaOneHotEncoderEstimatorExample

viirya · 2018-01-13T10:03:31Z

cc @jkbradley @MLnick @WeichenXu123 I think we should update the MLlib documentation and examples for OneHotEncoderEstimator in 2.3.0.

SparkQA · 2018-01-14T09:12:36Z

Test build #86117 has finished for PR 20257 at commit 21cb7d3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class JavaOneHotEncoderEstimatorExample

MLnick

Thanks for this - made a first pass

MLnick · 2018-01-15T09:58:26Z

docs/ml-features.md

 </div>

-## OneHotEncoder
+## OneHotEncoder (Deprecated since 2.3.0)


I think we should add a little more detail about why it's deprecated.

The reason is that the existing OneHotEncoder is a stateless transformer, so it is not usable on new data where the number of categories may differ from the training data. To fix this, a new OneHotEncoderEstimator was created that produces a OneHotEncoderModel when fit. Add a link to the JIRA ticket for more detail (https://issues.apache.org/jira/browse/SPARK-13030).

Sure. Added.
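
The stateless-versus-fitted distinction raised in this thread can be sketched in plain Python. This is an illustration only, not Spark code: `stateless_encode` and `FittedEncoder` are hypothetical names, and the sketch assumes a stateless encoder sizes its vectors from whatever batch it is handed.

```python
from typing import List


def stateless_encode(indices: List[int]) -> List[List[int]]:
    """Size the vectors from the batch itself (no state kept from training)."""
    size = max(indices) + 1
    return [[1 if i == idx else 0 for i in range(size)] for idx in indices]


class FittedEncoder:
    """Learn the number of categories once, then reuse it for every batch."""

    def __init__(self, num_categories: int) -> None:
        self.num_categories = num_categories

    @classmethod
    def fit(cls, indices: List[int]) -> "FittedEncoder":
        return cls(max(indices) + 1)

    def transform(self, indices: List[int]) -> List[List[int]]:
        return [
            [1 if i == idx else 0 for i in range(self.num_categories)]
            for idx in indices
        ]


train = [0, 1, 2]
new_batch = [0, 1]  # category 2 happens to be absent from the new data

# Stateless: the vector length follows the batch, so it differs between datasets.
stateless_lengths = (len(stateless_encode(train)[0]), len(stateless_encode(new_batch)[0]))

# Fitted: the vector length is fixed by the training data.
model = FittedEncoder.fit(train)
fitted_lengths = (len(model.transform(train)[0]), len(model.transform(new_batch)[0]))
```

A downstream model trained on vectors of one length cannot consume vectors of another, which is the problem the estimator/model split is meant to avoid.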

MLnick · 2018-01-15T10:01:30Z

docs/ml-features.md

-## OneHotEncoder
+## OneHotEncoder (Deprecated since 2.3.0)
+
+`OneHotEncoder` will be deprecated in 2.3.0 and removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) instead.


Since it is deprecated - and I think we should be pretty aggressive about moving users to the new estimator - what do folks think about removing the description and examples from this doc and just pointing to the new estimator as done in this sentence here?

I'd support this idea.

Let me remove them first. If there are any strong objections, I can add it back.

MLnick · 2018-01-15T10:03:38Z

docs/ml-features.md


+## OneHotEncoderEstimator
+
+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.


We should add a note that it can handle multiple columns (and returns a one-hot-encoded output vector column for each input column, rather than merging into one output vector).

Also, what about describing the missing / invalid value handling in more detail?

MLnick · 2018-01-15T10:13:26Z

docs/ml-features.md

-## OneHotEncoder
+## OneHotEncoder (Deprecated since 2.3.0)
+
+`OneHotEncoder` will be deprecated in 2.3.0 and removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) instead.


"will be" -> "has been"

and then "and will be removed"

MLnick · 2018-01-15T10:20:13Z

examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderEstimatorExample.scala

+      .getOrCreate()
+
+    // $example on$
+    val df = spark.createDataFrame(Seq(


I know the examples are re-creating the existing OneHotEncoder examples, but perhaps we should just drop the StringIndexer part and show a simplified example transforming the raw label indices to OHE vectors?

We could mention in the user guide that it is common to encode categorical features using StringIndexer first?

OK with me. As an example, it seems a bit lengthy because of the two StringIndexers.

viirya · 2018-01-16T03:52:13Z

@MLnick Thanks for the review. I think I've addressed all the comments. Please take a look at the updates.

SparkQA · 2018-01-16T04:07:16Z

Test build #86153 has finished for PR 20257 at commit 262c046.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2018-01-16T10:53:07Z

docs/ml-features.md

+## OneHotEncoder (Deprecated since 2.3.0)

-[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
+Because this existing `OneHotEncoder` is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new `OneHotEncoderEstimator` was created that produces an `OneHotEncoderModel` when fitting. For more detail, please see the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-13030).


Change the JIRA link to a Markdown link, e.g.

"see [SPARK-13030](...)"

MLnick · 2018-01-16T10:53:43Z

docs/ml-features.md

-[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
+Because this existing `OneHotEncoder` is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new `OneHotEncoderEstimator` was created that produces an `OneHotEncoderModel` when fitting. For more detail, please see the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-13030).
+
+`OneHotEncoder` has been deprecated in 2.3.0 and will be removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) for one-hot encoding instead.


I think you can remove "for one-hot encoding" and just make it "use [OHE](...) instead"

MLnick · 2018-01-16T12:15:35Z

docs/ml-features.md

+
+`OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column.
+
+`OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error).


"supports the ..."

MLnick · 2018-01-16T12:15:57Z

docs/ml-features.md

+
+`OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column.
+
+`OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error).


"how to handle invalid input during ..."

and "(any invalid inputs are assigned to an extra ..."

"(... to an extra categorical number)"

MLnick · 2018-01-16T12:20:01Z

examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java

      .getOrCreate();

    // $example on$
+    // Notice: this categorical features are usually encoded with `StringIndexer`.


Perhaps we can move the note above the $example on$ - I don't think it is necessary for it to appear in the user guide as we've mentioned it above.

Also perhaps rather: Note: categorical features are usually first encoded with StringIndexer

MLnick · 2018-01-16T12:21:15Z

examples/src/main/python/ml/onehot_encoder_estimator_example.py

        .getOrCreate()

    # $example on$
+    # Notice: this categorical features are usually encoded with `StringIndexer`.


Same applies here

MLnick · 2018-01-16T12:21:32Z

examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderEstimatorExample.scala

      .getOrCreate()

    // $example on$
+    // Notice: this categorical features are usually encoded with `StringIndexer`.


Same applies here.

MLnick · 2018-01-16T12:23:16Z

docs/ml-features.md

+
+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
+
+`OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column.


"can handle multi-column. By specifying ..." -> "can transform multiple columns, returning a one-hot-encoded output ..."

WeichenXu123 · 2018-01-16T19:01:31Z

docs/ml-features.md

+
+`OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column.
+
+`OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error).


"(... to an extra categorical number)"

WeichenXu123 · 2018-01-16T19:10:51Z

examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java

-      new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
-      new StructField("category", DataTypes.StringType, false, Metadata.empty())
+      new StructField("categoryIndex1", DataTypes.DoubleType, false, Metadata.empty()),
+      new StructField("categoryIndex2", DataTypes.DoubleType, false, Metadata.empty())


There's no need to pass the Metadata.empty() param; it's a default value.
We'd better make the example code simpler.

Since this is a Java example, the default param doesn't seem to work:

error: no suitable constructor found for StructField(String,DataType,boolean)
[error] new StructField("categoryIndex1", DataTypes.DoubleType, false),
[error] ^
[error] /root/repos/spark-1/constructor StructField.StructField(String,DataType,boolean,Metadata) is not applicable
[error] (actual and formal argument lists differ in length)
[error] constructor StructField.StructField() is not applicable

WeichenXu123 · 2018-01-16T19:19:22Z

docs/ml-features.md

+
+## OneHotEncoderEstimator
+
+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.


"with at most a single one-value" --> "each output binary vector include at most a single one-value"

viirya · 2018-01-17T05:23:44Z

@MLnick @WeichenXu123 Your comments are addressed. Please check this again. Thanks.

SparkQA · 2018-01-17T05:30:08Z

Test build #86244 has finished for PR 20257 at commit e57d9ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2018-01-17T09:00:38Z

docs/ml-features.md

+`OneHotEncoderEstimator` can transform multiple columns, returning a one-hot-encoded output vector column for each input column.

-`OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error).
+`OneHotEncoderEstimator` supports the `handleInvalid` parameter to choose how to handle invalid input during transforming data. Available options include 'keep' (any invalid inputs are assigned to an extra categorical number) and 'error' (throw an error).


perhaps "extra categorical number" would read better as "extra categorical index"?
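
The two `handleInvalid` options discussed here can be illustrated with a small plain-Python sketch. It is not Spark's implementation: `one_hot` and its parameters are hypothetical, other encoder options are ignored, and the assumption is that 'keep' reserves one extra trailing index for all invalid inputs.

```python
from typing import List


def one_hot(index: int, num_categories: int, handle_invalid: str = "error") -> List[int]:
    """Encode one label index; `num_categories` is what was seen during fitting."""
    if handle_invalid not in ("error", "keep"):
        raise ValueError(f"unsupported handle_invalid: {handle_invalid!r}")
    valid = 0 <= index < num_categories
    if handle_invalid == "error":
        if not valid:
            raise ValueError(f"unseen category index: {index}")
        size, slot = num_categories, index
    else:
        # 'keep': reserve one extra slot at the end for every invalid input.
        size = num_categories + 1
        slot = index if valid else num_categories
    return [1 if i == slot else 0 for i in range(size)]


seen = one_hot(1, 3)                 # valid index, default 'error' mode
kept = one_hot(7, 3, "keep")         # invalid index lands in the extra slot
kept_valid = one_hot(1, 3, "keep")   # valid index, but vector has the extra slot
```

Under this sketch, 'keep' makes every output vector one element longer, whether or not a given row is invalid.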

MLnick · 2018-01-17T09:03:52Z

docs/ml-features.md

 ## OneHotEncoderEstimator

-[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, and each output binary vector includes at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.


I don't really like this description as I think it conflates the core of what one-hot-encoding does with the implementation detail of dataframe columns (which we refer to in the next paragraph anyway).

How about "[OHE](...) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values."

MLnick · 2018-01-17T09:07:44Z

docs/ml-features.md

+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, and each output binary vector includes at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.

-`OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column.
+`OneHotEncoderEstimator` can transform multiple columns, returning a one-hot-encoded output vector column for each input column.


Perhaps we should add a note about vector assembling, something like "It is common to merge these vectors into a single feature vector using VectorAssembler"?
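
As a rough plain-Python picture of the two steps mentioned in this comment (one output vector per input column, then merging them into one feature vector): `encode_columns` and `assemble` are hypothetical helpers, not Spark APIs, and the per-column category counts in `sizes` are assumed to be known already.

```python
from typing import Dict, List

Row = Dict[str, int]


def encode_columns(rows: List[Row], sizes: Dict[str, int]) -> List[Dict[str, List[int]]]:
    """Return one one-hot vector per input column, kept as separate columns."""
    return [
        {
            col: [1 if i == row[col] else 0 for i in range(size)]
            for col, size in sizes.items()
        }
        for row in rows
    ]


def assemble(encoded_row: Dict[str, List[int]], cols: List[str]) -> List[int]:
    """Concatenate the per-column vectors into a single feature vector."""
    return [value for col in cols for value in encoded_row[col]]


rows = [
    {"categoryIndex1": 0, "categoryIndex2": 1},
    {"categoryIndex1": 2, "categoryIndex2": 0},
]
sizes = {"categoryIndex1": 3, "categoryIndex2": 2}  # assumed category counts

encoded = encode_columns(rows, sizes)
features = [assemble(r, ["categoryIndex1", "categoryIndex2"]) for r in encoded]
```

Keeping the encoded columns separate until an explicit assembly step leaves the caller free to choose which columns end up in the final feature vector, and in what order.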

MLnick · 2018-01-17T09:11:22Z

Added a few more small comments

viirya · 2018-01-17T10:57:45Z

@MLnick Changed as you suggested.

SparkQA · 2018-01-17T11:13:29Z

Test build #86263 has finished for PR 20257 at commit 18cf226.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2018-01-17T12:36:47Z

docs/ml-features.md

+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values.

-`OneHotEncoderEstimator` can transform multiple columns, returning a one-hot-encoded output vector column for each input column.
+`OneHotEncoderEstimator` can transform multiple columns, returning an one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using `VectorAssembler`.


Add Markdown link for VectorAssembler

MLnick · 2018-01-17T12:49:38Z

docs/ml-features.md

 ## OneHotEncoderEstimator

-[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, and each output binary vector includes at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values.


@viirya sorry for any confusion but I didn't intend you to remove these sentences:

This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.

No problem. Added it back.

MLnick · 2018-01-17T12:51:10Z

A couple minor comments, otherwise looks fine.

I see we are changing the example names, so effectively removing the old examples. I'm ok with this, unless others have an objection?

SparkQA · 2018-01-17T13:24:30Z

Test build #86269 has finished for PR 20257 at commit 3c697bd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick

LGTM now, thanks. @WeichenXu123 ?

WeichenXu123 · 2018-01-19T00:31:43Z

Nice, LGTM. Thanks!

## What changes were proposed in this pull request?

We have `OneHotEncoderEstimator` now and `OneHotEncoder` will be deprecated since 2.3.0. We should add `OneHotEncoderEstimator` into mllib document.

We also need to provide corresponding examples for `OneHotEncoderEstimator` which are used in the document too.

## How was this patch tested?

Existing tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #20257 from viirya/SPARK-23048.

(cherry picked from commit b743664)
Signed-off-by: Nick Pentreath <[email protected]>

MLnick · 2018-01-19T10:49:36Z

Merged to master / branch-2.3, thanks!

viirya force-pushed the SPARK-23048 branch from 4e8f856 to 05577df Compare January 13, 2018 08:24

Update mllib docs for OneHotEncoderEstimator.

21cb7d3

viirya force-pushed the SPARK-23048 branch from 05577df to 21cb7d3 Compare January 14, 2018 08:54

MLnick suggested changes Jan 15, 2018


viirya added 2 commits January 16, 2018 03:47

Address comment.

13a7b90

Remove OneHotEncoder examples.

262c046

MLnick reviewed Jan 16, 2018


WeichenXu123 reviewed Jan 16, 2018


Address comments.

e57d9ee

MLnick reviewed Jan 17, 2018


Address comments.

18cf226

MLnick reviewed Jan 17, 2018


Add markdown link.

3c697bd

MLnick approved these changes Jan 18, 2018


asfgit closed this in b743664 Jan 19, 2018

viirya deleted the SPARK-23048 branch December 27, 2023 18:35


