Member

@viirya viirya commented Jan 13, 2018

What changes were proposed in this pull request?

We have OneHotEncoderEstimator now, and OneHotEncoder is deprecated as of 2.3.0. We should add OneHotEncoderEstimator to the MLlib documentation.

We also need to provide corresponding examples for OneHotEncoderEstimator, which the documentation uses as well.

How was this patch tested?

Existing tests.

SparkQA commented Jan 13, 2018

Test build #86087 has finished for PR 20257 at commit 4e8f856.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class JavaOneHotEncoderEstimatorExample

SparkQA commented Jan 13, 2018

Test build #86089 has finished for PR 20257 at commit 05577df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class JavaOneHotEncoderEstimatorExample

Member Author

viirya commented Jan 13, 2018

cc @jkbradley @MLnick @WeichenXu123 I think we should update the MLlib documentation and examples for OneHotEncoderEstimator in 2.3.0.

SparkQA commented Jan 14, 2018

Test build #86117 has finished for PR 20257 at commit 21cb7d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class JavaOneHotEncoderEstimatorExample

Contributor

@MLnick MLnick left a comment

Thanks for this - made a first pass

</div>

## OneHotEncoder
## OneHotEncoder (Deprecated since 2.3.0)
Contributor

I think we should add a little more detail about why it's deprecated.

The reason is that the existing OneHotEncoder is a stateless transformer, so it is not usable on new data where the number of categories may differ from the training data. To fix this, a new OneHotEncoderEstimator was created that produces a OneHotEncoderModel when fit. Please also add a link to the JIRA ticket for more detail (https://issues.apache.org/jira/browse/SPARK-13030).
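
To illustrate the stateless-versus-fitted distinction described above, here is a minimal sketch in plain Python. It is not Spark code: `stateless_one_hot` and `FittedOneHot` are hypothetical names, and inferring the size from the largest index in the batch is a simplifying assumption about how a stateless encoder might behave.

```python
def _one_hot(index, size):
    """Return a dense one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def stateless_one_hot(indices):
    """Stateless encoding: the vector size is inferred from the batch being encoded."""
    size = max(indices) + 1
    return [_one_hot(idx, size) for idx in indices]


class FittedOneHot:
    """Hypothetical stand-in for a fitted model: the size is learned once, at fit time."""

    def __init__(self, size):
        self.size = size

    @classmethod
    def fit(cls, indices):
        return cls(max(indices) + 1)

    def transform(self, indices):
        return [_one_hot(idx, self.size) for idx in indices]


train = [0, 1, 2, 1]   # three categories seen during training
new_data = [0, 1]      # category 2 happens to be absent

train_width = len(stateless_one_hot(train)[0])
new_width = len(stateless_one_hot(new_data)[0])
fitted_width = len(FittedOneHot.fit(train).transform(new_data)[0])

print(train_width, new_width, fitted_width)  # prints: 3 2 3
```

The stateless version yields vectors of different widths for the two batches, which a downstream model trained on width 3 could not consume; the fitted version keeps the width fixed.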

Member Author

Sure. Added.

## OneHotEncoder
## OneHotEncoder (Deprecated since 2.3.0)

`OneHotEncoder` will be deprecated in 2.3.0 and removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) instead.
Contributor

Since it is deprecated - and I think we should be pretty aggressive about moving users to the new estimator - what do folks think about removing the description and examples from this doc and just pointing to the new estimator as done in this sentence here?

Member Author

I'd support this idea.

Member Author

Let me remove them first. If there are any strong objections, I can add them back.


## OneHotEncoderEstimator

[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
Contributor

We should add a note that it can handle multiple columns (and returns a one-hot-encoded output vector column for each input column, rather than merging into one output vector).

Also, what about describing the missing / invalid value handling in more detail?

## OneHotEncoder
## OneHotEncoder (Deprecated since 2.3.0)

`OneHotEncoder` will be deprecated in 2.3.0 and removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) instead.
Contributor

"will be" -> "has been"

and then "and will be removed"

.getOrCreate()

// $example on$
val df = spark.createDataFrame(Seq(
Contributor

I know the examples are re-creating the existing OneHotEncoder examples, but perhaps we should just drop the StringIndexer part and show a simplified example transforming the raw label indices to OHE vectors?

We could mention in the user guide that it is common to encode categorical features using StringIndexer first?

Member Author

OK with me. As an example, it does seem a bit lengthy because of the two StringIndexers.
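
For readers unfamiliar with the indexing step being dropped from the example, the following plain-Python sketch shows the general idea of turning string categories into label indices before one-hot encoding. It is not the StringIndexer implementation: `fit_string_index` is a hypothetical helper, the most-frequent-first ordering mirrors StringIndexer's default as far as I know, and the alphabetical tie-break is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each distinct label to an index, most frequent label first.

    Ties are broken alphabetically here so the result is deterministic;
    that tie-break is an assumption of this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


categories = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_index(categories)
indexed = [mapping[label] for label in categories]

print(mapping)  # {'a': 0.0, 'c': 1.0, 'b': 2.0}
print(indexed)  # [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
```

The resulting indices are the kind of raw label-index input that the simplified example would start from.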

Member Author

viirya commented Jan 16, 2018

@MLnick Thanks for the review. I think I've addressed all the comments. Please take a look at the updates.

SparkQA commented Jan 16, 2018

Test build #86153 has finished for PR 20257 at commit 262c046.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

## OneHotEncoder (Deprecated since 2.3.0)

[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
Because this existing `OneHotEncoder` is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new `OneHotEncoderEstimator` was created that produces an `OneHotEncoderModel` when fitting. For more detail, please see the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-13030).
Contributor

Change the JIRA link to a Markdown link, e.g.

"see [SPARK-13030](...)"

[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
Because this existing `OneHotEncoder` is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new `OneHotEncoderEstimator` was created that produces an `OneHotEncoderModel` when fitting. For more detail, please see the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-13030).

`OneHotEncoder` has been deprecated in 2.3.0 and will be removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) for one-hot encoding instead.
Contributor

I think you can remove "for one-hot encoding" and just make it "use [OHE](...) instead"


`OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column.

`OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error).
Contributor

"supports the ..."


`OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column.

`OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error).
Contributor

"how to handle invalid input during ..."

and "(any invalid inputs are assigned to an extra ..."

Contributor

"(... to an extra categorical number)"

.getOrCreate();

// $example on$
// Notice: this categorical features are usually encoded with `StringIndexer`.
Contributor

Perhaps we can move the note above the $example on$ - I don't think it is necessary for it to appear in the user guide as we've mentioned it above.

Also perhaps rather: Note: categorical features are usually first encoded with StringIndexer

.getOrCreate()

# $example on$
# Notice: this categorical features are usually encoded with `StringIndexer`.
Contributor

Same applies here

.getOrCreate()

// $example on$
// Notice: this categorical features are usually encoded with `StringIndexer`.
Contributor

Same applies here.


[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.

`OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column.
Contributor

"can handle multi-column. By specifying ..." -> "can transform multiple columns, returning a one-hot-encoded output ..."


`OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column.

`OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error).
Contributor

"(... to an extra categorical number)"

new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
new StructField("category", DataTypes.StringType, false, Metadata.empty())
new StructField("categoryIndex1", DataTypes.DoubleType, false, Metadata.empty()),
new StructField("categoryIndex2", DataTypes.DoubleType, false, Metadata.empty())
Contributor

There's no need to pass the Metadata.empty() param; it's the default value.
It would be better to keep the example code simpler.

Member Author

Since this is a Java example, the default parameter doesn't seem to work:

error: no suitable constructor found for StructField(String,DataType,boolean)
[error]       new StructField("categoryIndex1", DataTypes.DoubleType, false),
[error]       ^
[error] /root/repos/spark-1/constructor StructField.StructField(String,DataType,boolean,Metadata) is not applicable
[error]       (actual and formal argument lists differ in length)
[error]     constructor StructField.StructField() is not applicable


## OneHotEncoderEstimator

[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
Contributor

"with at most a single one-value" --> "each output binary vector includes at most a single one-value"

Member Author

viirya commented Jan 17, 2018

@MLnick @WeichenXu123 Your comments are addressed. Please check this again. Thanks.

SparkQA commented Jan 17, 2018

Test build #86244 has finished for PR 20257 at commit e57d9ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

`OneHotEncoderEstimator` can transform multiple columns, returning a one-hot-encoded output vector column for each input column.

`OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error).
`OneHotEncoderEstimator` supports the `handleInvalid` parameter to choose how to handle invalid input during transforming data. Available options include 'keep' (any invalid inputs are assigned to an extra categorical number) and 'error' (throw an error).
Contributor

perhaps "extra categorical number" would read better as "extra categorical index"?
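
As a rough illustration of the two `handleInvalid` options discussed in this thread, here is a plain-Python sketch. It is not Spark code: `one_hot_with_invalid` and its parameters are hypothetical, and the sketch ignores every other encoder option, so it only shows the 'keep' versus 'error' behaviour in isolation.

```python
def one_hot_with_invalid(index, num_categories, handle_invalid="error"):
    """Encode `index`, treating any index outside [0, num_categories) as invalid."""
    if handle_invalid not in ("error", "keep"):
        raise ValueError(f"unsupported handle_invalid: {handle_invalid!r}")
    is_valid = 0 <= index < num_categories
    if not is_valid and handle_invalid == "error":
        raise ValueError(f"unseen category index: {index}")
    # 'keep' reserves one extra slot, at position num_categories, for invalid input.
    size = num_categories + 1 if handle_invalid == "keep" else num_categories
    slot = index if is_valid else num_categories
    vector = [0.0] * size
    vector[slot] = 1.0
    return vector


kept_valid = one_hot_with_invalid(1, 3, handle_invalid="keep")    # [0.0, 1.0, 0.0, 0.0]
kept_invalid = one_hot_with_invalid(5, 3, handle_invalid="keep")  # [0.0, 0.0, 0.0, 1.0]

try:
    one_hot_with_invalid(5, 3, handle_invalid="error")
    raised = False
except ValueError:
    raised = True  # 'error' rejects the unseen index
```

With 'keep', every invalid input lands on the same extra index, which is why "extra categorical index" describes the behaviour well.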

## OneHotEncoderEstimator

[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, and each output binary vector includes at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
Contributor

@MLnick MLnick Jan 17, 2018

I don't really like this description as I think it conflates the core of what one-hot-encoding does with the implementation detail of dataframe columns (which we refer to in the next paragraph anyway).

How about "[OHE](...) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values."
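
One way to see why the wording says "at most" a single one-value: Spark's encoder has a `dropLast` parameter (enabled by default, though not discussed in this thread) under which the last category is represented by an all-zeros vector. The plain-Python sketch below mimics that idea; `one_hot_drop_last` is a hypothetical helper, not the Spark implementation.

```python
def one_hot_drop_last(index, num_categories, drop_last=True):
    """Encode `index`; with drop_last, the last category becomes the all-zeros vector."""
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} out of range for {num_categories} categories")
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


print(one_hot_drop_last(0, 3))                   # [1.0, 0.0]
print(one_hot_drop_last(2, 3))                   # [0.0, 0.0]
print(one_hot_drop_last(2, 3, drop_last=False))  # [0.0, 0.0, 1.0]
```

Dropping the last category keeps the encoded columns from summing to one for every row, at the cost of one category having no explicit indicator.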

[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, and each output binary vector includes at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.

`OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column.
`OneHotEncoderEstimator` can transform multiple columns, returning a one-hot-encoded output vector column for each input column.
Contributor

Perhaps we should add a note about vector assembling, something like "It is common to merge these vectors into a single feature vector using VectorAssembler"?
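
The multi-column behaviour and the assembling step can be sketched in plain Python as follows. This is not Spark code: `encode_columns` and `assemble` are hypothetical helpers, the input column names follow the Java example in this PR, and the output names `categoryVec1` and `categoryVec2` are assumptions for illustration.

```python
def one_hot(index, size):
    """Return a dense one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def encode_columns(row, sizes, output_cols):
    """Encode each input column of `row` into its own output vector column."""
    encoded = dict(row)
    for input_col, output_col in output_cols.items():
        encoded[output_col] = one_hot(row[input_col], sizes[input_col])
    return encoded


def assemble(row, vector_cols):
    """Concatenate the named vector columns into one feature vector."""
    features = []
    for col in vector_cols:
        features.extend(row[col])
    return features


row = {"categoryIndex1": 0, "categoryIndex2": 2}
sizes = {"categoryIndex1": 2, "categoryIndex2": 3}
output_cols = {"categoryIndex1": "categoryVec1", "categoryIndex2": "categoryVec2"}

encoded = encode_columns(row, sizes, output_cols)
features = assemble(encoded, ["categoryVec1", "categoryVec2"])

print(encoded["categoryVec1"])  # [1.0, 0.0]
print(encoded["categoryVec2"])  # [0.0, 0.0, 1.0]
print(features)                 # [1.0, 0.0, 0.0, 0.0, 1.0]
```

Each input column keeps its own output vector until the explicit assembling step, which matches the "one output vector column per input column" wording.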

Contributor

MLnick commented Jan 17, 2018

Added a few more small comments

Member Author

viirya commented Jan 17, 2018

@MLnick Changed as you suggested.

SparkQA commented Jan 17, 2018

Test build #86263 has finished for PR 20257 at commit 18cf226.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values.

`OneHotEncoderEstimator` can transform multiple columns, returning a one-hot-encoded output vector column for each input column.
`OneHotEncoderEstimator` can transform multiple columns, returning an one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using `VectorAssembler`.
Contributor

Add Markdown link for VectorAssembler

## OneHotEncoderEstimator

[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, and each output binary vector includes at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values.
Contributor

@viirya sorry for any confusion but I didn't intend you to remove these sentences:

This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.

Member Author

No problem. Added it back.

Contributor

MLnick commented Jan 17, 2018

A couple minor comments, otherwise looks fine.

I see we are changing the example names, so effectively removing the old examples. I'm ok with this, unless others have an objection?

SparkQA commented Jan 17, 2018

Test build #86269 has finished for PR 20257 at commit 3c697bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@MLnick MLnick left a comment

LGTM now, thanks. @WeichenXu123 ?

@WeichenXu123
Contributor

Nice, LGTM. Thanks!

asfgit pushed a commit that referenced this pull request Jan 19, 2018
## What changes were proposed in this pull request?

We have `OneHotEncoderEstimator` now, and `OneHotEncoder` is deprecated as of 2.3.0. We should add `OneHotEncoderEstimator` to the MLlib documentation.

We also need to provide corresponding examples for `OneHotEncoderEstimator`, which the documentation uses as well.

## How was this patch tested?

Existing tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #20257 from viirya/SPARK-23048.

(cherry picked from commit b743664)
Signed-off-by: Nick Pentreath <[email protected]>
@MLnick
Contributor

MLnick commented Jan 19, 2018

Merged to master / branch-2.3, thanks!

@asfgit asfgit closed this in b743664 Jan 19, 2018
@viirya viirya deleted the SPARK-23048 branch December 27, 2023 18:35