[SPARK-23048][ML] Add OneHotEncoderEstimator document and examples #20257
|
Test build #86087 has finished for PR 20257 at commit
|
|
Test build #86089 has finished for PR 20257 at commit
|
|
cc @jkbradley @MLnick @WeichenXu123 I think we should update the MLlib documentation and examples for OneHotEncoderEstimator in 2.3.0. |
|
Test build #86117 has finished for PR 20257 at commit
|
MLnick
left a comment
Thanks for this - made a first pass
| </div> | ||
|
|
||
| ## OneHotEncoder | ||
| ## OneHotEncoder (Deprecated since 2.3.0) |
I think we should add a little more detail about why it's deprecated.
The reason is that because the existing OneHotEncoder is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new OneHotEncoderEstimator was created that produces a OneHotEncoderModel when fit. Add a link to the JIRA ticket for more detail (https://issues.apache.org/jira/browse/SPARK-13030).
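To make the stateless-versus-fitted distinction concrete, here is a minimal plain-Python sketch. It is not Spark's implementation: `one_hot`, `stateless_encode`, and `FittedEncoder` are hypothetical names, and the assumption is simply that a stateless encoder sizes its output from whatever batch it sees, while a fitted model fixes the size once from the training data.

```python
def one_hot(index, size):
    """Return a dense 0/1 list of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def stateless_encode(indices):
    """Hypothetical stateless encoder: vector size is inferred from the batch itself."""
    size = max(indices) + 1
    return [one_hot(i, size) for i in indices]


class FittedEncoder:
    """Hypothetical fitted model: vector size is fixed once, from the training data."""

    def __init__(self, train_indices):
        self.size = max(train_indices) + 1

    def transform(self, indices):
        return [one_hot(i, self.size) for i in indices]


train = [0, 1, 2]
new_batch = [0, 1]  # category 2 happens to be absent from the new data

stateless_train = stateless_encode(train)
stateless_new = stateless_encode(new_batch)
model = FittedEncoder(train)
fitted_new = model.transform(new_batch)

# The stateless encoder produces vectors of different widths for the two batches,
# while the fitted model keeps the width learned from the training data.
print(len(stateless_train[0]), len(stateless_new[0]), len(fitted_new[0]))
```

A downstream model trained on width-3 vectors cannot consume the width-2 vectors from the stateless path, which is the usability problem described above.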
Sure. Added.
docs/ml-features.md
Outdated
| ## OneHotEncoder | ||
| ## OneHotEncoder (Deprecated since 2.3.0) | ||
|
|
||
| `OneHotEncoder` will be deprecated in 2.3.0 and removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) instead. |
Since it is deprecated - and I think we should be pretty aggressive about moving users to the new estimator - what do folks think about removing the description and examples from this doc and just pointing to the new estimator as done in this sentence here?
I'd support this idea.
Let me remove them first. If there are any strong objections, I can add it back.
docs/ml-features.md
Outdated
|
|
||
| ## OneHotEncoderEstimator | ||
|
|
||
| [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. |
We should add a note that it can handle multiple columns (and returns a one-hot-encoded output vector column for each input column, rather than merging into one output vector).
Also, what about describing the missing / invalid value handling in more detail?
docs/ml-features.md
Outdated
| ## OneHotEncoder | ||
| ## OneHotEncoder (Deprecated since 2.3.0) | ||
|
|
||
| `OneHotEncoder` will be deprecated in 2.3.0 and removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) instead. |
"will be" -> "has been"
and then "and will be removed"
| .getOrCreate() | ||
|
|
||
| // $example on$ | ||
| val df = spark.createDataFrame(Seq( |
I know the examples are re-creating the existing OneHotEncoder examples, but perhaps we should just drop the StringIndexer part and show a simplified example transforming the raw label indices to OHE vectors?
We could mention in the user guide that it is common to encode categorical features using StringIndexer first?
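For readers unfamiliar with the indexing step mentioned here, the following plain-Python sketch shows the general idea of turning string categories into label indices ordered by frequency. It is only an illustration, not `StringIndexer` itself: `fit_string_index` is a hypothetical helper, and breaking ties alphabetically is an assumption made so the output is deterministic.

```python
from collections import Counter


def fit_string_index(labels):
    """Hypothetical helper: map each distinct label to an index, most frequent first."""
    counts = Counter(labels)
    # Assumption: ties in frequency are broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


categories = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_index(categories)
indices = [mapping[c] for c in categories]

print(mapping)
print(indices)
```

The resulting integer indices are the kind of raw label indices a simplified one-hot encoding example would start from.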
OK with me. As an example, it seems a bit lengthy because of the two `StringIndexer`s.
|
@MLnick Thanks for the review. I think I've addressed all the comments. Please take a look at the updates. |
|
Test build #86153 has finished for PR 20257 at commit
|
docs/ml-features.md
Outdated
| ## OneHotEncoder (Deprecated since 2.3.0) | ||
|
|
||
| [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. | ||
| Because this existing `OneHotEncoder` is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new `OneHotEncoderEstimator` was created that produces an `OneHotEncoderModel` when fitting. For more detail, please see the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-13030). |
Change the JIRA link to a Markdown link, e.g.
"see [SPARK-13030](...)"
docs/ml-features.md
Outdated
| [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. | ||
| Because this existing `OneHotEncoder` is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new `OneHotEncoderEstimator` was created that produces an `OneHotEncoderModel` when fitting. For more detail, please see the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-13030). | ||
|
|
||
| `OneHotEncoder` has been deprecated in 2.3.0 and will be removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) for one-hot encoding instead. |
I think you can remove "for one-hot encoding" and just make it "use [OHE](...) instead"
docs/ml-features.md
Outdated
|
|
||
| `OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column. | ||
|
|
||
| `OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error). |
"supports the ..."
docs/ml-features.md
Outdated
|
|
||
| `OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column. | ||
|
|
||
| `OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error). |
"how to handle invalid input during ..."
and "(any invalid inputs are assigned to an extra ..."
"(... to an extra categorical number)"
| .getOrCreate(); | ||
|
|
||
| // $example on$ | ||
| // Notice: this categorical features are usually encoded with `StringIndexer`. |
Perhaps we can move the note above the $example on$ - I don't think it is necessary for it to appear in the user guide as we've mentioned it above.
Also perhaps rather: Note: categorical features are usually first encoded with StringIndexer
| .getOrCreate() | ||
|
|
||
| # $example on$ | ||
| # Notice: this categorical features are usually encoded with `StringIndexer`. |
Same applies here
| .getOrCreate() | ||
|
|
||
| // $example on$ | ||
| // Notice: this categorical features are usually encoded with `StringIndexer`. |
Same applies here.
docs/ml-features.md
Outdated
|
|
||
| [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first. | ||
|
|
||
| `OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column. |
"can handle multi-column. By specifying ..." -> "can transform multiple columns, returning a one-hot-encoded output ..."
docs/ml-features.md
Outdated
|
|
||
| `OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column. | ||
|
|
||
| `OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error). |
"(... to an extra categorical number)"
| new StructField("id", DataTypes.IntegerType, false, Metadata.empty()), | ||
| new StructField("category", DataTypes.StringType, false, Metadata.empty()) | ||
| new StructField("categoryIndex1", DataTypes.DoubleType, false, Metadata.empty()), | ||
| new StructField("categoryIndex2", DataTypes.DoubleType, false, Metadata.empty()) |
There is no need to pass the `Metadata.empty()` param; it's a default value.
It would be better to make the example code simpler.
Since this is a Java example, the default parameter doesn't seem to work:
error: no suitable constructor found for StructField(String,DataType,boolean)
[error] new StructField("categoryIndex1", DataTypes.DoubleType, false),
[error] ^
[error] /root/repos/spark-1/constructor StructField.StructField(String,DataType,boolean,Metadata) is not applicable
[error] (actual and formal argument lists differ in length)
[error] constructor StructField.StructField() is not applicable
docs/ml-features.md
Outdated
|
|
||
| ## OneHotEncoderEstimator | ||
|
|
||
| [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first. |
"with at most a single one-value" --> "each output binary vector include at most a single one-value"
|
@MLnick @WeichenXu123 Your comments are addressed. Please check this again. Thanks. |
|
Test build #86244 has finished for PR 20257 at commit
|
docs/ml-features.md
Outdated
| `OneHotEncoderEstimator` can transform multiple columns, returning a one-hot-encoded output vector column for each input column. | ||
|
|
||
| `OneHotEncoderEstimator` supports `handleInvalid` parameter to choose how to handle invalid data during transforming data. Available options include 'keep' (invalid data presented as an extra categorical feature) and 'error' (throw an error). | ||
| `OneHotEncoderEstimator` supports the `handleInvalid` parameter to choose how to handle invalid input during transforming data. Available options include 'keep' (any invalid inputs are assigned to an extra categorical number) and 'error' (throw an error). |
perhaps "extra categorical number" would read better as "extra categorical index"?
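The two `handleInvalid` options discussed in this thread can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark's code: `encode` is a hypothetical helper, "invalid" is taken to mean an index outside the range seen at fit time, and the sketch ignores any dropping of the last category.

```python
def encode(index, num_categories, handle_invalid="error"):
    """Hypothetical helper: one-hot encode `index` given the category count seen at fit time."""
    valid = 0 <= index < num_categories
    if handle_invalid == "error":
        if not valid:
            raise ValueError(f"unseen category index: {index}")
        size, position = num_categories, index
    elif handle_invalid == "keep":
        # One extra slot at the end collects every invalid input.
        size = num_categories + 1
        position = index if valid else num_categories
    else:
        raise ValueError(f"unknown handle_invalid: {handle_invalid}")
    vec = [0] * size
    vec[position] = 1
    return vec


valid_kept = encode(1, 3, handle_invalid="keep")  # a valid index keeps its own slot
kept = encode(5, 3, handle_invalid="keep")        # an invalid index lands in the extra slot

try:
    encode(5, 3, handle_invalid="error")
    raised = False
except ValueError:
    raised = True

print(valid_kept, kept, raised)
```

Here the "extra categorical index" is simply position `num_categories`, one past the last index seen during fitting.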
docs/ml-features.md
Outdated
| ## OneHotEncoderEstimator | ||
|
|
||
| [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first. | ||
| [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, and each output binary vector includes at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first. |
I don't really like this description as I think it conflates the core of what one-hot-encoding does with the implementation detail of dataframe columns (which we refer to in the next paragraph anyway).
How about "[OHE](...) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values."
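The phrase "at most a single one-value" is easiest to see when the last category is dropped, so that it is represented by an all-zeros vector. The plain-Python sketch below illustrates this; `one_hot` and its `drop_last` flag are hypothetical names modelled on the encoder's `dropLast` option, not Spark's implementation.

```python
def one_hot(index, num_categories, drop_last=True):
    """Hypothetical helper: encode a label index as a 0/1 list.

    With drop_last=True the last category has no slot of its own and is
    encoded as all zeros, so a vector holds at most one 1.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0] * size
    if index < size:
        vec[index] = 1
    return vec


first = one_hot(0, 3)                       # first of three categories
last = one_hot(2, 3)                        # last category: all zeros
last_full = one_hot(2, 3, drop_last=False)  # last category with its own slot

print(first, last, last_full)
```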
docs/ml-features.md
Outdated
| [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, and each output binary vector includes at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first. | ||
|
|
||
| `OneHotEncoderEstimator` can handle multi-column. By specifying multiple input columns, it returns a one-hot-encoded output vector column for each input column. | ||
| `OneHotEncoderEstimator` can transform multiple columns, returning a one-hot-encoded output vector column for each input column. |
Perhaps we should add a note about vector assembling, something like "It is common to merge these vectors into a single feature vector using VectorAssembler"?
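As a rough picture of "one output vector per input column, then merged", here is a plain-Python sketch. It is not the `VectorAssembler` API: rows are modelled as dictionaries, `encode_columns` and `assemble` are hypothetical helpers, the `_vec` suffix is an arbitrary naming choice, and the column names are borrowed from the example under review.

```python
def one_hot(index, size):
    """Return a dense 0/1 list of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_columns(rows, input_cols, sizes):
    """Add one '<col>_vec' entry per input column; the columns are not merged here."""
    encoded = []
    for row in rows:
        new_row = dict(row)
        for col in input_cols:
            new_row[col + "_vec"] = one_hot(row[col], sizes[col])
        encoded.append(new_row)
    return encoded


def assemble(row, vec_cols):
    """Concatenate the per-column vectors into a single feature vector."""
    merged = []
    for col in vec_cols:
        merged.extend(row[col])
    return merged


rows = [
    {"categoryIndex1": 0, "categoryIndex2": 1},
    {"categoryIndex1": 2, "categoryIndex2": 0},
]
sizes = {"categoryIndex1": 3, "categoryIndex2": 2}  # category counts, assumed known from fitting

encoded = encode_columns(rows, ["categoryIndex1", "categoryIndex2"], sizes)
features = [assemble(r, ["categoryIndex1_vec", "categoryIndex2_vec"]) for r in encoded]

print(features)
```

Keeping the encoded columns separate until an explicit assembling step leaves the caller free to choose which vectors end up in the final feature vector.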
|
Added a few more small comments |
|
@MLnick Changed as you suggested. |
|
Test build #86263 has finished for PR 20257 at commit
|
docs/ml-features.md
Outdated
| [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. | ||
|
|
||
| `OneHotEncoderEstimator` can transform multiple columns, returning a one-hot-encoded output vector column for each input column. | ||
| `OneHotEncoderEstimator` can transform multiple columns, returning an one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using `VectorAssembler`. |
Add Markdown link for VectorAssembler
docs/ml-features.md
Outdated
| ## OneHotEncoderEstimator | ||
|
|
||
| [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, and each output binary vector includes at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first. | ||
| [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. |
@viirya sorry for any confusion but I didn't intend you to remove these sentences:
This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
No problem. Added it back.
|
A couple minor comments, otherwise looks fine. I see we are changing the example names, so effectively removing the old examples. I'm ok with this, unless others have an objection? |
|
Test build #86269 has finished for PR 20257 at commit
|
MLnick
left a comment
LGTM now, thanks. @WeichenXu123 ?
|
Nice, LGTM. Thanks! |
## What changes were proposed in this pull request?

We have `OneHotEncoderEstimator` now and `OneHotEncoder` will be deprecated since 2.3.0. We should add `OneHotEncoderEstimator` into mllib document.

We also need to provide corresponding examples for `OneHotEncoderEstimator` which are used in the document too.

## How was this patch tested?

Existing tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #20257 from viirya/SPARK-23048.

(cherry picked from commit b743664)
Signed-off-by: Nick Pentreath <[email protected]>
|
Merged to master / branch-2.3, thanks! |
What changes were proposed in this pull request?
We have `OneHotEncoderEstimator` now and `OneHotEncoder` will be deprecated since 2.3.0. We should add `OneHotEncoderEstimator` into mllib document.
We also need to provide corresponding examples for `OneHotEncoderEstimator` which are used in the document too.
How was this patch tested?
Existing tests.