[SPARK-15957] [ML] RFormula supports forcing to index label #13675
|
Test build #60538 has finished for PR 13675 at commit
|
|
Test build #60598 has finished for PR 13675 at commit
|
up this to 2.1.0 now?
Done.
Rename to "forceIndexLabel" since we can still index String labels when indexLabel=false
Agree, done.
|
Test build #3301 has finished for PR 13675 at commit
|
a7015ef to
3a8660f
Compare
|
Test build #66661 has finished for PR 13675 at commit
|
|
Test build #66662 has finished for PR 13675 at commit
|
 * Usually we index label only when it is string type.
 * If the formula was used by classification algorithms,
 * we can force to index label even it is numeric type by setting this param with true.
 * Default: false.
@group param
|
I just noticed that last item, but otherwise, this looks ready to me. Thanks!
|
Test build #66693 has finished for PR 13675 at commit
|
|
Does this affect R code? Could we add some R tests for this?
|
@felixcheung This PR does not affect R code. I will send another PR to fix issues like SPARK-15153, which will need some R tests.
|
I'll merge this into master, thanks for the review! @jkbradley @felixcheung
…ceIndexLabel.

## What changes were proposed in this pull request?
Follow-up work of apache#13675, add Python API for ```RFormula forceIndexLabel```.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <[email protected]>

Closes apache#15430 from yanboliang/spark-15957-python.
## What changes were proposed in this pull request?
```RFormula``` currently indexes the label only when it is of string type. If the label is numeric and we use ```RFormula``` to represent a classification model, the label column metadata contains no label attributes. The label attributes are useful when making predictions for classification, so for classification we can force the label to be indexed by ```StringIndexer``` whether it is numeric or string. SparkR wrappers can then extract the label attributes from the label column metadata successfully. This feature can help us fix bugs similar to [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153). For regression, we still keep the label as numeric type.

In this PR, we add a param ```indexLabel``` to control whether to force indexing of the label for ```RFormula```.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <[email protected]>

Closes apache#13675 from yanboliang/spark-15957.
What changes were proposed in this pull request?
RFormula currently indexes the label only when it is of string type. If the label is numeric and we use RFormula to represent a classification model, the label column metadata contains no label attributes. The label attributes are useful when making predictions for classification, so for classification we can force the label to be indexed by StringIndexer whether it is numeric or string. SparkR wrappers can then extract the label attributes from the label column metadata successfully. This feature can help us fix bugs similar to SPARK-15153.
For regression, we still keep the label as numeric type.
In this PR, we add a param indexLabel (renamed to forceIndexLabel during review) to control whether to force indexing of the label for RFormula.
How was this patch tested?
Unit tests.
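
The label handling described above can be pictured with a small pure-Python sketch. This is a toy model and not Spark code: the function name `index_label`, its return shape, and the frequency-based ordering with ties broken by value are assumptions made for illustration, standing in for what StringIndexer does inside RFormula.

```python
from collections import Counter


def index_label(labels, force_index_label=False):
    """Toy model of the label handling described in this PR (not Spark code).

    Returns (values, label_attributes). label_attributes is None when the
    label column is passed through without indexing.
    """
    is_string = all(isinstance(v, str) for v in labels)
    if not (is_string or force_index_label):
        # Numeric label and no forcing: keep it numeric (the regression case).
        return [float(v) for v in labels], None

    # Assumption: categories are ordered by descending frequency, with ties
    # broken by the string value, as a stand-in for StringIndexer's ordering.
    as_str = [str(v) for v in labels]
    counts = Counter(as_str)
    ordered = sorted(counts, key=lambda k: (-counts[k], k))
    index_of = {k: float(i) for i, k in enumerate(ordered)}
    return [index_of[s] for s in as_str], ordered


numeric_label = [3, 1, 3, 2, 3, 1]
# Default behaviour: the numeric label is untouched and carries no attributes.
print(index_label(numeric_label))
# Forced indexing: the label becomes category indices plus recorded attributes.
print(index_label(numeric_label, force_index_label=True))
```

In the sketch, only the forced call yields a list of label attributes for a numeric label, which is the metadata the description says the SparkR wrappers need. In Spark itself the switch is the `forceIndexLabel` param on `RFormula`, with the Python API added by the follow-up apache#15430.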