[SPARK-14238][ML][MLLIB][PYSPARK] Add binary toggle Param to PySpark HashingTF in ML & MLlib #12079
|
ok to test |
|
Test build #54623 has finished for PR 12079 at commit
|
|
Test build #54631 has finished for PR 12079 at commit
|
python/pyspark/ml/feature.py
Outdated
We should keep the doc of the Param consistent with Scala.
|
Test build #54642 has finished for PR 12079 at commit
|
…HashingTF in ML & MLlib This fix tries to add binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1. This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib.
|
Rebased to fix conflicts. |
|
Test build #55524 has finished for PR 12079 at commit
|
|
One minor note: often we want to go with Scala first, then Python, but in either direction, if we are only doing one at a time, it is good practice to create either a follow-up JIRA or a subtask on the existing JIRA to also expose the implementation in the other language. |
python/pyspark/ml/feature.py
Outdated
| .. versionadded:: 1.3.0 | ||
| """ | ||
|
|
||
| binary = Param(Params._dummy(), "binary", "If true, all non zero counts are set to 1. " + |
We probably want to mention the default value here (namely false).
Thanks @holdenk this issue has been addressed.
Great! Looking at the incoming PRs, it seems there is a second PR also adding a binary feature to another model. It might make sense to move this to a shared param instead of having it be per-model (although that will require coordinating the timing with the other PR).
See #12308 (comment)
if true -> if True
|
@holdenk The Scala implementation has been completed in SPARK-13963. I updated the description of this pull request to show the link between this issue (SPARK-14238) and SPARK-13963. |
|
Test build #55587 has finished for PR 12079 at commit
|
python/pyspark/ml/feature.py
Outdated
|
|
||
| binary = Param(Params._dummy(), "binary", "If true, all non zero counts are set to 1. " + | ||
| "This is useful for discrete probabilistic models that model binary events " + | ||
| "rather than integer counts. (default: False)", |
The style seems to be `. Default False` rather than `. (default: False)`. @BryanCutler @holdenk thoughts?
Though I must say I'd prefer `(default: X).` across the board myself.
|
A few minor comments, otherwise LGTM. @holdenk @BryanCutler we could merge this and #12308, and then update the param to be shared (if we can do the different doc thing?). |
|
Thanks @MLnick I just updated the pull request to address several minor issues. With respect to |
|
Test build #55608 has finished for PR 12079 at commit
|
I think that will be better and maybe then we can change the param to be shared on both the Scala and Python side. |
|
@BryanCutler / @yongtang That sounds reasonable :) |
|
As per @jkbradley's #12308 (comment), let's keep them separate params. |
|
jenkins retest this please |
|
Test build #55839 has finished for PR 12079 at commit
|
|
LGTM. Merged to master. |
What changes were proposed in this pull request?
This fix adds a binary toggle Param to PySpark's HashingTF in ML & MLlib. If this toggle is set, all non-zero counts are set to 1.
Note: This fix (SPARK-14238) extends SPARK-13963, where the Scala implementation was done.
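For readers unfamiliar with the toggle, here is a rough pure-Python sketch of the behavior described above. It is not the PySpark implementation: the function name `hashing_tf` is hypothetical, and `zlib.crc32` is only a deterministic stand-in for whatever hash function Spark uses internally.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=1 << 10, binary=False):
    """Hypothetical helper: hash terms into a sparse {index: value} vector.

    zlib.crc32 is a stand-in hash chosen for determinism; it is an
    assumption of this sketch, not Spark's actual hash function.
    """
    # Count how many terms land in each hashed bucket.
    counts = Counter(
        zlib.crc32(term.encode("utf-8")) % num_features for term in terms
    )
    if binary:
        # Binary toggle: every non-zero count becomes 1.
        return {index: 1.0 for index in counts}
    return {index: float(count) for index, count in counts.items()}


doc = ["a", "a", "b", "c", "c", "c"]
counts = hashing_tf(doc)               # integer term counts per bucket
flags = hashing_tf(doc, binary=True)   # presence/absence per bucket
```

Both calls populate the same buckets; only the stored values differ, which is what makes the binary form suitable for models of binary events rather than integer counts.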
How was this patch tested?
This fix adds two tests to cover the code changes: one for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLlib.