[SPARK-14238][ML][MLLIB][PYSPARK] Add binary toggle Param to PySpark HashingTF in ML & MLlib #12079

yongtang · 2016-03-31T03:55:59Z

What changes were proposed in this pull request?

This PR adds a binary toggle Param to PySpark's HashingTF in both ML and MLlib. When the toggle is set, all non-zero term counts are set to 1.

Note: This fix (SPARK-14238) extends SPARK-13963, where the Scala implementation was done.
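
For readers unfamiliar with the toggle, here is a minimal pure-Python sketch of what "all non-zero counts are set to 1" means for a hashed term-frequency vector. It is not the code in this patch: the helper names `term_index` and `hashing_tf`, the CRC32 bucket function, and the dict-based sparse vector are all assumptions made for illustration, and Spark uses its own hash function and vector types.

```python
import zlib
from collections import Counter


def term_index(term, num_features):
    # Hypothetical bucket function (CRC32 modulo the vector size).
    # Chosen only because it is deterministic across runs; it is not
    # the hash that Spark's HashingTF uses.
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashing_tf(terms, num_features=1 << 10, binary=False):
    """Return a sparse {index: value} term-frequency mapping.

    With binary=False, each value is how many terms hashed to that index.
    With binary=True, every non-zero count is replaced by 1.0.
    """
    counts = Counter(term_index(t, num_features) for t in terms)
    if binary:
        return {index: 1.0 for index in counts}
    return {index: float(count) for index, count in counts.items()}


doc = ["a", "b", "a", "c", "a"]
tf_counts = hashing_tf(doc)               # values sum to len(doc)
tf_binary = hashing_tf(doc, binary=True)  # same indices, every value 1.0
```

The two results always share the same set of indices; only the values differ, which is why the toggle suits models of binary events rather than integer counts.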

How was this patch tested?

This fix adds two tests to cover the code changes: one for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLlib.

MLnick · 2016-03-31T13:40:29Z

ok to test

SparkQA · 2016-03-31T13:43:18Z

Test build #54623 has finished for PR 12079 at commit e58d1a2.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-31T14:51:51Z

Test build #54631 has finished for PR 12079 at commit 1e24a68.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-03-31T15:19:19Z

python/pyspark/ml/feature.py

We should keep the doc of the Param consistent with Scala.

SparkQA · 2016-03-31T16:38:38Z

Test build #54642 has finished for PR 12079 at commit a71f59b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.


yongtang · 2016-04-11T14:38:58Z

Rebased to fix conflicts.

SparkQA · 2016-04-11T14:53:53Z

Test build #55524 has finished for PR 12079 at commit 829c87e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-04-11T20:13:54Z

One minor note: often we want to go with Scala first and then Python, but in either direction, if we are only doing one language at a time, it is good practice to create either a follow-up JIRA or a subtask on the existing JIRA to also expose the implementation in the other language.

holdenk · 2016-04-11T20:15:53Z

python/pyspark/ml/feature.py

    .. versionadded:: 1.3.0
    """

+    binary = Param(Params._dummy(), "binary", "If true, all non zero counts are set to 1. " +


We probably want to mention the default value here (namely false).

Thanks @holdenk this issue has been addressed.

Great! Looking at the incoming PRs it seems there is a second PR also adding a binary feature to another model - it might make sense to move this to a shared param instead of having it be per-model (although it will require coordination with the other PR timing wise).

See #12308 (comment)

if true -> if True

yongtang · 2016-04-12T04:28:49Z

@holdenk The Scala implementation has been completed in SPARK-13963. I updated the description of this pull request to show the link between this issue (SPARK-14238) and SPARK-13963.

SparkQA · 2016-04-12T04:35:47Z

Test build #55587 has finished for PR 12079 at commit 9c2b4ab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-04-12T08:29:35Z

python/pyspark/ml/feature.py


+    binary = Param(Params._dummy(), "binary", "If true, all non zero counts are set to 1. " +
+                   "This is useful for discrete probabilistic models that model binary events " +
+                   "rather than integer counts. (default: False)",


The style seems to be ". Default False" rather than ". (default: False)". @BryanCutler @holdenk thoughts?

Though I must say I'd prefer "(default: X)." across the board myself.

MLnick · 2016-04-12T08:50:36Z

A few minor comments, otherwise LGTM.

@holdenk @BryanCutler we could merge this and #12308, and then update the param to be shared (if we can do the different doc thing?).

yongtang · 2016-04-12T13:52:42Z

Thanks @MLnick, I just updated the pull request to address several minor issues. With respect to ". Default False" vs ". (default: False)", I changed it to ". Default False" for now. If you would prefer "(default: X)", I can change it (including the rest of the file) as well.

SparkQA · 2016-04-12T13:58:57Z

Test build #55608 has finished for PR 12079 at commit 551cc6e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2016-04-12T15:44:59Z

> @holdenk @BryanCutler we could merge this and #12308, and then update the param to be shared (if we can do the different doc thing?).

I think that will be better and maybe then we can change the param to be shared on both the Scala and Python side.

holdenk · 2016-04-12T15:51:28Z

@BryanCutler / @yongtang That sounds reasonable :)

MLnick · 2016-04-12T18:45:57Z

As per @jkbradley's #12308 (comment), let's keep them separate params.

MLnick · 2016-04-14T19:25:35Z

jenkins retest this please

SparkQA · 2016-04-14T19:41:22Z

Test build #55839 has finished for PR 12079 at commit 551cc6e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-04-14T19:54:36Z

LGTM. Merged to master.

yongtang added 3 commits April 11, 2016 07:31

Fix PEP8 errors.

dabaf89

Update comment blocks and args according to @yanboliang's comment.

829c87e

holdenk reviewed Apr 11, 2016

Minor fix in comment based on feedbacks from @holdenk and @BryanCutler.

9c2b4ab

MLnick reviewed Apr 12, 2016

Several comment area fixes based on comments from @MLnick.

551cc6e

asfgit closed this in bc748b7 Apr 14, 2016

yongtang deleted the SPARK-14238 branch April 14, 2016 20:15

MLnick mentioned this pull request Apr 15, 2016

[SPARK-14644][ML][PYSPARK] Turn Binary param into a shared param #12404

Closed

6 participants