Skip to content

Conversation

@yongtang
Copy link
Contributor

What changes were proposed in this pull request?

This fix tries to add binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1.

Note: This fix (SPARK-14238) is extended from SPARK-13963 where Scala implementation was done.

How was this patch tested?

This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib.

@MLnick
Copy link
Contributor

MLnick commented Mar 31, 2016

ok to test

@SparkQA
Copy link

SparkQA commented Mar 31, 2016

Test build #54623 has finished for PR 12079 at commit e58d1a2.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 31, 2016

Test build #54631 has finished for PR 12079 at commit 1e24a68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep the doc of the Param consistent with Scala.

@SparkQA
Copy link

SparkQA commented Mar 31, 2016

Test build #54642 has finished for PR 12079 at commit a71f59b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…HashingTF in ML & MLlib

This fix tries to add binary toggle Param to PySpark HashingTF in ML & MLlib.
If this toggle is set, then all non-zero counts will be set to 1.

This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML
and one for HashingTF in PySpark's MLLib.
@yongtang
Copy link
Contributor Author

Rebased to fix conflicts.

@SparkQA
Copy link

SparkQA commented Apr 11, 2016

Test build #55524 has finished for PR 12079 at commit 829c87e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Apr 11, 2016

One minor note:Often we want to go with Scala first then Python, but in either direction if we are only doing one at a time it can be good practice to create either a follow up JIRA or a subtask on the existing JIRA to also expose the implementation in the other language.

.. versionadded:: 1.3.0
"""

binary = Param(Params._dummy(), "binary", "If true, all non zero counts are set to 1. " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to mention the default value here (namely false).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @holdenk this issue has been addressed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Looking at the incoming PRs it seems there is a second PR also adding a binary feature to another model - it might make sense to move this to a shared param instead of having it be per-model (although it will require coordination with the other PR timing wise).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if true -> if True

@yongtang
Copy link
Contributor Author

@holdenk The Scala implementation has ben completed in SPARK-13963. I updated the description of this pull request to show the linkage between this issue (SPARK-14238) and SPARK-13963.

@SparkQA
Copy link

SparkQA commented Apr 12, 2016

Test build #55587 has finished for PR 12079 at commit 9c2b4ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


binary = Param(Params._dummy(), "binary", "If true, all non zero counts are set to 1. " +
"This is useful for discrete probabilistic models that model binary events " +
"rather than integer counts. (default: False)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style seems to be . Default False rather than . (default: False). @BryanCutler @holdenk thoughts?

Though I must say I'd prefer (default: X). across the board myself.

@MLnick
Copy link
Contributor

MLnick commented Apr 12, 2016

A few minor comments, otherwise LGTM.

@holdenk @BryanCutler we could merge this and #12308, and then update the param to be shared (if we can do the different doc thing?).

@yongtang
Copy link
Contributor Author

Thanks @MLnick I just updated the pull request to address several minor issues. With respect to . Default False vs . (default: False), I changed it to . Default False for now. But if you want to see (default: X) I can change it (including the rest of the file) to it as well.

@SparkQA
Copy link

SparkQA commented Apr 12, 2016

Test build #55608 has finished for PR 12079 at commit 551cc6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member

@holdenk @BryanCutler we could merge this and #12308, and then update the param to be shared (if we can do the different doc thing?).

I think that will be better and maybe then we can change the param to be shared on both the Scala and Python side.

@holdenk
Copy link
Contributor

holdenk commented Apr 12, 2016

@BryanCutler / @yongtang That sounds reasonable :)

@MLnick
Copy link
Contributor

MLnick commented Apr 12, 2016

As per @jkbradley's #12308 (comment), let's keep them separate params.

@MLnick
Copy link
Contributor

MLnick commented Apr 14, 2016

jenkins retest this please

@SparkQA
Copy link

SparkQA commented Apr 14, 2016

Test build #55839 has finished for PR 12079 at commit 551cc6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Apr 14, 2016

LGTM. Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants