[SPARK-13967] [PYSPARK][ML] Added binary Param to Python CountVectorizer #12308

BryanCutler · 2016-04-11T22:13:38Z

Added binary toggle param to CountVectorizer feature transformer in PySpark.

Created a unit test for using CountVectorizer with the binary toggle on.

BryanCutler · 2016-04-11T22:13:49Z

cc @MLnick

holdenk · 2016-04-11T22:21:09Z

Since we seem to be adding the same param to multiple models, would it maybe make sense to make this a shared param?

SparkQA · 2016-04-11T22:27:06Z

Test build #55541 has finished for PR 12308 at commit 7d362ed.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2016-04-11T23:16:38Z

> Since we seem to be adding the same param to multiple models, would it maybe make sense to make this a shared param?

Maybe. I'm not sure there are other uses besides HashingTF, but I know @MLnick was proposing a more general feature hasher, which seems like it would reduce a lot more code duplication.

MLnick · 2016-04-12T08:09:44Z

Will take a look

MLnick · 2016-04-12T08:13:12Z

@holdenk @BryanCutler I'd say we could make binary shared; the only catch is that the docs currently differ slightly (the doc for CountVectorizer mentions that binarization occurs after minTF is applied, while minTF doesn't exist for HashingTF). So if there's an easy way to make them shared but with different docs, then go ahead.

We could later add minTF to HashingTF I guess, in which case it can definitely be a shared param.
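The ordering described above (filter by minTF first, then binarize) can be illustrated with a plain-Python sketch. This is not the Spark implementation: the function name `count_vector` and its arguments are hypothetical, and it assumes the documented CountVectorizer reading of minTF, where a value of 1 or more is an absolute count and a value below 1 is a fraction of the document's token count.

```python
from collections import Counter


def count_vector(tokens, vocabulary, min_tf=1.0, binary=False):
    """Hypothetical helper: term counts for one document, in vocabulary order."""
    counts = Counter(tokens)
    # Assumption: min_tf >= 1 is an absolute count, otherwise a fraction
    # of the number of tokens in this document.
    threshold = min_tf if min_tf >= 1.0 else min_tf * len(tokens)
    vector = []
    for term in vocabulary:
        count = counts.get(term, 0)
        if count == 0 or count < threshold:
            # Terms below the minTF threshold are dropped before binarization.
            vector.append(0.0)
        elif binary:
            # Surviving terms are clamped to 1.0 when binary is on.
            vector.append(1.0)
        else:
            vector.append(float(count))
    return vector


doc = ["a", "b", "b", "c", "c", "c"]
vocab = ["a", "b", "c"]
raw = count_vector(doc, vocab, min_tf=2.0, binary=False)
flagged = count_vector(doc, vocab, min_tf=2.0, binary=True)
```

With `min_tf=2.0`, the term `a` is filtered out in both cases, so turning `binary` on only affects the terms that already passed the threshold. HashingTF has no such threshold, which is why the two docs read differently.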

MLnick · 2016-04-12T08:21:44Z

LGTM otherwise.

jkbradley · 2016-04-12T18:18:41Z

+1 for not sharing the Param if the docs (and semantics) differ

mengxr · 2016-04-12T18:32:59Z

python/pyspark/ml/feature.py

+                  outputCol=None):
        """
-        setParams(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, inputCol=None, outputCol=None)
+        setParams(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None,


Need a backslash at the end. Please check the generated HTML doc.
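For context, a small sketch of what the backslash changes. In a regular (non-raw) Python docstring, a backslash immediately before the newline is a line continuation, so the wrapped signature stays on one logical line in `__doc__`, which is presumably what the doc build expects. The function names and the shortened signature below are illustrative only, not the actual code in `feature.py`.

```python
def with_backslash():
    """
    setParams(self, minTF=1.0, binary=False, inputCol=None, \
              outputCol=None)
    Sets params (abbreviated signature, for illustration only).
    """


def without_backslash():
    """
    setParams(self, minTF=1.0, binary=False, inputCol=None,
              outputCol=None)
    Sets params (abbreviated signature, for illustration only).
    """


# First non-empty line of each docstring, as a doc tool would see it.
joined = with_backslash.__doc__.strip().splitlines()[0]
split = without_backslash.__doc__.strip().splitlines()[0]
```

Here `joined` holds the whole signature through `outputCol=None)`, while `split` stops at `inputCol=None,` and the rest of the signature lands on a separate line.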

BryanCutler · 2016-04-12T19:12:49Z

> Need a backslash at the end. Please check the generated HTML doc

This should have caused the doc checks to fail, right?

SparkQA · 2016-04-12T19:14:21Z

Test build #55633 has finished for PR 12308 at commit 3907475.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-04-12T20:18:39Z

@BryanCutler it seems that (due to a bug introduced during code review) right now we only halt on doc build errors, not on warnings as originally intended. Thanks for noticing this; I'll make a follow-up JIRA to deal with it.

BryanCutler · 2016-04-12T20:31:59Z

It looks like the option to treat warnings as errors is there; it just gets overwritten in the makefile.

holdenk · 2016-04-12T20:49:26Z

Yup, I've got a fix, but I'm going to clean up the warnings that got in in the meantime too.

MLnick · 2016-04-13T18:38:06Z

@mengxr @jkbradley anything further?

jkbradley · 2016-04-13T22:41:50Z

LGTM, feel free to merge, thanks!

MLnick · 2016-04-14T18:50:27Z

Merged to master. Thanks!

[SPARK-13967] Added binary Param to Python CountVectorizer and unit test

7d362ed

MLnick mentioned this pull request Apr 12, 2016

[SPARK-14238][ML][MLLIB][PYSPARK] Add binary toggle Param to PySpark HashingTF in ML & MLlib #12079

Closed

mengxr reviewed Apr 12, 2016

fixed doc warning

3907475

asfgit closed this in c5172f8 Apr 14, 2016

MLnick mentioned this pull request Apr 15, 2016

[SPARK-14644][ML][PYSPARK] Turn Binary param into a shared param #12404

Closed

BryanCutler deleted the binary-param-python-CountVectorizer-SPARK-13967 branch December 2, 2016 00:59
