Skip to content

Conversation

@BryanCutler
Copy link
Member

Added binary toggle param to CountVectorizer feature transformer in PySpark.

Created a unit test for using CountVectorizer with the binary toggle on.

@BryanCutler
Copy link
Member Author

cc @MLnick

@holdenk
Copy link
Contributor

holdenk commented Apr 11, 2016

Since we seem to be adding the same param to multiple models, would it maybe make sense to make this a shared param?

@SparkQA
Copy link

SparkQA commented Apr 11, 2016

Test build #55541 has finished for PR 12308 at commit 7d362ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member Author

Since we seem to be adding the same param to multiple models, would it maybe make sense to make this a shared param?

Maybe, I'm not sure if there are other uses besides HashingTF but I know @MLnick was proposing a more general feature hasher that seems like it would reduce a lot more code duplication..

@MLnick
Copy link
Contributor

MLnick commented Apr 12, 2016

Will take a look

@MLnick
Copy link
Contributor

MLnick commented Apr 12, 2016

@holdenk @BryanCutler I'd say we could make binary shared, but the only thing is currently the doc is a bit different between them (the doc for CountVectorizer mentions the binarization occurs after minTF is applied, while minTF doesn't exist for HashingTF). So if there's a way to easily make them shared but have different doc, then go ahead.

We could later add minTF to HashingTF I guess, in which case it can definitely be a shared param.

@MLnick
Copy link
Contributor

MLnick commented Apr 12, 2016

LGTM otherwise.

@jkbradley
Copy link
Member

+1 for not sharing the Param if the docs (and semantics) differ

outputCol=None):
"""
setParams(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, inputCol=None, outputCol=None)
setParams(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a backslash at the end. Please check the generated HTML doc.

@BryanCutler
Copy link
Member Author

Need a backslash at the end. Please check the generated HTML doc

This should have cause the doc checks to fail right?

@SparkQA
Copy link

SparkQA commented Apr 12, 2016

Test build #55633 has finished for PR 12308 at commit 3907475.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Apr 12, 2016

@BryanCutler it seems that (due to a bug introduced during code review) right now we only hault on doc build errors, not warnings as originally intended. Thanks for noticing this I'll make a follow up JIRA to deal with this.

@BryanCutler
Copy link
Member Author

It looks like the option to treat warnings as errors is there, it just gets overwritten in the makefile

@holdenk
Copy link
Contributor

holdenk commented Apr 12, 2016

Yup I've got a fix but going to cleanup the warnings that got in the meantime too.

@MLnick
Copy link
Contributor

MLnick commented Apr 13, 2016

@mengxr @jkbradley anything further?

@jkbradley
Copy link
Member

LGTM, feel free to merge, thanks!

@asfgit asfgit closed this in c5172f8 Apr 14, 2016
@MLnick
Copy link
Contributor

MLnick commented Apr 14, 2016

Merged to master. Thanks!

@BryanCutler BryanCutler deleted the binary-param-python-CountVectorizer-SPARK-13967 branch December 2, 2016 00:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants