Skip to content

Conversation

@holdenk
Copy link
Contributor

@holdenk holdenk commented Apr 15, 2016

What changes were proposed in this pull request?

Change the binary param into a shared param.

How was this patch tested?

Existing unit tests

@SparkQA
Copy link

SparkQA commented Apr 15, 2016

Test build #55872 has finished for PR 12404 at commit edbd652.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CountVectorizer(JavaEstimator, HasBinary, HasInputCol, HasOutputCol, JavaMLReadable,
    • class HashingTF(JavaTransformer, HasBinary, HasInputCol, HasOutputCol, HasNumFeatures,


class HasBinary(Params):
"""
Mixin for param binary: If True, all non zero results are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how come this didn't fail our style checker? did we break the pep8 checker?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From toxi.ini we've excluded cloudpickle.py,heapq3.py, and shared.py, we've also done similar exclusions on the Scala side for the auto generated params.

@MLnick
Copy link
Contributor

MLnick commented Apr 15, 2016

Hey @holdenk, in #12079 (comment) and #12308 (comment), we decided to keep them separate for now, since the doc for binary in CountVectorizer mentions that the binarization occurs after minTF is applied. In HashingTF this doesn't apply as there is no minTF parameter.

So unless we can easily have different doc string, I think it might be best to not separate out binary into a shared param yet. We could look at adding minTF semantics to hashingTF, in which case it would then make a lot of sense to have this as a shared param.

@holdenk
Copy link
Contributor Author

holdenk commented Apr 15, 2016

So I thought the unified doc string of:

If true, all non zero results are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False.

might make sense in the two cases where we are using it (since minTF is described as a filter). If you think this isn't clear enough with the minTF case on HashingTF I'm happy to close it for now and revisit if we add minTF to CountVectorizer.

@MLnick
Copy link
Contributor

MLnick commented Apr 15, 2016

ah - sorry I missed that. Ok, yeah this makes sense if we unify the doc string appropriately. Perhaps something like If true, all non-zero counts (after any filters are applied) ...?

@holdenk
Copy link
Contributor Author

holdenk commented Apr 15, 2016

Sure that sounds like a clearer docstring :)

@jkbradley
Copy link
Member

I prefer to keep them separate. There isn't much benefit to combining 2 instances, and the doc becomes a bit less clear.

@MLnick
Copy link
Contributor

MLnick commented Apr 15, 2016

Fair enough. We can revisit if more estimators use the param, or if we add
minTF to HashingTF.
On Fri, 15 Apr 2016 at 22:54, jkbradley [email protected] wrote:

I prefer to keep them separate. There isn't much benefit to combining 2
instances, and the doc becomes a bit less clear.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#12404 (comment)

@holdenk
Copy link
Contributor Author

holdenk commented Apr 15, 2016

Sounds good if people prefer to keep them separate I'll close this out.

@holdenk holdenk closed this Apr 15, 2016
@SparkQA
Copy link

SparkQA commented Apr 15, 2016

Test build #55961 has finished for PR 12404 at commit b702e72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants