Skip to content

Conversation

@hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented Apr 17, 2016

What changes were proposed in this pull request?

Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.

How was this patch tested?

unit tests and doc generation

@SparkQA
Copy link

SparkQA commented Apr 17, 2016

Test build #56049 has finished for PR 12454 at commit 72594f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Apr 17, 2016

OK by me


import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we add CountVectorizer as an import here and not in JavaTfIdfExample? Since we aren't referencing it in code in either I'd probably leave it out of both personally (but either way consistency would be best).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'll send an update.

`CountVectorizer` converts text documents to vectors of token counts. Refer to [CountVectorizer
](ml-features.html#countvectorizer) for more details.

**IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`. The
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which fits on -> which is fit on

@MLnick
Copy link
Contributor

MLnick commented Apr 18, 2016

I had originally thought we could include some code for CountVectorizer in the example... but it might be a bit verbose then, so this looks better actually.

Made a few small doc comments, otherwise LGTM.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Apr 18, 2016

Thanks for the careful review. Updated according to the comments.

@SparkQA
Copy link

SparkQA commented Apr 18, 2016

Test build #56075 has finished for PR 12454 at commit cb84f5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Apr 20, 2016

Merged to master

@asfgit asfgit closed this in ed9d803 Apr 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants