[SPARK-14635] [ML] Documentation and Examples for TF-IDF only refer to HashingTF #12454

hhbyyh · 2016-04-17T08:54:31Z

What changes were proposed in this pull request?

Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
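For readers unfamiliar with the distinction, here is a minimal sketch of the count-based route: build an explicit vocabulary, count tokens per document (the role `CountVectorizer` plays), then rescale by IDF. It is plain Python with no Spark dependency, not the Spark API. The function names (`count_vectors`, `idf_weights`, `tf_idf`) and the sample sentences are hypothetical, and the smoothed form log((m + 1) / (df + 1)) is assumed here because it is the form the Spark ML user guide gives for IDF.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Build a sorted vocabulary and one term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            vec[index[tok]] = n
        vectors.append(vec)
    return vocab, vectors


def idf_weights(vectors):
    """Smoothed IDF per term: log((m + 1) / (df + 1)), m = number of documents."""
    m = len(vectors)
    n_terms = len(vectors[0]) if vectors else 0
    weights = []
    for j in range(n_terms):
        # Document frequency: how many documents contain term j at least once.
        df = sum(1 for vec in vectors if vec[j] > 0)
        weights.append(math.log((m + 1) / (df + 1)))
    return weights


def tf_idf(docs):
    """Scale each count vector element-wise by the IDF weight of its term."""
    vocab, vectors = count_vectors(docs)
    weights = idf_weights(vectors)
    scaled = [[tf * w for tf, w in zip(vec, weights)] for vec in vectors]
    return vocab, scaled


# Illustrative, already-tokenized input (an assumption, not data from this PR).
docs = [
    "hi i heard about spark".split(),
    "i wish java could use case classes".split(),
    "logistic regression models are neat".split(),
]
vocab, features = tf_idf(docs)
```

Unlike feature hashing, this approach keeps an index-to-term mapping (`vocab`), so each feature can be traced back to a token, at the cost of an extra pass over the data to build the vocabulary.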

How was this patch tested?

Unit tests and doc generation.

SparkQA · 2016-04-17T09:20:06Z

Test build #56049 has finished for PR 12454 at commit 72594f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-04-17T18:13:39Z

OK by me

holdenk · 2016-04-18T02:33:20Z

examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala


 import org.apache.spark.{SparkConf, SparkContext}
 // $example on$
-import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}


Why did we add CountVectorizer as an import here and not in JavaTfIdfExample? Since neither example references it in code, I'd personally leave it out of both (but either way, consistency would be best).

Thanks. I'll send an update.

MLnick · 2016-04-18T07:09:23Z

docs/ml-features.md

+`CountVectorizer` converts text documents to vectors of token counts. Refer to [CountVectorizer
+](ml-features.html#countvectorizer) for more details.
+
+**IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`.  The 


which fits on -> which is fit on

MLnick · 2016-04-18T07:19:15Z

I had originally thought we could include some code for CountVectorizer in the example... but it might be a bit verbose then, so this looks better actually.

Made a few small doc comments, otherwise LGTM.

hhbyyh · 2016-04-18T12:24:10Z

Thanks for the careful review. Updated according to the comments.

SparkQA · 2016-04-18T17:46:09Z

Test build #56075 has finished for PR 12454 at commit cb84f5e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-04-20T10:45:34Z

Merged to master

add cv to tf doc

72594f0

Merge remote-tracking branch 'upstream/master' into tfdoc

8b37c02

holdenk reviewed Apr 18, 2016

revert import change

71c225f

MLnick reviewed Apr 18, 2016

hhbyyh added 3 commits April 18, 2016 08:20

doc refine

ad4a033

Merge branch 'tfdoc' of https://github.com/hhbyyh/spark into tfdoc

762dbbd

Merge remote-tracking branch 'upstream/master' into tfdoc

cb84f5e

asfgit closed this in ed9d803 Apr 20, 2016
