[SPARK-14635] [ML] Documentation and Examples for TF-IDF only refer to HashingTF #12454
Test build #56049 has finished for PR 12454 at commit
OK by me
import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
Why did we add CountVectorizer as an import here and not in JavaTfIdfExample? Since neither example references it in code, I'd personally leave it out of both, but either way the two should be consistent.
Thanks. I'll send an update.
docs/ml-features.md
Outdated
`CountVectorizer` converts text documents to vectors of token counts. Refer to [CountVectorizer
](ml-features.html#countvectorizer) for more details.

**IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`. The
which fits on -> which is fit on
I had originally thought we could include some code for Made a few small doc comments, otherwise LGTM.
Thanks for the careful review. Updated according to the comments.
Test build #56075 has finished for PR 12454 at commit
Merged to master
What changes were proposed in this pull request?
Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
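To make the distinction concrete, here is a minimal plain-Java sketch of what a count-based term-frequency step followed by IDF computes. It is not Spark code and does not use the `CountVectorizer` or `IDF` classes; the class and method names (`CountTfIdfSketch`, `buildVocabulary`, `termCounts`, `inverseDocumentFrequency`, `tfIdf`) are hypothetical, and the vocabulary is built in first-seen order for brevity, which is an assumption rather than how Spark orders it. The IDF uses the smoothed form log((m + 1) / (df + 1)) described in the Spark feature guide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is NOT Spark's CountVectorizer or IDF API.
class CountTfIdfSketch {

    // Assign each distinct term its own index, in first-seen order.
    // Assumption for brevity: a real vocabulary may be ordered and pruned differently.
    static Map<String, Integer> buildVocabulary(List<List<String>> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                vocab.putIfAbsent(term, vocab.size());
            }
        }
        return vocab;
    }

    // Term-frequency vector: one exact count per vocabulary entry (no hash collisions).
    static double[] termCounts(List<String> doc, Map<String, Integer> vocab) {
        double[] counts = new double[vocab.size()];
        for (String term : doc) {
            Integer index = vocab.get(term);
            if (index != null) {
                counts[index] += 1.0;
            }
        }
        return counts;
    }

    // Smoothed inverse document frequency: log((m + 1) / (df + 1)) over m documents.
    static double[] inverseDocumentFrequency(List<double[]> tf) {
        int m = tf.size();
        int n = tf.isEmpty() ? 0 : tf.get(0).length;
        double[] idf = new double[n];
        for (int j = 0; j < n; j++) {
            int df = 0;
            for (double[] counts : tf) {
                if (counts[j] > 0) {
                    df++;
                }
            }
            idf[j] = Math.log((m + 1.0) / (df + 1.0));
        }
        return idf;
    }

    // Scale each document's term counts by the corpus-wide IDF weights.
    static List<double[]> tfIdf(List<List<String>> docs) {
        Map<String, Integer> vocab = buildVocabulary(docs);
        List<double[]> tf = new ArrayList<>();
        for (List<String> doc : docs) {
            tf.add(termCounts(doc, vocab));
        }
        double[] idf = inverseDocumentFrequency(tf);
        List<double[]> result = new ArrayList<>();
        for (double[] counts : tf) {
            double[] scaled = new double[counts.length];
            for (int j = 0; j < counts.length; j++) {
                scaled[j] = counts[j] * idf[j];
            }
            result.add(scaled);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("spark", "spark", "ml"),
            Arrays.asList("spark", "idf"));
        System.out.println(buildVocabulary(docs).keySet());
        for (double[] row : tfIdf(docs)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because every term keeps its own index, each weight can be traced back to a term, which is the practical difference from a hashed term-frequency vector, where distinct terms may share a bucket.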
How was this patch tested?
Unit tests and doc generation.