
Commit 72594f0

add cv to tf doc
1 parent 3394b12 commit 72594f0

File tree

4 files changed: +19 -4 lines changed


docs/ml-features.md

Lines changed: 12 additions & 3 deletions
@@ -22,10 +22,19 @@ This section covers algorithms for working with features, roughly divided into these groups:
 
 [Term Frequency-Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common text pre-processing step. In Spark ML, TF-IDF is separate into two parts: TF (+hashing) and IDF.
 
-**TF**: `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
-The algorithm combines Term Frequency (TF) counts with the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+**TF**: Both `HashingTF` and `CountVectorizer` can be used to get the term frequency.
 
-**IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`. The `IDFModel` takes feature vectors (generally created from `HashingTF`) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
+`HashingTF` is a `Transformer` which takes sets of terms and converts those sets into
+fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
+The algorithm combines Term Frequency (TF) counts with the
+[hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+
+`CountVectorizer` converts text documents to vectors of token counts. Refer to [CountVectorizer
+](ml-features.html#countvectorizer) for more details.
+
+**IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`. The
+`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and scales each column.
+Intuitively, it down-weights columns which appear frequently in a corpus.
 
 Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
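To make the revised doc text concrete, here is a minimal Scala sketch of the `CountVectorizer`-based TF path followed by `IDF`. It mirrors the data and column names of the existing TfIdfExample.scala; the `setVocabSize(20)` setting is an illustrative assumption chosen to parallel the example's `numFeatures`, not part of this commit:

import org.apache.spark.ml.feature.{CountVectorizer, IDF, Tokenizer}

// assumes an existing sqlContext, as in TfIdfExample.scala
val sentenceData = sqlContext.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// TF: CountVectorizer is an Estimator -- it learns a vocabulary from the corpus,
// then maps each document to a vector of token counts
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setVocabSize(20)
  .fit(wordsData)
val featurizedData = cvModel.transform(wordsData)

// IDF: rescales the counts, down-weighting terms that appear in many documents
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)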

examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java

Lines changed: 2 additions & 0 deletions
@@ -63,6 +63,8 @@ public static void main(String[] args) {
       .setOutputCol("rawFeatures")
       .setNumFeatures(numFeatures);
     Dataset<Row> featurizedData = hashingTF.transform(wordsData);
+    // alternatively, CountVectorizer can also be used to get term frequency vectors
+
     IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
     IDFModel idfModel = idf.fit(featurizedData);
     Dataset<Row> rescaledData = idfModel.transform(featurizedData);

examples/src/main/python/ml/tf_idf_example.py

Lines changed: 2 additions & 0 deletions
@@ -37,6 +37,8 @@
     wordsData = tokenizer.transform(sentenceData)
     hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
     featurizedData = hashingTF.transform(wordsData)
+    # alternatively, CountVectorizer can also be used to get term frequency vectors
+
     idf = IDF(inputCol="rawFeatures", outputCol="features")
     idfModel = idf.fit(featurizedData)
     rescaledData = idfModel.transform(featurizedData)

examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala

Lines changed: 3 additions & 1 deletion
@@ -20,7 +20,7 @@ package org.apache.spark.examples.ml
 
 import org.apache.spark.{SparkConf, SparkContext}
 // $example on$
-import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
+import org.apache.spark.ml.feature.{CountVectorizer, HashingTF, IDF, Tokenizer}
 // $example off$
 import org.apache.spark.sql.SQLContext
 
@@ -43,6 +43,8 @@ object TfIdfExample {
     val hashingTF = new HashingTF()
       .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
     val featurizedData = hashingTF.transform(wordsData)
+    // alternatively, CountVectorizer can also be used to get term frequency vectors
+
     val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
     val idfModel = idf.fit(featurizedData)
     val rescaledData = idfModel.transform(featurizedData)
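The comment added above only hints at the alternative; a rough sketch of how it could be realized in this Scala example follows. The `wordsData` DataFrame and the "words"/"rawFeatures" column names come from the surrounding code, and the vocabulary size of 20 is an assumption mirroring the example's `numFeatures`:

    // alternative TF step: fit a vocabulary and count tokens instead of hashing them
    val cvModel = new CountVectorizer()
      .setInputCol("words").setOutputCol("rawFeatures").setVocabSize(20)
      .fit(wordsData)
    val featurizedData = cvModel.transform(wordsData)
    // the IDF stage that follows is unchanged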
