## Word2Vec

Word2Vec computes distributed vector representations of words. The main advantage of distributed
representations is that similar words are close in the vector space, which makes generalization to
novel patterns easier and model estimation more robust. Distributed vector representations have
been shown to be useful in many natural language processing applications such as named entity
recognition, disambiguation, parsing, tagging, and machine translation.

### Model

In our implementation of Word2Vec, we use the skip-gram model. The training objective of the
skip-gram model is to learn word vector representations that are good at predicting a word's
context in the same sentence. Mathematically, given a sequence of training words `$w_1, w_2, \dots, w_T$`,
the objective of the skip-gram model is to maximize the average log-likelihood
`\[
\frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t)
\]`
where $k$ is the size of the training window.

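To make the objective concrete, here is a small illustrative sketch (plain Scala, not an MLlib API)
that enumerates the `(word, context)` pairs a skip-gram model with window size `k = 2` would try to
predict for a toy sentence:

{% highlight scala %}
// Illustrative only: enumerate skip-gram (word, context) pairs for one sentence.
val sentence = Seq("the", "quick", "brown", "fox", "jumps")
val k = 2  // window size

val pairs = for {
  (word, t) <- sentence.zipWithIndex
  j <- -k to k
  if j != 0 && t + j >= 0 && t + j < sentence.length
} yield (word, sentence(t + j))

pairs.foreach { case (word, context) => println(s"$word -> $context") }
{% endhighlight %}
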
In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$, which are
the vector representations of $w$ as a word and as a context, respectively. The probability of
correctly predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is
`\[
p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}
\]`
where $V$ is the vocabulary size.

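As a toy illustration of this softmax, the sketch below (made-up two-dimensional vectors, not MLlib
code) computes $p(w_i | w_j)$ by normalizing $\exp(u_{w_i}^{\top}v_{w_j})$ over a tiny vocabulary:

{% highlight scala %}
// Illustrative only: softmax over a tiny vocabulary of word vectors.
val u = Map(                      // "word" vectors u_w
  "cat" -> Array(0.2, 0.1),
  "dog" -> Array(0.3, 0.0),
  "car" -> Array(-0.1, 0.4))
val v = Array(0.25, 0.05)         // context vector v_{w_j}

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// The denominator sums over every word in the vocabulary (size V).
val denom = u.values.map(uw => math.exp(dot(uw, v))).sum
val pCat = math.exp(dot(u("cat"), v)) / denom
println(s"p(cat | context) = $pCat")
{% endhighlight %}
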
The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
is proportional to $V$, which can easily be in the order of millions. To speed up training of
Word2Vec, we use hierarchical softmax, which reduces the complexity of computing $\log p(w_i | w_j)$
to $O(\log(V))$.
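For example, with a vocabulary of $10^6$ words, evaluating the softmax denominator takes on the
order of $10^6$ operations per prediction, whereas hierarchical softmax evaluates only about
$\log_2(10^6) \approx 20$ terms.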

### Example

The example below demonstrates how to load a text file, parse it as an RDD of `Seq[String]`,
construct a `Word2Vec` instance, and then fit a `Word2VecModel` with the input data. Finally,
we display the top 40 synonyms of the specified word. To run the example, first download
the [text8](http://mattmahoney.net/dc/text8.zip) data and extract it to your preferred directory.
Here we assume the extracted file is named `text8` and is in the same directory in which you run
the Spark shell.

<div class="codetabs">
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.Word2Vec

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("china", 40)

for ((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
{% endhighlight %}
</div>
</div>
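
`Word2Vec` also exposes setters for common training parameters, and the fitted model can return the
learned vector of an individual word. Continuing from the example above, the snippet below is a
sketch of one possible configuration; the exact setters and their defaults may vary across Spark
versions.

{% highlight scala %}
val word2vec = new Word2Vec()
  .setVectorSize(100)    // dimensionality of the learned word vectors
  .setNumIterations(1)   // number of training passes over the data
  .setSeed(42L)          // fix the seed for reproducible results

val model = word2vec.fit(input)

// Look up the learned vector for a single word.
val vector = model.transform("china")
{% endhighlight %}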

## TFIDF