
# word2vec-rs

Simple implementation of Word2Vec in Rust

This is a from-scratch implementation of Word2Vec in Rust, written without any ML libraries or purpose-built crates such as finalfusion, a Rust toolkit for working with embeddings.

The word2vec-rs binary trains a hidden layer using a basic feedforward neural network with a softmax regression classifier. It then extracts embeddings from the model's weight matrix and, for given query terms, compares the quality of similarity predictions produced by cosine similarity against those produced by forward propagation.
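
As a rough sketch of the forward-propagation half of that comparison (illustrative only; the function names and toy matrices below are not the crate's actual code), a one-hot input word selects its embedding row from the input weights, and a softmax over its dot products with the output weights yields a probability for every vocabulary word:

```rust
// Illustrative sketch, not the crate's actual code.
fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtract the max logit for numerical stability before exponentiating.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn forward(w_in: &[Vec<f32>], w_out: &[Vec<f32>], word_idx: usize) -> Vec<f32> {
    // Hidden layer: the embedding (row of the input weight matrix) for the
    // one-hot input word.
    let hidden = &w_in[word_idx];
    // Output logits: dot product of the hidden vector with each output
    // word's weight vector, then softmax to get probabilities.
    let logits: Vec<f32> = w_out
        .iter()
        .map(|out_row| out_row.iter().zip(hidden).map(|(a, b)| a * b).sum())
        .collect();
    softmax(&logits)
}

fn main() {
    // Toy 3-word vocabulary with 2-dimensional embeddings.
    let w_in = vec![vec![0.9_f32, 0.1], vec![0.4, 0.6], vec![0.2, 0.8]];
    let w_out = vec![vec![0.5_f32, 0.5], vec![0.3, 0.7], vec![0.8, 0.2]];
    let probs = forward(&w_in, &w_out, 0);
    println!("probabilities of tokens near word 0: {probs:?}");
}
```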

This repository also includes a Python script for parsing and sanitizing Wikipedia articles, which can be used as training data, along with a script for plotting graphs from newline-separated cross-entropy values in a file. Some example corpora used for testing are provided as well.

*Cross-entropy loss visualization from 175 training epochs*

This implementation is discussed in this blog post.

  ~>> ./target/release/word2vec-rs --input ./corpora/san_francisco.txt --predict gate --load ./model.out --no_train --analogy gate,bridge,golden
  Most likely nearby tokens to [gate] via nn forward propagation:

  [0]: golden     | probability: 0.1078739
  [1]: park       | probability: 0.0241675
  [2]: bay        | probability: 0.0179222
  [3]: city       | probability: 0.0140273
  [4]: north      | probability: 0.0136982
  [5]: area       | probability: 0.0123178
  [6]: bridge     | probability: 0.0097368
  [7]: county     | probability: 0.0084362
  [8]: california | probability: 0.0058315
  [9]: mission    | probability: 0.0057265

  Most similar nearby tokens to [gate] via embeddings cosine similarity:

  [0]: bridge    | similarity: 0.9279431
  [1]: golden    | similarity: 0.9071595
  [2]: park      | similarity: 0.8965557
  [3]: north     | similarity: 0.8283078
  [4]: protect   | similarity: 0.7988443
  [5]: national  | similarity: 0.7634193
  [6]: makeshift | similarity: 0.7556177
  [7]: jfk       | similarity: 0.7491959
  [8]: fort      | similarity: 0.7381749
  [9]: tent      | similarity: 0.7358739

  Computing analogy for "gate" - "bridge" + "golden" = ?

  best word analogy: "gate" - "bridge" + "golden" = "park" (similarity 0.86515)
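
As a minimal sketch of what that analogy query computes (illustrative only; `best_analogy` and the toy embeddings below are not the crate's actual code), the embeddings are combined as a − b + c, and the remaining vocabulary word whose embedding is closest to the result by cosine similarity is returned:

```rust
use std::collections::HashMap;

/// Cosine similarity between two embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Answer the analogy "a" - "b" + "c" = ? by combining the three embeddings
/// and returning the remaining vocabulary word closest to the result.
fn best_analogy<'v>(
    embeddings: &'v HashMap<String, Vec<f32>>,
    a: &str,
    b: &str,
    c: &str,
) -> Option<(&'v str, f32)> {
    let (va, vb, vc) = (embeddings.get(a)?, embeddings.get(b)?, embeddings.get(c)?);
    // target = a - b + c, computed element-wise over the embedding dimensions
    let target: Vec<f32> = va
        .iter()
        .zip(vb.iter())
        .zip(vc.iter())
        .map(|((x, y), z)| x - y + z)
        .collect();
    embeddings
        .iter()
        .filter(|(word, _)| ![a, b, c].contains(&word.as_str()))
        .map(|(word, vec)| (word.as_str(), cosine_similarity(&target, vec)))
        .max_by(|x, y| x.1.partial_cmp(&y.1).unwrap())
}

fn main() {
    // Toy three-dimensional embeddings; the real ones come from the model.
    let embeddings: HashMap<String, Vec<f32>> = [
        ("gate", vec![0.9_f32, 0.1, 0.3]),
        ("bridge", vec![0.8, 0.2, 0.4]),
        ("golden", vec![0.7, 0.3, 0.2]),
        ("park", vec![0.6, 0.4, 0.1]),
    ]
    .into_iter()
    .map(|(w, v)| (w.to_string(), v))
    .collect();

    if let Some((word, sim)) = best_analogy(&embeddings, "gate", "bridge", "golden") {
        println!("\"gate\" - \"bridge\" + \"golden\" = \"{word}\" (similarity {sim:.5})");
    }
}
```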

Information about program arguments and flags:

  Usage: word2vec-rs [OPTIONS]

  Options:
        --analogy <a,b,c>        ask the model for a word analogy; A is to B as C is to ???
        --embedding_size <INT>   the dimensionality of the trained embeddings; defaults to 128
        --entropy                if true, prints out the cross-entropy loss after each training epoch
        --epochs <INT>           the number of training epochs to run within this invocation
        --input <FILE_PATH>      path to the training text to ingest
        --learning_rate <FLOAT>  the learning rate; defaults to 0.001
        --load <FILE_PATH>       path to a previously-written file containing model weight vectors
        --no_train               if true, skips any training and only executes query arguments
        --predict <WORD>         prints the top-ten most similar words to this argument, produced by forward propagation 
                                        and embedding cosine similarity
        --save                   if true, will save computed model weights to ./model.out, overwriting any existing local file
        --window_size <INT>      the sliding window size for token co-occurrence; defaults to 4
    -h, --help                   Print help
    -V, --version                Print version
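
Putting a few of these flags together, a training run that saves the learned weights and prints the per-epoch loss might look like the following (an illustrative invocation assembled from the options above):

```
~>> ./target/release/word2vec-rs --input ./corpora/san_francisco.txt --epochs 175 --entropy --save
```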