How to train your own word2vec model for use with ml5.js

ml5js/training-word2vec

Training

Python Environment

Requirements

pip install -r requirements.txt

Train the model

  1. Clone this repository or download the Python script (train.py).
git clone https://github.com/ml5js/training-word2vec/
  2. The script can train on a single text file or on a directory of files. Create a text file, or a folder containing several text files, then run train.py with the name of that file or folder.

Example:

python train.py file.txt
python train.py files/
  3. By default the script writes two files, vectors.txt and vectors.json. To choose a different output file name, pass the -o argument.
python train.py data.txt -o output.json
  4. The output JSON file can now be used with the ml5.js word2vec examples.
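Before loading the output into ml5.js, it can help to sanity-check the vectors from Python. The following is a minimal sketch, not part of train.py. It assumes the JSON file is either a flat mapping of word to list of numbers or the same mapping nested under a "vectors" key; inspect your own file to confirm its layout. The helper names load_vectors, cosine_similarity, and nearest are hypothetical.

```python
import json
import math


def load_vectors(path):
    """Load word vectors from a JSON file.

    Assumption: the file holds {word: [floats]}, optionally nested
    under a top-level "vectors" key. Adjust if your file differs.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    nested = data.get("vectors")
    return nested if isinstance(nested, dict) else data


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest(vectors, word, n=5):
    """Return the n words most similar to `word`, best match first."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]
```

For example, nearest(load_vectors("vectors.json"), "cat") would list the five closest words to "cat", provided "cat" occurs in your training text. If the neighbours look unrelated, the corpus is probably too small.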

Advanced tokenization

The default tokenizer is very basic. You can ask the script to use NLTK's tokenizer with the --tokenizer argument.

The script can also remove stop words with the --remove-stop-words flag.

python train.py files/ -t nltk --remove-stop-words
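To illustrate what these two options change, here is a rough sketch of basic tokenization followed by stop-word removal. It is not the code train.py runs: the regular expression, the tiny STOP_WORDS set, and both function names are assumptions made for illustration. NLTK ships its own tokenizers and a much larger stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on"}


def basic_tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop very common words that carry little meaning on their own."""
    return [token for token in tokens if token not in stop_words]


tokens = basic_tokenize("The cat sat on the mat.")
print(tokens)                     # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(remove_stop_words(tokens))  # ['cat', 'sat', 'mat']
```

Removing stop words shrinks the vocabulary and keeps very frequent words from dominating each context window, but those words then have no vectors in the output.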
