See the GitHub Project Page for a high-level overview of the project.
This repository contains Python modules to learn word vectors from raw text documents using TensorFlow. There are 4 primary modules:
- docload.py: Loads and processes raw text documents in preparation for model training. Has a few basic "hooks" to make loading Project Gutenberg books easy. Documents are represented as integer numpy arrays.
- windowmodel.py: Contains the TensorFlow graph and methods to train the model, return word vectors, and make predictions. Instantiating the class returns a WindowModel object. Also contains a static method that takes an integer numpy array and formats it for training.
- wordvector.py: Explores the word vectors returned by WindowModel.train(). Finds the closest words using a variety of distance metrics. Has a method to predict analogies (i.e. A is to B as C is to D). Also includes a routine to project word vectors to 2D space using t-SNE.
- plot_util.py: Currently holds a single plot utility, which plots learning curves from training.
- sherlock.ipynb: Uses above modules to load 3 Sherlock Holmes books, train the neural net and do some basic exploration of the results.
- tune_*.ipynb: Hyper-parameter tuning for sherlock.ipynb model. Explore different layer sizes, learning rates, optimizers and weight initialization.
- word_frequency.ipynb: Plots word frequencies from 3 Sherlock Holmes books and overlays a log-uniform distribution. The noise-contrastive estimation routine (tf.nn.nce_loss) in TensorFlow assumes a log-uniform word frequency distribution.
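
The log-uniform assumption mentioned for word_frequency.ipynb can be made concrete with a small sketch. It assumes the formula documented for TensorFlow's log-uniform candidate sampler, `P(k) = (log(k + 2) - log(k + 1)) / log(range_max + 1)`, where `k` is a word's 0-based frequency rank (0 is the most frequent word) and `range_max` is the vocabulary size. The function name `log_uniform_prob` and the vocabulary size below are hypothetical and are not taken from this repository.

```python
import math


def log_uniform_prob(rank, range_max):
    """Probability of the word with 0-based frequency rank `rank`.

    Assumes the log-uniform (Zipfian) form used by TensorFlow's
    log-uniform candidate sampler; `range_max` is the vocabulary size.
    """
    return (math.log(rank + 2) - math.log(rank + 1)) / math.log(range_max + 1)


vocab_size = 1000  # hypothetical vocabulary size, for illustration only
probs = [log_uniform_prob(k, vocab_size) for k in range(vocab_size)]

# The log terms telescope, so the probabilities sum to 1 (up to rounding).
total = sum(probs)
```

Because this distribution gives the highest probability to rank 0, it only matches the data when word IDs are assigned in order of decreasing frequency, which is why comparing it against the measured word frequencies is a useful check.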
Unit tests for Python modules.
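
As a rough illustration of the data flow across the modules described above, the sketch below encodes text as integers, slices the integer sequence into training windows, and compares two vectors by cosine similarity. It is not the repository's code: the names `encode`, `make_windows` and `cosine_similarity` are hypothetical, plain Python lists stand in for numpy arrays to keep the example dependency-free, and treating the center word as the prediction target is an assumption about how windows are formed.

```python
import math
from collections import Counter


def encode(text):
    """Map words to integer IDs, with ID 0 for the most frequent word."""
    words = text.lower().split()
    # most_common() sorts by count; ties keep first-seen order.
    vocab = [word for word, _ in Counter(words).most_common()]
    word_to_id = {word: i for i, word in enumerate(vocab)}
    return [word_to_id[w] for w in words], vocab


def make_windows(ids, half_width):
    """Pair each center word with `half_width` words on either side.

    Assumption: the center word is the target and the surrounding
    words are the input context.
    """
    contexts, targets = [], []
    for i in range(half_width, len(ids) - half_width):
        contexts.append(ids[i - half_width:i] + ids[i + 1:i + 1 + half_width])
        targets.append(ids[i])
    return contexts, targets


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


ids, vocab = encode("the dog saw the cat and the dog ran")
# ids is [0, 1, 2, 0, 3, 4, 0, 1, 5]
contexts, targets = make_windows(ids, half_width=1)
```

Each entry in `contexts` holds the IDs surrounding the matching entry in `targets`. A trained model would supply one vector per ID, and a function like `cosine_similarity` could then rank words by closeness.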