Van Paridon & Thompson (2019) introduce pretrained word embeddings and precomputed word/bigram/trigram frequencies in 55 languages. The files can be downloaded from the links in this table. Word vectors trained on subtitles are available, as well as vectors trained on Wikipedia and on a combination of subtitles and Wikipedia (the combination gives the best predictive performance).
This repository contains the subs2vec module: a number of Python 3.7 scripts and command line tools for evaluating a set of word vectors on semantic similarity, semantic and syntactic analogy, and lexical norm prediction tasks. In addition, the subs2vec.py script will take an OpenSubtitles or Wikipedia archive and go through all the steps to train a fastText model and produce word vectors as used in the paper associated with this repository.
Psycholinguists may be especially interested in the norms script, which evaluates the lexical norm prediction performance of a set of word vectors, but can also be used to predict lexical norms for un-normed words. For a more detailed explanation, see the how to use -> extending lexical norms section.
The scripts in this repository require Python 3.7 and some additional libraries that are easily installed through pip. (If you want to use the subs2vec.py script to train your own word embeddings, you will also need compiled fastText and word2vec binaries.)
If you use any of the subs2vec code and/or pretrained models, please cite the preprint (Van Paridon & Thompson, 2019).
subs2vec is available through pip; installing it is as easy as running:
python3 -m pip install subs2vec
Any missing dependencies should be installed automatically.
Each submodule of subs2vec can then be run as a command line tool using the -m flag:
python3 -m subs2vec.submodule_name
To evaluate word embeddings on analogies, semantic similarity, or lexical norm prediction as in Van Paridon & Thompson (2019), use:
python3 -m subs2vec.analogies fr french_word_vectors.vec
python3 -m subs2vec.similarities fr french_word_vectors.vec
python3 -m subs2vec.norms fr french_word_vectors.vec
subs2vec uses two-letter ISO 639-1 language codes, so French in the example above is fr, English would be en, German would be de, etc.
All datasets used for evaluation, including the lexical norms, are stored in subs2vec/evaluation/datasets/.
Results from Van Paridon & Thompson (2019) are in subs2vec/evaluation/article_results/.
To extend lexical norms (either norms you have collected yourself, or norms provided in this repository) use:
python3 -m subs2vec.norms fr french_word_vectors.vec --extend_norms=french_norms_file.txt
The norms file should be a tab-separated text file, with the first line containing column names; the column containing the words should be called word. Unobserved cells should be left empty. If you are unsure how to generate this file, you can create your list in Excel and then use Save as... > tab-delimited text.
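For illustration, a hypothetical norms file might look like this (made-up words and values; columns are separated by tabs, and the empty cell reflects an unobserved value):

```
word	valence	arousal
maison	6.25	2.90
orage		5.10
```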
For an overview of the norms included in the repo (and their authors), see this list. For the norms datasets themselves, look inside this directory.
The subtitle corpus used to train subs2vec was also used to compile the word frequencies in SUBTLEX. That same corpus can of course be used to compile bigram and trigram frequencies as well.
To extract word, bigram, or trigram frequencies from a text file yourself, fr.txt for instance, use:
python3 -m subs2vec.frequencies fr.txt
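If you just want to see what this counting involves, here is a minimal conceptual sketch in Python (not the subs2vec implementation): it treats each line of the file as a sentence and counts whitespace-separated n-grams.

```python
# Conceptual sketch of word/bigram/trigram counting, not the subs2vec module:
# treat each line as a sentence and count whitespace-separated n-grams.
from collections import Counter

unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
with open("fr.txt", encoding="utf-8") as f:
    for line in f:
        words = line.split()
        unigrams.update(words)
        bigrams.update(" ".join(ngram) for ngram in zip(words, words[1:]))
        trigrams.update(" ".join(ngram) for ngram in zip(words, words[1:], words[2:]))

print(unigrams.most_common(10))
```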
In general, however, we recommend downloading the precompiled frequencies files from [language archive] and looking up frequencies in those.
When looking up frequencies for specific words, bigrams, or trigrams, you may find that you cannot open the frequencies files (they can be very large). To retrieve items of interest use:
python3 -m subs2vec.lookup frequencies_file.tsv list_of_items.txt
Your list of items should be a simple text file, with each item you want to look up on its own line.
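For example, a hypothetical list_of_items.txt containing a word, a bigram, and a trigram:

```
maison
pomme de terre
tout de suite
```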
This lookup script works for looking up frequencies, but it simply retrieves matching lines from any plain text file, so it works for looking up word vectors in .vec files as well.
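If you prefer to read vectors into Python directly, here is a minimal sketch, assuming the standard fastText .vec text format (a header line with vocabulary size and dimensionality, then one word followed by its space-separated vector components per line):

```python
# Minimal sketch of reading a .vec file, assuming the standard fastText
# text format: header "<n_words> <n_dims>", then one word plus values per line.
import numpy as np

def load_vectors(path, limit=None):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        n_words, n_dims = map(int, f.readline().split())  # consume header
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.array(values, dtype=np.float32)
    return vectors

vecs = load_vectors("french_word_vectors.vec", limit=100_000)
```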
subs2vec comes with a module that removes duplicate lines from text files. We used it to remove duplicate lines from training corpora, but it works for any text file.
To remove duplicates from fr.txt, for example, use:
python3 -m subs2vec.deduplicate fr.txt
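Conceptually, deduplication keeps only the first occurrence of every line. A minimal sketch (not necessarily how the subs2vec module implements it):

```python
# Minimal sketch of line deduplication, not the subs2vec implementation:
# keep the first occurrence of each line, preserving order.
# Note that the set of unique lines is held in memory.
def deduplicate_lines(in_path, out_path):
    seen = set()
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if line not in seen:
                seen.add(line)
                fout.write(line)

deduplicate_lines("fr.txt", "dedup.fr.txt")
```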
If you want to reproduce the models used in Van Paridon & Thompson (2019), you can use the train_model module.
For instance, the steps to create a subtitle corpus and train a model on it are:
- Download a corpus:
python3 -m subs2vec.download fr subs
- Clean the corpus:
python3 -m subs2vec.clean_subs fr --strip --join
- Deduplicate the lines in the corpus:
python3 -m subs2vec.deduplicate fr.txt
- Train a fastText model on the subtitle corpus:
python3 -m subs2vec.train_model fr subs dedup.fr.txt
This last step requires the binaries for fastText and word2phrase (part of word2vec) to be downloaded, built, and discoverable on your system (i.e., on your PATH).
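A quick way to check whether the binaries are discoverable is shutil.which. The binary names below ("fasttext", "word2phrase") are assumptions; adjust them to match however your builds are named.

```python
# Quick sanity check that the required binaries are on your PATH.
# Binary names are assumptions; adjust to your local build names.
import shutil

for binary in ["fasttext", "word2phrase"]:
    print(binary, "->", shutil.which(binary) or "NOT FOUND")
```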
For more detailed training options:
python3 -m subs2vec.train_model --help
For more detailed documentation of the package modules and API, see subs2vec.readthedocs.io.
This table contains links to the top 1 million word vectors in each language, as well as all vectors, model binaries, and the word, bigram, and trigram frequencies in the subtitle and Wikipedia corpora. If you use these pretrained vectors/models, please cite the preprint (Van Paridon & Thompson, 2019).