This repository contains helpful resources about word embeddings, along with methods to evaluate, visualize, and apply popular pretrained embeddings. A more detailed report on my findings can be found in report.md.
Goals
- Use various embedding techniques to embed words.
- Project the embeddings into 2-D space (using t-SNE) and discuss observations from the visualizations.
- Apply word embeddings to downstream tasks such as sentiment analysis.
Requires Python 3.8.18.
```
data/               # store all the datasets/pretrained embeddings here
out/                # all the trained models will be here
SentimentAnalysis/  # applications of trained embeddings on sentiment analysis
run.py              # examples of how to train your own embeddings
visualize.ipynb     # examples of how to visualize word embeddings
```
Here is a list of corpora that can be used for training embeddings. Note that some datasets come preprocessed while others require additional preprocessing before you embed them.
- Daily Dialog - A multi-turn open-domain English dialog dataset. It contains 13,118 dialogues.
- Wikipedia Dump
- Twitter-27B
- Common Crawl
- Google News Corpus
- Penn Tree Bank
- Book Corpus
- Chinese chitchat dataset
- WMT-1 translation dataset (2018)
- Chinese classical poetry dataset
- Nearly 1 GB of 30 million chat utterances (Chinese)
- MNBVC
- NewsGroup dataset (also available in JSON format)
Here is a list of pretrained embeddings and where you can find them. Some of them are pretrained on the datasets listed above.
- Word2Vec, Google News Embeddings - Word2vec embeddings (3 million 300-dimensional English word vectors) trained on the Google News corpus (3 billion running words)
- GloVe Wikipedia 2014 + Gigaword 5 - GloVe, 6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors
- GloVe Twitter - GloVe, 2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors
- NLPL word embeddings repository - Contains many pretrained word embeddings, listing the corpus each was trained on and parameters such as vector size, vocabulary size, algorithm, and lemmatization (or lack thereof).
The table below summarizes a few common pretrained embeddings:

Name | Type | Vector Size | Vocab Size |
---|---|---|---|
GoogleNews | Word2Vec | 300 | 3,000,000 |
GloveTwitter | GloVe | 25, 50, 100, 200 | 1,200,000 |
Common Crawl (42B tokens) | GloVe | 300 | 1,900,000 |
Common Crawl (840B tokens) | GloVe | 300 | 2,200,000 |
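GloVe releases ship as plain text files, one word per line followed by its vector components. A small self-contained sketch of parsing that format (the two-line "file" below is a stand-in; real files go in `data/`):

```python
# Read GloVe-style text vectors ("word v1 v2 ... vd" per line) into a dict.
import io
import numpy as np

# A tiny stand-in for a GloVe file on disk.
fake_glove = io.StringIO("king 0.1 0.2 0.3\nqueen 0.1 0.25 0.35\n")

embeddings = {}
for line in fake_glove:
    parts = line.rstrip().split(" ")
    embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

print(embeddings["king"].shape)  # (3,)
```

For real files, replace the `StringIO` with `open("data/glove.6B.50d.txt", encoding="utf-8")` (path illustrative).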
- Scatter plots with dimensionality reduction (t-SNE, PCA)
- Word Clouds
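The scatter-plot approach can be sketched in a few lines (assuming scikit-learn; random vectors stand in for real embeddings here, and PCA stands in for the t-SNE used in visualize.ipynb):

```python
# Project word vectors to 2-D for a labeled scatter plot.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "apple"]
vectors = rng.normal(size=(len(words), 50))  # placeholder 50-d embeddings

coords = PCA(n_components=2).fit_transform(vectors)
print(coords.shape)  # (5, 2)
# Each row of `coords` is the (x, y) position to plot and label with its word.
```

Swapping `PCA` for `sklearn.manifold.TSNE` gives the t-SNE version, provided the number of words exceeds the chosen perplexity.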
- One-Hot Encoding
- Context-Free
  - Word2vec
  - GloVe
- Contextual
  - Unidirectional
  - Shallowly Bidirectional
- HPCA (Hellinger PCA)
- morphoRNNLM (Morphological Regularization Neural Network Language Model)
- LexVec (Lexicon-Enhanced Vector Representation)
- ConceptNet
- HDC/PDC
There are several standard benchmarks for evaluating word embeddings.
Analogy Datasets
Similarity Datasets
- SemEval2012 - SemEval2012 dataset for relational similarity
- WS353 (wordsim353) - WS353 dataset for testing attributional and relatedness similarity
- SimLex999 - SimLex999 dataset for testing attributional similarity
- TR9856 - TR9856 dataset for testing multi-word term relatedness
- MTurk - MTurk dataset for testing attributional similarity
- RG65 - Rubenstein and Goodenough dataset for testing attributional and relatedness similarity
- RW - Rare Words dataset for testing attributional similarity
- MEN - MEN dataset for testing similarity and relatedness
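All of these similarity benchmarks are scored the same way: compare the model's cosine similarity for each word pair against human ratings via Spearman correlation. A sketch, assuming scipy and using random placeholder embeddings with made-up pairs (not actual WS353 data):

```python
# Similarity-benchmark evaluation: Spearman correlation between model
# cosine similarities and human ratings over word pairs.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["tiger", "cat", "car", "bus"]}

# (word1, word2, human rating) triples -- illustrative values only.
pairs = [("tiger", "cat", 7.35), ("car", "bus", 8.0), ("tiger", "bus", 1.0)]
model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [r for _, _, r in pairs]

rho, _ = spearmanr(model_scores, human_scores)  # rho in [-1, 1]
```

A higher `rho` means the embedding space orders pairs more like humans do.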
Categorization Datasets
- AP - Almuhareb and Poesio categorization dataset
- BLESS - Baroni and Lenci categorization dataset
- Battig - Battig and Montague (1969) categorization dataset
- ESSLI 2c
- ESSLI 2b
- ESSLI 1a
Sources:
The Google analogy benchmark has a total of 19,544 questions across 14 different categories.
Sources:
The MSR analogy benchmark has a total of 8,000 questions across 16 categories.
Source: https://arxiv.org/abs/1407.1640
The main task for WordRep is analogical reasoning.
Source: https://aclanthology.org/S12-1047/
The main task for SemEval2012 is measuring the degree of relational similarity.
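Analogy benchmarks like Google and MSR are typically scored with the 3CosAdd rule: for a question a : b :: a* : ?, the answer is the vocabulary word (excluding the cue words) whose vector is closest to b - a + a*. A self-contained sketch on a toy space constructed so the analogy holds exactly:

```python
# 3CosAdd analogy answering over a tiny toy vocabulary.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple"]
emb = {w: rng.normal(size=20) for w in vocab}
# Make the toy space consistent with king - man + woman == queen.
emb["queen"] = emb["king"] - emb["man"] + emb["woman"]

def analogy(a, b, a_star):
    """Answer a : b :: a_star : ? by maximizing cosine similarity."""
    target = emb[b] - emb[a] + emb[a_star]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, a_star):  # exclude the cue words, as the benchmarks do
            continue
        sim = np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "king", "woman"))  # "queen" in this toy space
```

Benchmark accuracy is simply the fraction of questions where this argmax returns the gold answer.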
- Text Classification
- Named Entity Recognition (NER)
- Machine Translation
- Information Retrieval
- Question Answering
- Semantic Similarity and Clustering
- Text Generation
- Similarity and analogy
- Pre-training models
- Yelp - An all-purpose reviews dataset, commonly used for sentiment classification
- Twitter US Airline Sentiment - Analyze how travelers in February 2015 expressed their feelings on Twitter
Feel free to add to this list!
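A common baseline for sentiment analysis with pretrained embeddings is to average a text's word vectors and feed that to a linear classifier. A sketch, assuming scikit-learn and using random placeholder embeddings with a four-example toy dataset (the SentimentAnalysis/ folder has the real pipeline):

```python
# Averaged-embedding sentiment baseline: mean word vector -> logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=25) for w in ["good", "great", "bad", "awful", "movie"]}

def featurize(tokens):
    # Average the vectors of known words; zero vector if none are known.
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(25)

texts = [["good", "movie"], ["great", "movie"], ["bad", "movie"], ["awful", "movie"]]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

X = np.stack([featurize(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
preds = clf.predict(X)
```

With real embeddings (e.g., the GloVe Twitter vectors above), words like "good" and "great" land near each other, which is what makes this simple averaging baseline work at all.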