This repository contains helpful resources about word embeddings, along with methods to evaluate, visualize, and apply popular pretrained embeddings. A more detailed report on my findings can be found in report.md.
Goals
- Use various embedding techniques to embed words.
- Project the embeddings into 2-D space (using t-SNE) and discuss observations from the visualizations.
- Apply word embeddings to downstream tasks such as sentiment analysis.
Requires Python 3.8.18.
```
data/               # store all the datasets/pretrained embeddings here
out/                # all the trained models will be here
SentimentAnalysis/  # applications of trained embeddings on sentiment analysis
run.py              # examples of how to train your own embeddings
visualize.ipynb     # examples of how to visualize word embeddings
```
Here is a list of corpora that can be used for training embeddings. Note that some datasets come preprocessed while others require additional preprocessing before you embed them.
- Daily Dialog - A multi-turn open-domain English dialog dataset. It contains 13,118 dialogues.
- Wikipedia Dump
- Twitter-27B
- Common Crawl
- Google News Corpus
- Penn Tree Bank
- Book Corpus
- Chinese chitchat dataset
- WMT-1 translation dataset (2018)
- Chinese classical poetry dataset
- Nearly 1 GB of 30 million chat utterances (Chinese)
- MNBVC
- NewsGroup dataset (also available in JSON format)
Here is a list of pretrained embeddings and where you can find them. Some of them are pretrained on the datasets listed above.
- Word2Vec, Google News Embeddings - Word2vec embeddings (3 million 300-dimensional English word vectors) trained on the Google News corpus (3 billion running words)
- GloVe Wikipedia 2014 + Gigaword 5 - GloVe, 6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors
- GloVe Twitter - GloVe, 2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors
- NLPL word embeddings repository - Contains many pretrained word embeddings, listing the corpus each was trained on and parameters such as vector size, vocabulary size, algorithm, and lemmatization (or lack thereof).
The table below summarizes a few common pretrained embeddings:

Name | Type | Vector Size | Vocab Size |
---|---|---|---|
GoogleNews | Word2Vec | 300 | 3,000,000 |
GloveTwitter | GloVe | 25, 50, 100, 200 | 1,200,000 |
Common Crawl (42B tokens) | GloVe | 300 | 1,900,000 |
Common Crawl (840B tokens) | GloVe | 300 | 2,200,000 |
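GloVe releases ship as plain text files, one word per line followed by its vector components. A small self-contained sketch of parsing that format (the two-line "file" below is a stand-in; real files go in `data/`):

```python
# Read GloVe-style text vectors ("word v1 v2 ... vd" per line) into a dict.
import io
import numpy as np

# A tiny stand-in for a GloVe file on disk.
fake_glove = io.StringIO("king 0.1 0.2 0.3\nqueen 0.1 0.25 0.35\n")

embeddings = {}
for line in fake_glove:
    parts = line.rstrip().split(" ")
    embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

print(embeddings["king"].shape)  # (3,)
```

For real files, replace the `StringIO` with `open("data/glove.6B.50d.txt", encoding="utf-8")` (path illustrative).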
- Scatter plots with dimensionality reduction (t-SNE, PCA)
- Word Clouds
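The scatter-plot approach can be sketched in a few lines (assuming scikit-learn; random vectors stand in for real embeddings here, and PCA stands in for the t-SNE used in visualize.ipynb):

```python
# Project word vectors to 2-D for a labeled scatter plot.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "apple"]
vectors = rng.normal(size=(len(words), 50))  # placeholder 50-d embeddings

coords = PCA(n_components=2).fit_transform(vectors)
print(coords.shape)  # (5, 2)
# Each row of `coords` is the (x, y) position to plot and label with its word.
```

Swapping `PCA` for `sklearn.manifold.TSNE` gives the t-SNE version, provided the number of words exceeds the chosen perplexity.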
- One-Hot Encoding
- Context-Free
  - Word2vec
  - GloVe
- Contextual
  - Unidirectional
  - Shallowly Bidirectional
- HPCA (Hellinger PCA)
- morphoRNNLM (Morphological Regularization Neural Network Language Model)
- LexVec (Lexicon-Enhanced Vector Representation)
- ConceptNet
- HDC/PDC
There are several standard benchmarks for evaluating word embeddings.
Analogy Datasets
Similarity Datasets
- SemEval2012 - SemEval2012 dataset for relational similarity
- WS353 (wordsim353) - WS353 dataset for testing attributional and relatedness similarity
- SimLex999 - SimLex999 dataset for testing attributional similarity
- TR9856 - TR9856 dataset for testing multi-word term relatedness
- MTurk - MTurk dataset for testing attributional similarity
- RG65 - Rubenstein and Goodenough dataset for testing attributional and relatedness similarity
- RW - Rare Words dataset for testing attributional similarity
- MEN - MEN dataset for testing similarity and relatedness
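All of these similarity benchmarks are scored the same way: compare the model's cosine similarity for each word pair against human ratings via Spearman correlation. A sketch, assuming scipy and using random placeholder embeddings with made-up pairs (not actual WS353 data):

```python
# Similarity-benchmark evaluation: Spearman correlation between model
# cosine similarities and human ratings over word pairs.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["tiger", "cat", "car", "bus"]}

# (word1, word2, human rating) triples -- illustrative values only.
pairs = [("tiger", "cat", 7.35), ("car", "bus", 8.0), ("tiger", "bus", 1.0)]
model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [r for _, _, r in pairs]

rho, _ = spearmanr(model_scores, human_scores)  # rho in [-1, 1]
```

A higher `rho` means the embedding space orders pairs more like humans do.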
Categorization Datasets
- AP - Almuhareb and Poesio categorization dataset
- BLESS - Baroni and Lenci categorization dataset
- Battig - Battig and Montague (1969) categorization dataset
- ESSLI 2c
- ESSLI 2b
- ESSLI 1a
Sources:
The Google analogy benchmark has a total of 19,544 questions across 14 different categories.
Sources:
The MSR analogy benchmark has a total of 8,000 questions across 16 categories.
Source: https://arxiv.org/abs/1407.1640
The main task for WordRep is analogical reasoning.
Source: https://aclanthology.org/S12-1047/
The main task for SemEval2012 is measuring the degree of relational similarity.
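Analogy benchmarks like Google and MSR are typically scored with the 3CosAdd rule: for a question a : b :: a* : ?, the answer is the vocabulary word (excluding the cue words) whose vector is closest to b - a + a*. A self-contained sketch on a toy space constructed so the analogy holds exactly:

```python
# 3CosAdd analogy answering over a tiny toy vocabulary.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple"]
emb = {w: rng.normal(size=20) for w in vocab}
# Make the toy space consistent with king - man + woman == queen.
emb["queen"] = emb["king"] - emb["man"] + emb["woman"]

def analogy(a, b, a_star):
    """Answer a : b :: a_star : ? by maximizing cosine similarity."""
    target = emb[b] - emb[a] + emb[a_star]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, a_star):  # exclude the cue words, as the benchmarks do
            continue
        sim = np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "king", "woman"))  # "queen" in this toy space
```

Benchmark accuracy is simply the fraction of questions where this argmax returns the gold answer.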
- Text Classification
- Named Entity Recognition (NER)
- Machine Translation
- Information Retrieval
- Question Answering
- Semantic Similarity and Clustering
- Text Generation
- Similarity and analogy
- Pre-training models
- Yelp - An all-purpose reviews dataset, commonly used for sentiment classification
- Twitter US Airline Sentiment - Analyze how travelers in February 2015 expressed their feelings on Twitter
Feel free to add to this list!
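A common baseline for sentiment analysis with pretrained embeddings is to average a text's word vectors and feed that to a linear classifier. A sketch, assuming scikit-learn and using random placeholder embeddings with a four-example toy dataset (the SentimentAnalysis/ folder has the real pipeline):

```python
# Averaged-embedding sentiment baseline: mean word vector -> logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=25) for w in ["good", "great", "bad", "awful", "movie"]}

def featurize(tokens):
    # Average the vectors of known words; zero vector if none are known.
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(25)

texts = [["good", "movie"], ["great", "movie"], ["bad", "movie"], ["awful", "movie"]]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

X = np.stack([featurize(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
preds = clf.predict(X)
```

With real embeddings (e.g., the GloVe Twitter vectors above), words like "good" and "great" land near each other, which is what makes this simple averaging baseline work at all.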