AkaCoder404/word-embedding-notes

Word Embedding

This repository contains helpful resources about word embeddings. It also includes methods to evaluate, visualize, and apply popular pretrained embeddings. A more detailed report on my findings can be found at .report.md.

Related

Background

Goals

  1. Embed words using a variety of embedding techniques.
  2. Project the embeddings into 2-D space with t-SNE and discuss observations from the visualizations.
  3. Apply word embeddings to downstream tasks such as sentiment analysis.

Environment

Python 3.8.18

Files

data/               # Datasets and pretrained embeddings
out/                # Trained models
SentimentAnalysis/  # Applying trained embeddings to sentiment analysis
run.py              # Examples of how to train your own embeddings
visualize.ipynb     # Examples of how to visualize word embeddings

Datasets

Here is a list of corpora that can be used for training embeddings. Some datasets are already preprocessed; others require additional preprocessing before you embed them.

Pretrained Embeddings/Vectors

Here is a list of pretrained embeddings and where you can find them. Some of them are pretrained on the datasets listed above.

  • Word2Vec, Google News Embeddings - Word2vec embeddings (3 million 300-dimensional English word vectors) trained on the Google News corpus (3 billion running words)
  • GloVe Wikipedia 2014 + Gigaword 5 - GloVe, 6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors
  • GloVe Twitter - GloVe, 2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors
  • NLPL word embeddings repository - Contains a large number of pretrained word embeddings, along with the corpus each was trained on, the algorithm, and other parameters such as vector size, vocabulary size, and lemmatization (or lack of it).

Summary of pretrained embeddings:

| Name | Type | Vector Size | Vocab Size |
|---|---|---|---|
| GoogleNews | Word2Vec | 300 | 3,000,000 |
| GloveTwitter | GloVe | 25, 50, 100, 200 | 1,200,000 |
| Common Crawl | GloVe | 300 | 1,900,000 |
| Common Crawl | GloVe | 300 | 2,200,000 |
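GloVe distributes its vectors as plain text, one word per line followed by its space-separated components. The sketch below shows one way to read that layout into a dictionary. The function name `load_text_vectors` and the in-memory sample are illustrative assumptions, not code from this repository, and the Google News vectors ship in a binary format that this does not handle.

```python
import io

def load_text_vectors(handle, skip_header=False):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats.

    Set skip_header=True for word2vec-style text files, whose first line
    holds the vocabulary size and dimension instead of a vector.
    """
    vectors = {}
    if skip_header:
        next(handle, None)
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Tiny in-memory stand-in for a real file such as one opened with open(path, encoding="utf-8").
sample = io.StringIO("king 0.5 0.1 0.3\nqueen 0.4 0.2 0.3\n")
vectors = load_text_vectors(sample)
```

For the larger files in the table, reading into a dictionary of Python lists can use a lot of memory, so a real loader would likely store the rows in an array instead.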

Visualization Methods

  • Scatter plots with dimensionality reduction (t-SNE, PCA)
  • Word Clouds
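As a rough illustration of the scatter-plot approach, the sketch below reduces a few toy vectors to 2-D with PCA computed through NumPy's SVD. The toy matrix and the helper name `pca_2d` are assumptions made for this example. t-SNE could be swapped in at the same point, for instance through scikit-learn's `sklearn.manifold.TSNE`, and the resulting coordinates passed to any plotting library.

```python
import numpy as np

def pca_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy 4-dimensional vectors standing in for real embeddings.
words = ["king", "queen", "apple", "orange"]
toy = np.array([
    [0.9, 0.1, 0.8, 0.0],
    [0.8, 0.2, 0.9, 0.1],
    [0.1, 0.9, 0.0, 0.7],
    [0.0, 0.8, 0.1, 0.9],
])
coords = pca_2d(toy)  # one (x, y) row per word, ready for a scatter plot
```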

Embedding Methods

  • One Hot Encoding
  • Context-Free
    • Word2vec
    • GloVe
  • Contextual
    • Unidirectional
    • Shallowly Bidirectional
    • HPCA (Hellinger PCA)
    • morphoRNNLM (morphological recursive neural network language model)
    • LexVec (Lexicon-Enhanced Vector Representation)
  • ConceptNet
  • HDC/PDC
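To make the contrast between one-hot encoding and the dense methods concrete, here is a minimal sketch of one-hot vectors over a hypothetical three-word vocabulary. Every pair of distinct words has a dot product of zero, so the encoding carries no notion of similarity, which is the gap dense embeddings are meant to fill.

```python
def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's index in `vocab`."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vocab = ["cat", "dog", "car"]  # hypothetical toy vocabulary
cat, dog = one_hot("cat", vocab), one_hot("dog", vocab)
similarity = dot(cat, dog)  # 0 for any two different words
```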

Evaluations

Word embeddings are commonly evaluated intrinsically on analogy, similarity, and categorization benchmarks.

Overview

Analogy Datasets

Similarity Datasets

  • SemEval2012 - SemEval2012 dataset for relational similarity
  • WS353 or wordsim353 - WS353 dataset for testing attributional and relatedness similarity
  • SimLex999 - SimLex999 dataset for testing attributional similarity
  • TR9856 - TR9856 dataset for testing multi-word term relatedness
  • MTurk - MTurk dataset for testing attributional similarity
  • RG65 - Rubenstein and Goodenough dataset for testing attributional and relatedness similarity
  • RW - Rare Words dataset for testing attributional similarity
  • MEN - MEN dataset for testing similarity and relatedness
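Similarity datasets like these are typically scored by comparing the model's cosine similarity for each word pair against the human rating, using Spearman rank correlation. The sketch below is a standard-library version of that procedure. The toy vectors and ratings are invented for illustration, and a real evaluation would also need a policy for pairs that contain out-of-vocabulary words.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Invented toy data: (word1, word2, human rating).
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "truck": [0.2, 0.8]}
pairs = [("cat", "dog", 9.0), ("car", "truck", 8.5), ("cat", "car", 1.0), ("dog", "truck", 2.0)]

model_scores = [cosine(toy_vectors[a], toy_vectors[b]) for a, b, _ in pairs]
human_scores = [score for _, _, score in pairs]
rho = spearman(model_scores, human_scores)
```

A value of `rho` close to 1 means the embedding orders the pairs the same way the human annotators did.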

Categorization Datasets

Google Analogy Benchmark

Sources:

The Google analogy benchmark has a total of 19,544 questions across 14 different categories.
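Analogy questions of the form "a is to b as c is to ?" are commonly answered with the vector-offset method: compute b - a + c and return the nearest remaining word by cosine similarity. The sketch below shows that idea on invented toy vectors. The function name `solve_analogy` is an assumption for this example, and real evaluations usually length-normalize the vectors first.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Invented toy vectors; real benchmarks use the full pretrained vocabulary.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, -1.0],
}
answer = solve_analogy("man", "woman", "king", toy)  # -> "queen"
```

Benchmark accuracy is then the fraction of questions for which the returned word matches the expected fourth word.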

MSR Analogy Benchmark

Sources:

The MSR analogy benchmark has a total of 8,000 questions across 16 categories.

WordRep Benchmark

Source: https://arxiv.org/abs/1407.1640

The main task for WordRep is analogical reasoning.

SemEval2012 Benchmark

Source: https://aclanthology.org/S12-1047/

The main task for SemEval2012 is finding degree of similarity.

Applications

  • Text Classification
  • Named Entity Recognition (NER)
  • Machine Translation
  • Information Retrieval
  • Question Answering
  • Semantic Similarity and Clustering
  • Text Generation
  • Similarity and analogy
  • Pre-training models
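For applications such as text classification, one simple baseline is to represent a sentence as the average of its word vectors and feed that to a classifier. The sketch below shows only the averaging step, on invented toy vectors. The helper name `average_embedding` is an assumption and is not taken from the SentimentAnalysis/ code in this repository.

```python
def average_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

toy = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}  # invented 2-D vectors
features = average_embedding(["good", "movie", "zzz"], toy, dim=2)
```

Unknown tokens are skipped here, which is one of several possible choices; a real pipeline might map them to a dedicated unknown-word vector instead.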

Downstream Tasks Dataset

  • Yelp - An all-purpose dataset for learning

Sentiment Analysis

Others

Feel free to add to this list!
