Added Tokenizer and Vector Embedding documents #73

Merged
merged 1 commit into from
Sep 23, 2023
156 changes: 156 additions & 0 deletions docs/Tokenizer.md
@@ -0,0 +1,156 @@
# Tokenizer Summary

Tokenization is the process of converting input text into a list of tokens. In the context of transformers, tokenization includes splitting the input text into words, subwords, or symbols (like punctuation) that are used to train the model.
(https://huggingface.co/docs/transformers/tokenizer_summary)

We will explain three types of tokenizer:

- Byte-Pair Encoding (BPE)

- WordPiece

- SentencePiece


# Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) is a subword tokenization method that is used to represent open vocabularies effectively. It was originally introduced for byte-level compression but has since been adapted for tokenization in natural language processing, especially in neural machine translation.

## How BPE Works

1. **Initialization**: Start by representing each word as a sequence of characters, plus a special end-of-word symbol (e.g., `</w>`).

2. **Iterative Process**: Repeatedly merge the most frequent pair of consecutive symbols or characters.

3. **Stop**: The process can be stopped either after a certain number of merges or when a desired vocabulary size is achieved.

## Example

Suppose we want to apply BPE to the vocabulary `['low', 'lower', 'newest', 'widest']`.

1. **Initialization**:

```
low</w> -> l o w </w>
lower</w> -> l o w e r </w>
newest</w> -> n e w e s t </w>
widest</w> -> w i d e s t </w>

```

2. **Iterative Process**:
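- First merge: `e` and `s` occur together in both `newest` and `widest`, which makes them one of the most frequent pairs (with each word counted once, several pairs tie at two occurrences), so merge them to form `es`.
- Second merge: `es` and `t` are now among the most frequent pairs, so merge them to form `est`.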
- Continue this process until the desired number of merges is achieved.

3. **Result**:
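After several iterations, we might end up with subwords like `l`, `o`, `w`, `e`, `r`, `n`, `s`, `es`, `est`, `i`, `d`, `t`, and `</w>`. The single characters stay in the vocabulary even after they have been merged into longer subwords.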
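The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`) and the word counts (5, 2, 6, 3) are assumptions added for the example, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a {word: count} dictionary."""
    # Initialization: split each word into characters plus the end-of-word symbol
    words = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; first seen wins ties
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words

# Hypothetical word counts, chosen only for illustration
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = learn_bpe(corpus, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

With these counts the first two merges match the walkthrough above (`es`, then `est`); different counts or a different tie-breaking rule could produce a different merge order.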

## Language Models Using BPE

BPE has been used in various state-of-the-art language models and neural machine translation models. Some notable models include:

- **OpenAI's GPT-2**: This model uses a byte-level variant of BPE for its tokenization.
- **BERT**: BERT does not use BPE itself; it uses WordPiece, which is conceptually similar to BPE.
- **Transformer-based Neural Machine Translation models**: Such as those in the "Attention is All You Need" paper.

BPE allows these models to handle rare words and out-of-vocabulary words by breaking them down into known subwords, enabling more flexible and robust tokenization.
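To make the handling of unseen words concrete, the sketch below applies an already learned merge list to a word that was not in the training vocabulary. The merge list and the helper name `apply_bpe` are hypothetical, chosen to continue the `low` / `newest` example; real tokenizers store thousands of merges and use faster lookup structures.

```python
def apply_bpe(word, merges):
    """Segment `word` by replaying learned merges in the order they were learned."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # merge the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge list, assumed to have been learned from a training corpus
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]

# "lowest" never appeared in training, yet it splits into known subwords
print(apply_bpe("lowest", merges))  # ['low', 'est</w>']
```

A word containing characters that no merge covers simply stays split into single characters, which is why BPE rarely needs an unknown-word token.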

## Conclusion

Byte-Pair Encoding is a powerful tokenization technique that bridges the gap between character-level and word-level tokenization. It's especially useful for languages with large vocabularies or for tasks where out-of-vocabulary words are common.

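For more in-depth information, refer to the [paper that adapted BPE to subword tokenization](https://arxiv.org/abs/1508.07909) (Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units").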

# WordPiece Tokenization

WordPiece is a subword tokenization method that is widely used in various state-of-the-art natural language processing models. It's designed to efficiently represent large vocabularies.

## How WordPiece Works

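1. **Initialization**: Begin with a vocabulary containing every character that appears in the training data.

2. **Subword Creation**: Iteratively add new subwords by merging two existing symbols. Unlike BPE, WordPiece does not simply pick the most frequent pair; it picks the pair whose merge most increases the likelihood of the training data, which favors pairs that occur together more often than their individual frequencies would suggest.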

3. **Stop**: The process is usually stopped when a desired vocabulary size is reached.

## Example

Suppose we want to apply WordPiece to the vocabulary `['unwanted', 'unwarranted', 'under']`.

1. **Initialization**:
```
unwanted -> u n w a n t e d
unwarranted -> u n w a r r a n t e d
under -> u n d e r
```

2. **Iterative Process**:
- First merge: `un` occurs in all three words, so it is a strong candidate, although the choice depends on the likelihood-based score rather than on raw frequency alone.
- Second merge: `wa` or `ed` might be selected next, and so on.
- Continue this process until the desired vocabulary size is achieved.

3. **Result**:
After several iterations, we might end up with subwords like `un`, `wa`, `rr`, `ed`, `d`, `e`, and so on.
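The selection step can be illustrated with a scoring rule that is often used to describe WordPiece training: `score(a, b) = freq(ab) / (freq(a) * freq(b))`. The sketch below assumes that rule and the `##` prefix that BERT-style vocabularies use for word-internal pieces; the function name `wordpiece_pair_scores` is made up for this example, and the original training code may differ in its details.

```python
from collections import Counter

def wordpiece_pair_scores(words):
    """Score each adjacent pair as freq(pair) / (freq(first) * freq(second))."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for word in words:
        # The first character is kept bare; later characters get the "##" prefix
        symbols = [word[0]] + ["##" + ch for ch in word[1:]]
        symbol_freq.update(symbols)
        pair_freq.update(zip(symbols, symbols[1:]))
    return {
        pair: count / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

scores = wordpiece_pair_scores(["unwanted", "unwarranted", "under"])

# Show the five highest-scoring candidate merges
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(score, 3))
```

Under this rule the pair `u` + `##n`, although it is the most frequent pair, is penalized because `##n` is itself very common, so on this tiny corpus a pair such as `##w` + `##a` scores higher. This is the main practical difference from BPE's pure frequency count.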

## Language Models Using WordPiece

WordPiece has been adopted by several prominent models in the NLP community:

- **BERT**: BERT uses WordPiece for its tokenization, which is one of the reasons for its success in handling a wide range of NLP tasks.
- **DistilBERT**: A distilled version of BERT that also uses WordPiece.
- **MobileBERT**: Optimized for mobile devices, this model also employs WordPiece tokenization.

The advantage of WordPiece is its ability to break down out-of-vocabulary words into subwords present in its vocabulary, allowing for better generalization and handling of rare words.
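At inference time, BERT-style WordPiece tokenizers split a word greedily, always taking the longest vocabulary entry that matches at the current position. The sketch below shows that idea under simplifying assumptions: the vocabulary is a tiny hypothetical set, the helper name `wordpiece_tokenize` is invented for this example, and a word that cannot be fully covered is mapped to a single `[UNK]` token.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary entry matches at this position
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, for illustration only
vocab = {"un", "##want", "##ed", "##warrant", "##der"}

print(wordpiece_tokenize("unwanted", vocab))     # ['un', '##want', '##ed']
print(wordpiece_tokenize("unwarranted", vocab))  # ['un', '##warrant', '##ed']
print(wordpiece_tokenize("unxyz", vocab))        # ['[UNK]']
```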

## Conclusion

WordPiece tokenization strikes a balance between character-level and word-level representations, making it a popular choice for models that need to handle diverse vocabularies without significantly increasing computational requirements.

For more details, you can refer to the [BERT paper](https://arxiv.org/abs/1810.04805), which uses WordPiece embeddings with a 30,000-token vocabulary.


# SentencePiece Tokenization

SentencePiece is a data-driven, unsupervised text tokenizer and detokenizer, mainly for neural-network-based text generation systems where the vocabulary size is fixed before the neural model is trained. It implements subword units (e.g., byte-pair encoding (BPE) and the unigram language model), with the extension of direct training from raw sentences.

## How SentencePiece Works

1. **Training**: SentencePiece trains a tokenization model directly from raw sentences and does not require any preliminary tokenization.

2. **Vocabulary Management**: Instead of working on pre-split words, SentencePiece treats the text as a raw character sequence and encodes whitespace as an ordinary symbol (the meta symbol `▁`, U+2581), allowing consistent and reversible tokenization for any input.

3. **Subword Regularization**: Optionally samples among alternative segmentations of the same text during training, which introduces randomness that improves robustness and trainability.

## Example

Consider training SentencePiece on a dataset with sentences like `['I love machine learning', 'Machines are the future']`.

1. **Initialization**:
The text is treated as raw, so spaces are also considered symbols.

2. **Iterative Process**:
Using algorithms like BPE or unigram, frequent subwords or characters are merged or kept as potential tokens.

3. **Result**:
After training, a sentence like `I love machines` might be tokenized as `['▁I', '▁love', '▁machines']` (by default, SentencePiece also adds the `▁` marker to the start of the text).
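The role of the whitespace marker can be shown without the library itself. The sketch below only imitates the convention (spaces become `▁`, and a marker is assumed to be added in front of the first word); the helper names and the segmentation in `pieces` are hypothetical, and no SentencePiece model is trained here.

```python
WHITESPACE_MARK = "\u2581"  # the "▁" meta symbol

def mark_whitespace(text):
    """Prefix the text with the marker and replace every space with it."""
    return WHITESPACE_MARK + text.replace(" ", WHITESPACE_MARK)

def detokenize(pieces):
    """Concatenate the pieces and turn the markers back into spaces."""
    text = "".join(pieces).replace(WHITESPACE_MARK, " ")
    return text.lstrip(" ")  # drop the space produced by the leading marker

marked = mark_whitespace("I love machines")

# Hypothetical segmentation that a trained model might produce
pieces = ["\u2581I", "\u2581love", "\u2581machines"]

# The pieces cover the marked text exactly, so nothing is lost
assert "".join(pieces) == marked
print(detokenize(pieces))  # I love machines
```

Because the spaces are stored inside the pieces, detokenization is plain string concatenation and needs no language-specific rules, which is what makes the approach suitable for languages that do not separate words with spaces.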

## Language Models Using SentencePiece

SentencePiece has been adopted by several models and platforms:

- **ALBERT**: A lite version of BERT, ALBERT uses SentencePiece for its tokenization.
- **T2T (Tensor2Tensor)**: The Tensor2Tensor library from Google uses SentencePiece for some of its tokenization.
- **OpenNMT**: This open-source neural machine translation framework supports SentencePiece.
- **Llama 2**: This open-source model uses a SentencePiece tokenizer.

The advantage of SentencePiece is its flexibility in handling multiple languages and scripts without the need for pre-tokenization, making it suitable for multilingual models and systems.

## Conclusion

SentencePiece provides a versatile and efficient tokenization method, especially for languages with complex scripts or for multilingual models. Its ability to train directly on raw text with a vocabulary size fixed in advance makes it a popular choice for modern NLP tasks.

For more details and implementation, you can refer to the [SentencePiece GitHub repository](https://github.com/google/sentencepiece).



155 changes: 155 additions & 0 deletions docs/Vector Embeddings.md
@@ -0,0 +1,155 @@
# Vector Embeddings

Vector embeddings are mathematical representations of objects, such as words, documents, images, or videos, in a continuous vector space. These embeddings capture the semantic meaning or inherent properties of the objects, making them useful for various machine learning and deep learning tasks.

## What are Vector Embeddings?

At a high level, vector embeddings transform objects into fixed-size vectors in a way that similar objects are close to each other in the vector space, while dissimilar ones are farther apart. This transformation allows algorithms to understand and process complex objects in a more structured and meaningful manner.
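A toy example may help to make "close in the vector space" concrete. The three-dimensional vectors below are invented for illustration (real embeddings come from a trained model and have hundreds of dimensions), and the helper `nearest` is a hypothetical name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

def nearest(query, table):
    """Return the key whose vector is most similar to the query's vector."""
    others = (key for key in table if key != query)
    return max(others, key=lambda key: cosine_similarity(table[query], table[key]))

print(nearest("cat", embeddings))  # kitten
```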

## Types of Embeddings

### 1. Text Embeddings

- **Word Embeddings**: Represent individual words as vectors. Examples include Word2Vec, GloVe, and FastText.
- **Sentence/Document Embeddings**: Represent entire sentences or documents as vectors. Examples include Doc2Vec, BERT embeddings, and Universal Sentence Encoder.

### 2. Audio Embeddings

Audio embeddings capture the inherent properties of audio signals, such as pitch, tempo, and timbre. They are used in tasks like audio classification, speaker recognition, and music recommendation.

### 3. Image Embeddings

Image embeddings transform visual data into a vector space. These embeddings capture the content and context of images. Popular methods include embeddings from pre-trained models like VGG, ResNet, and Inception.

### 4. Video Embeddings

Video embeddings represent videos in a continuous vector space by capturing both spatial (image frames) and temporal (sequence of frames) information. They are crucial for tasks like video classification, recommendation, and anomaly detection.

### 5. Document Embeddings

Document embeddings represent entire documents, capturing the overall semantic meaning. They are used in tasks like document clustering, classification, and information retrieval.


## Point of Interest: Document Embeddings

Document embeddings are vector representations of entire documents or paragraphs. Unlike word embeddings, which represent individual words, document embeddings capture the overall semantic content and context of a document.

## Why Document Embeddings?

While word embeddings provide representations for individual words, in many tasks like document classification, similarity computation, or information retrieval, we need to understand the document as a whole. Document embeddings provide a holistic view of the entire document, making them suitable for such tasks.

## Methods for Generating Document Embeddings

### 1. Averaging Word Embeddings

One of the simplest methods is to average the word embeddings of all words in a document. This method, while straightforward, can be surprisingly effective for many tasks.

```python
import numpy as np
from gensim.models import Word2Vec

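# Load a pre-trained Word2Vec model
model = Word2Vec.load("path_to_pretrained_model")

def document_embedding(doc):
    # `doc` is a list of tokens; words missing from the vocabulary are skipped.
    # gensim >= 4.0 exposes the vocabulary as `model.wv.key_to_index`.
    embeddings = [model.wv[word] for word in doc if word in model.wv.key_to_index]
    if not embeddings:
        # No known words: fall back to a zero vector of the right size
        return np.zeros(model.wv.vector_size)
    return np.mean(embeddings, axis=0)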
```

### 2. Doc2Vec

Doc2Vec, an extension of Word2Vec, is specifically designed to produce document embeddings. It considers the context of words along with a unique identifier for each document.

```python
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

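# Prepare data: `corpus` is a list of tokenized documents (toy placeholder)
corpus = [
    ["word1", "word2", "word3"],
    ["word2", "word4", "word5"],
]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]

# Train a Doc2Vec model
model = Doc2Vec(documents, vector_size=100, window=2, min_count=1, workers=4)

# Infer a vector for a new, unseen document
vector = model.infer_vector(["word1", "word2", "word3"])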
```

### 3. Pre-trained Transformers

Benchmark: the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard)

[Medium - Helper Code](https://medium.com/@ryanntk/choosing-the-right-embedding-model-a-guide-for-llm-applications-7a60180d28e3)


Modern transformer models like BERT, RoBERTa, and DistilBERT can be used to obtain document embeddings by averaging the embeddings of all tokens or using the embedding of a special token (e.g., [CLS]).


1. BERT

```python
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

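def get_bert_embedding(text):
    # Truncate to BERT's 512-token input limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    # Mean-pool the token embeddings of the single input sequence
    return outputs.last_hidden_state[0].mean(dim=0).numpy()
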
```
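The function above averages over every token of one input. When several texts are encoded in a batch, shorter texts are padded, and the padding positions should not contribute to the average. The sketch below shows the two pooling strategies on plain Python lists, so that it runs without a model; the numbers are made up, and `mean_pool` and `cls_pool` are hypothetical helper names. With a real model, the mask would be the `attention_mask` returned by the tokenizer.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]

def cls_pool(token_embeddings):
    """Use the vector of the first token ([CLS] in BERT) as the embedding."""
    return list(token_embeddings[0])

# Toy sequence: two real tokens followed by one padding position
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

print(mean_pool(token_embeddings, attention_mask))  # [2.0, 3.0]
print(cls_pool(token_embeddings))                   # [1.0, 2.0]
```

Which strategy works better depends on the model and the task; models trained for sentence similarity usually document the pooling they expect.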

2. SentenceTransformers

SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings. (https://partee.io/2022/08/11/vector-embeddings/)

Install the Sentence Transformers library first, for example with `pip install sentence-transformers`.

```python
import numpy as np

from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

# Define the model we want to use (it'll download itself)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

sentences = [
    "That is a very happy person",
    "That is a happy dog",
    "Today is a sunny day"
]

# vector embeddings created from dataset
embeddings = model.encode(sentences)

# query vector embedding
query_embedding = model.encode("That is a happy person")

# define our distance metric
def cosine_similarity(a, b):
    return np.dot(a, b)/(norm(a)*norm(b))

# run semantic similarity search
print("Query: That is a happy person")
for e, s in zip(embeddings, sentences):
    print(s, " -> similarity score = ",
          cosine_similarity(e, query_embedding))

```
```
Query: That is a happy person

That is a very happy person -> similarity score = 0.94291496
That is a happy dog -> similarity score = 0.69457746
Today is a sunny day -> similarity score = 0.25687605
```


## Conclusion

Vector embeddings play a pivotal role in modern machine learning and artificial intelligence systems. By converting complex objects into continuous vector representations, they enable algorithms to process and understand data in more nuanced and sophisticated ways.

For a deeper dive into each type of embedding and their applications, numerous research papers, tutorials, and courses are available online.