
Commit 12c9221

Added Tokenizer and Vector Embedding documents (#73)
1 parent 37a9c8b commit 12c9221

2 files changed: +311 −0 lines changed

docs/Tokenizer.md

+156
# Tokenizer Summary

Tokenization is the process of converting input text into a list of tokens. In the context of transformers, tokenization includes splitting the input text into words, subwords, or symbols (like punctuation) that are used to train the model.
(https://huggingface.co/docs/transformers/tokenizer_summary)

We will explain three types of tokenizers:

- Byte-Pair Encoding (BPE)
- WordPiece
- SentencePiece

# Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) is a subword tokenization method that is used to represent open vocabularies effectively. It was originally introduced for byte-level compression but has since been adapted for tokenization in natural language processing, especially in neural machine translation.

## How BPE Works

1. **Initialization**: Start by representing each word as a sequence of characters, plus a special end-of-word symbol (e.g., `</w>`).

2. **Iterative Process**: Repeatedly merge the most frequent pair of consecutive symbols or characters.

3. **Stop**: The process can be stopped either after a certain number of merges or when a desired vocabulary size is achieved.

## Example

Consider the vocabulary `['low', 'lower', 'newest', 'widest']`, to which we want to apply BPE.

1. **Initialization**:

```
low</w> -> l o w </w>
lower</w> -> l o w e r </w>
newest</w> -> n e w e s t </w>
widest</w> -> w i d e s t </w>
```

2. **Iterative Process**:
   - First merge: `e` and `s` are the most frequent pair, so merge them to form `es`.
   - Second merge: `es` and `t` are now the most frequent pair, so merge them to form `est`.
   - Continue this process until the desired number of merges is achieved.

3. **Result**:
   After several iterations, we might end up with subwords like `l`, `o`, `w`, `e`, `r`, `n`, `es`, `est`, `i`, `d`, `t`, and `</w>` (a minimal sketch of the merge loop follows this example).

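Below is a minimal, illustrative sketch of the BPE merge-learning loop in Python. It assumes a toy word-frequency table matching the example above; it is not the reference implementation from the BPE paper, and the number of merges is an arbitrary choice.

```python
from collections import Counter

# Toy corpus: word -> frequency, with each word pre-split into characters plus </w>.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair, vocab):
    """Rebuild every word, joining each occurrence of `pair` into a single symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

num_merges = 5  # assumption: stop after a fixed number of merges
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = apply_merge(best, vocab)
    print("merged:", best)
```

With the frequencies above, the first two merges come out as `('e', 's')` and `('es', 't')` (ties are broken by insertion order), matching the walkthrough.
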
## Language Models Using BPE

BPE has been used in various state-of-the-art language models and neural machine translation models. Some notable models include:

- **OpenAI's GPT-2**: This model uses a variant of BPE (byte-level BPE) for its tokenization.
- **BERT**: BERT uses WordPiece rather than BPE, but the two methods are conceptually similar.
- **Transformer-based Neural Machine Translation models**: Such as those in the "Attention Is All You Need" paper.

BPE allows these models to handle rare words and out-of-vocabulary words by breaking them down into known subwords, enabling more flexible and robust tokenization, as the short demo below illustrates.

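For instance, a quick way to see this behaviour (assuming the Hugging Face `transformers` library is installed) is to run GPT-2's pretrained BPE tokenizer on a word that is unlikely to be a single vocabulary entry:

```python
from transformers import GPT2Tokenizer

# GPT-2 ships with a pretrained byte-level BPE vocabulary.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# A rare or unseen word is split into known subword units instead of an <unk> token.
print(tokenizer.tokenize("tokenization"))
# Expected output is something like ['token', 'ization'];
# the exact split depends on the learned merge table.
```
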
## Conclusion

Byte-Pair Encoding is a powerful tokenization technique that bridges the gap between character-level and word-level tokenization. It's especially useful for languages with large vocabularies or for tasks where out-of-vocabulary words are common.

For more in-depth information, refer to the original [BPE paper](https://arxiv.org/abs/1508.07909).

# WordPiece Tokenization

WordPiece is a subword tokenization method that is widely used in various state-of-the-art natural language processing models. It's designed to efficiently represent large vocabularies.

## How WordPiece Works

1. **Initialization**: Begin with the entire training data's character vocabulary.

2. **Subword Creation**: Iteratively create subwords by merging pairs of symbols. Unlike BPE, which merges the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training data once added to the vocabulary.

3. **Stop**: The process is usually stopped when a desired vocabulary size is reached.

## Example

Consider the vocabulary `['unwanted', 'unwarranted', 'under']`, to which we want to apply WordPiece.

1. **Initialization**:

```
unwanted -> u n w a n t e d
unwarranted -> u n w a r r a n t e d
under -> u n d e r
```

2. **Iterative Process**:
   - First merge: `un` might be the most frequent subword, so it's kept as a subword.
   - Second merge: `wa` or `ed` might be the next most frequent, and so on.
   - Continue this process until the desired vocabulary size is achieved.

3. **Result**:
   After several iterations, we might end up with subwords like `un`, `wa`, `rr`, `ed`, `d`, `e`, and so on. In BERT's implementation, subwords that do not begin a word are written with a `##` prefix (e.g., `un`, `##want`, `##ed`), as the sketch after this example shows.

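As an illustration (assuming the Hugging Face `transformers` library is available), BERT's pretrained WordPiece tokenizer shows this subword splitting and the `##` continuation prefix directly:

```python
from transformers import BertTokenizer

# BERT ships with a pretrained WordPiece vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("unwanted unwarranted under"))
# Words missing from the vocabulary are split into pieces, with non-initial
# pieces prefixed by "##" (e.g. 'unwarranted' may become ['un', '##war', '##rant', '##ed']);
# the exact splits depend on the learned vocabulary.
```
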
## Language Models Using WordPiece

WordPiece has been adopted by several prominent models in the NLP community:

- **BERT**: BERT uses WordPiece for its tokenization, which is one of the reasons for its success in handling a wide range of NLP tasks.
- **DistilBERT**: A distilled version of BERT that also uses WordPiece.
- **MobileBERT**: Optimized for mobile devices, this model also employs WordPiece tokenization.

The advantage of WordPiece is its ability to break down out-of-vocabulary words into subwords present in its vocabulary, allowing for better generalization and handling of rare words.

## Conclusion

WordPiece tokenization strikes a balance between character-level and word-level representations, making it a popular choice for models that need to handle diverse vocabularies without significantly increasing computational requirements.

For more details, you can refer to the [BERT paper](https://arxiv.org/abs/1810.04805), where WordPiece tokenization played a crucial role.

# SentencePiece Tokenization

SentencePiece is a data-driven, unsupervised text tokenizer and detokenizer mainly intended for neural network-based text generation systems where the vocabulary size is predetermined prior to neural model training. SentencePiece implements subword units (e.g., byte-pair encoding (BPE) and the unigram language model) with the extension of training directly from raw sentences.

## How SentencePiece Works

1. **Training**: SentencePiece trains a tokenization model from raw sentences and does not require any preliminary tokenization.

2. **Vocabulary Management**: Instead of words, SentencePiece handles the text as raw input, and spaces are treated as a special symbol, allowing consistent tokenization for any input.

3. **Subword Regularization**: Introduces randomness in the tokenization process to improve robustness and trainability.

## Example

Consider training SentencePiece on a dataset with sentences like `['I love machine learning', 'Machines are the future']`.

1. **Initialization**:
   The text is treated as raw input, so spaces are also considered symbols.

2. **Iterative Process**:
   Using algorithms like BPE or unigram, frequent subwords or characters are merged or kept as potential tokens.

3. **Result**:
   After training, a sentence like `I love machines` might be tokenized as `['I', '▁love', '▁machines']`, where `▁` marks a preceding space (a minimal training sketch follows this example).

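A minimal sketch of training and applying a SentencePiece model with the `sentencepiece` Python package is shown below; the corpus file name, model prefix, vocabulary size, and model type are illustrative assumptions, not values used by any particular model.

```python
import sentencepiece as spm

# Write a tiny illustrative corpus to disk; SentencePiece trains from raw text files.
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("I love machine learning\n")
    f.write("Machines are the future\n")

# Train directly on raw sentences; no preliminary tokenization is required.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="toy_sp",
    vocab_size=30,         # arbitrary small value for this toy corpus
    model_type="unigram",  # "bpe" is also supported
)

# Load the trained model and tokenize a new sentence.
sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("I love machines", out_type=str))
# Spaces are encoded with the "▁" symbol, e.g. ['▁I', '▁love', '▁machine', 's'];
# the exact pieces depend on the tiny training corpus.
```
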
## Language Models Using SentencePiece

SentencePiece has been adopted by several models and platforms:

- **ALBERT**: A lite version of BERT, ALBERT uses SentencePiece for its tokenization.
- **T2T (Tensor2Tensor)**: The Tensor2Tensor library from Google uses SentencePiece for some of its tokenization.
- **OpenNMT**: This open-source neural machine translation framework supports SentencePiece.
- **LLaMA 2**: This open-source model uses SentencePiece.

The advantage of SentencePiece is its flexibility in handling multiple languages and scripts without the need for pre-tokenization, making it suitable for multilingual models and systems.

## Conclusion

SentencePiece provides a versatile and efficient tokenization method, especially for languages with complex scripts or for multilingual models. Its ability to train directly on raw text and manage vocabularies in a predetermined manner makes it a popular choice for modern NLP tasks.

For more details and implementation, you can refer to the [SentencePiece GitHub repository](https://github.com/google/sentencepiece).

docs/Vector Embeddings.md

+155
# Vector Embeddings

Vector embeddings are mathematical representations of objects, such as words, documents, images, or videos, in a continuous vector space. These embeddings capture the semantic meaning or inherent properties of the objects, making them useful for various machine learning and deep learning tasks.

## What are Vector Embeddings?

At a high level, vector embeddings transform objects into fixed-size vectors such that similar objects are close to each other in the vector space, while dissimilar ones are farther apart. This transformation allows algorithms to understand and process complex objects in a more structured and meaningful manner.

## Types of Embeddings

### 1. Text Embeddings

- **Word Embeddings**: Represent individual words as vectors. Examples include Word2Vec, GloVe, and FastText.
- **Sentence/Document Embeddings**: Represent entire sentences or documents as vectors. Examples include Doc2Vec, BERT embeddings, and the Universal Sentence Encoder.

### 2. Audio Embeddings

Audio embeddings capture the inherent properties of audio signals, such as pitch, tempo, and timbre. They are used in tasks like audio classification, speaker recognition, and music recommendation.

### 3. Image Embeddings

Image embeddings transform visual data into a vector space. These embeddings capture the content and context of images. Popular methods include embeddings from pre-trained models like VGG, ResNet, and Inception (see the sketch below).

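One common recipe, shown here as a sketch (assuming `torch`, `torchvision`, and `Pillow` are installed), is to take a pre-trained ResNet and use the activations just before its classification layer as the image embedding; the file path is a placeholder.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pre-trained ResNet and drop its final classification layer,
# keeping the 2048-dimensional pooled features as the embedding.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def image_embedding(path):
    """Return a vector embedding for the image at `path` (placeholder file name)."""
    image = Image.open(path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)
    with torch.no_grad():
        return resnet(batch).squeeze(0).numpy()  # shape: (2048,)
```
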
### 4. Video Embeddings

Video embeddings represent videos in a continuous vector space by capturing both spatial (image frames) and temporal (sequence of frames) information. They are crucial for tasks like video classification, recommendation, and anomaly detection.

### 5. Document Embeddings

Document embeddings represent entire documents, capturing the overall semantic meaning. They are used in tasks like document clustering, classification, and information retrieval.

## Point of Interest: Document Embeddings

Document embeddings are vector representations of entire documents or paragraphs. Unlike word embeddings, which represent individual words, document embeddings capture the overall semantic content and context of a document.

## Why Document Embeddings?

While word embeddings provide representations for individual words, in many tasks like document classification, similarity computation, or information retrieval, we need to understand the document as a whole. Document embeddings provide a holistic view of the entire document, making them suitable for such tasks.

## Methods for Generating Document Embeddings

### 1. Averaging Word Embeddings

One of the simplest methods is to average the word embeddings of all words in a document. This method, while straightforward, can be surprisingly effective for many tasks.

```python
import numpy as np
from gensim.models import Word2Vec

# Load a pre-trained Word2Vec model (the path is a placeholder)
model = Word2Vec.load("path_to_pretrained_model")

def document_embedding(doc):
    """Average the vectors of all in-vocabulary words in a tokenized document."""
    embeddings = [model.wv[word] for word in doc if word in model.wv.key_to_index]
    if not embeddings:
        return np.zeros(model.vector_size)
    return np.mean(embeddings, axis=0)
```

### 2. Doc2Vec

Doc2Vec, an extension of Word2Vec, is specifically designed to produce document embeddings. It considers the context of words along with a unique identifier for each document.

```python
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Prepare data: `corpus` is a list of tokenized documents (a toy example here)
corpus = [["machine", "learning", "is", "fun"],
          ["deep", "learning", "advances", "nlp"]]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]

# Train a Doc2Vec model
model = Doc2Vec(documents, vector_size=100, window=2, min_count=1, workers=4)

# Infer a vector for a new (tokenized) document
vector = model.infer_vector(["word1", "word2", "word3"])
```

### 3. Pre-trained Transformers

Benchmark: [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard)

[Medium - Helper Code](https://medium.com/@ryanntk/choosing-the-right-embedding-model-a-guide-for-llm-applications-7a60180d28e3)

Modern transformer models like BERT, RoBERTa, and DistilBERT can be used to obtain document embeddings by averaging the embeddings of all tokens or by using the embedding of a special token (e.g., `[CLS]`).

1. BERT

```python
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text):
    """Mean-pool the last hidden states of all tokens into a single document vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Alternatively, use the [CLS] token embedding: outputs.last_hidden_state[0, 0]
    return outputs.last_hidden_state[0].mean(dim=0).numpy()
```

2. SentenceTransformers

SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings. (https://partee.io/2022/08/11/vector-embeddings/)

Install the Sentence Transformers library (`pip install sentence-transformers`), then:

```python
import numpy as np

from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

# Define the model we want to use (it'll download itself)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

sentences = [
    "That is a very happy person",
    "That is a happy dog",
    "Today is a sunny day"
]

# vector embeddings created from dataset
embeddings = model.encode(sentences)

# query vector embedding
query_embedding = model.encode("That is a happy person")

# define our distance metric
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

# run semantic similarity search
print("Query: That is a happy person")
for e, s in zip(embeddings, sentences):
    print(s, " -> similarity score = ", cosine_similarity(e, query_embedding))
```

Output:

```
Query: That is a happy person

That is a very happy person -> similarity score = 0.94291496
That is a happy dog -> similarity score = 0.69457746
Today is a sunny day -> similarity score = 0.25687605
```

## Conclusion

Vector embeddings play a pivotal role in modern machine learning and artificial intelligence systems. By converting complex objects into continuous vector representations, they enable algorithms to process and understand data in more nuanced and sophisticated ways.

For a deeper dive into each type of embedding and their applications, numerous research papers, tutorials, and courses are available online.
