From ca78ed906d81fcccf93d161b53c6664613bdd20d Mon Sep 17 00:00:00 2001
From: ngupta10
Date: Sat, 23 Sep 2023 00:15:41 +0530
Subject: [PATCH] Added Tokenizer and Vector Embedding documents

---
 docs/Tokenizer.md         | 156 ++++++++++++++++++++++++++++++++++++++
 docs/Vector Embeddings.md | 155 +++++++++++++++++++++++++++++++++++++
 2 files changed, 311 insertions(+)
 create mode 100644 docs/Tokenizer.md
 create mode 100644 docs/Vector Embeddings.md

diff --git a/docs/Tokenizer.md b/docs/Tokenizer.md
new file mode 100644
index 00000000..d7f99209
--- /dev/null
+++ b/docs/Tokenizer.md
@@ -0,0 +1,156 @@

# Tokenizer Summary

Tokenization is the process of converting input text into a list of tokens. In the context of transformers, tokenization includes splitting the input text into words, subwords, or symbols (such as punctuation) that are used to train the model.
(Source: https://huggingface.co/docs/transformers/tokenizer_summary)

This document explains three types of tokenizers:

- Byte-Pair Encoding (BPE)
- WordPiece
- SentencePiece

# Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) is a subword tokenization method used to represent open vocabularies effectively. It was originally introduced for byte-level data compression and has since been adapted for tokenization in natural language processing, especially in neural machine translation.

## How BPE Works

1. **Initialization**: Represent each word as a sequence of characters, plus a special end-of-word symbol (e.g., `</w>`).

2. **Iterative Process**: Repeatedly merge the most frequent pair of consecutive symbols into a new symbol.

3. **Stop**: Stop either after a fixed number of merges or when the desired vocabulary size is reached.

## Example

Consider the vocabulary `['low', 'lower', 'newest', 'widest']`, to which we want to apply BPE.

1. **Initialization**:

```
low    -> l o w </w>
lower  -> l o w e r </w>
newest -> n e w e s t </w>
widest -> w i d e s t </w>
```

2. **Iterative Process**:
- First merge: `e` and `s` form the most frequent pair (they occur together in `newest` and `widest`), so they are merged into `es`.
- Second merge: `es` and `t` are now the most frequent pair, so they are merged into `est`.
- Continue until the desired number of merges is reached.

3. **Result**:
After several iterations, we might end up with subwords such as `l`, `o`, `w`, `e`, `r`, `n`, `es`, `est`, `i`, `d`, `t`, and `</w>`. A toy implementation of this merge loop is sketched at the end of this section.

## Language Models Using BPE

BPE has been used in various state-of-the-art language models and neural machine translation models. Notable examples include:

- **OpenAI's GPT-2**: Uses a byte-level variant of BPE for its tokenization.
- **BERT**: BERT uses WordPiece, which is conceptually similar to BPE.
- **Transformer-based neural machine translation models**: Such as the original Transformer from the "Attention Is All You Need" paper.

BPE allows these models to handle rare and out-of-vocabulary words by breaking them down into known subwords, enabling more flexible and robust tokenization.

## Conclusion

Byte-Pair Encoding is a powerful tokenization technique that bridges the gap between character-level and word-level tokenization. It is especially useful for languages with large vocabularies or for tasks where out-of-vocabulary words are common.

For more in-depth information, refer to the original [BPE paper](https://arxiv.org/abs/1508.07909).
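The merge loop described above can be made concrete in a few lines of Python. The sketch below follows the counting-and-merging procedure from the BPE paper; the word frequencies are invented for illustration, and a real tokenizer would also record the resulting merge table.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair (as separate symbols) with the merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words as space-separated symbols with an end-of-word marker.
# The frequencies are made up for illustration.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(4):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")
# merge 1: ('e', 's') -> es
# merge 2: ('es', 't') -> est
# merge 3: ('est', '</w>') -> est</w>
# merge 4: ('l', 'o') -> lo
```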
# WordPiece Tokenization

WordPiece is a subword tokenization method that is widely used in state-of-the-art natural language processing models. It is designed to represent large vocabularies efficiently.

## How WordPiece Works

1. **Initialization**: Begin with a vocabulary containing every individual character in the training data.

2. **Subword Creation**: Iteratively create new subwords by merging the pair of existing units that most increases the likelihood of the training data. Unlike BPE, WordPiece uses this likelihood-based score rather than raw pair frequency.

3. **Stop**: The process is usually stopped when the desired vocabulary size is reached.

## Example

Consider the vocabulary `['unwanted', 'unwarranted', 'under']`, to which we want to apply WordPiece.

1. **Initialization**:
```
unwanted    -> u n w a n t e d
unwarranted -> u n w a r r a n t e d
under       -> u n d e r
```

2. **Iterative Process**:
- First merge: `un` might score highest, so it is kept as a subword.
- Second merge: `wa` or `ed` might be selected next, and so on.
- Continue until the desired vocabulary size is reached.

3. **Result**:
After several iterations, we might end up with subwords such as `un`, `wa`, `rr`, `ed`, `d`, `e`, and so on. In BERT's vocabulary, subwords that continue a word are written with a `##` prefix (e.g., `##want`, `##ed`).

## Language Models Using WordPiece

WordPiece has been adopted by several prominent models in the NLP community:

- **BERT**: BERT uses WordPiece for its tokenization, which is one of the reasons for its success on a wide range of NLP tasks.
- **DistilBERT**: A distilled version of BERT that also uses WordPiece.
- **MobileBERT**: Optimized for mobile devices, this model also employs WordPiece tokenization.

The advantage of WordPiece is its ability to break out-of-vocabulary words down into subwords that are present in its vocabulary, allowing for better generalization and handling of rare words.

## Conclusion

WordPiece tokenization strikes a balance between character-level and word-level representations, making it a popular choice for models that need to handle diverse vocabularies without significantly increasing computational requirements.

For more details, refer to the [BERT paper](https://arxiv.org/abs/1810.04805), in which WordPiece tokenization plays a crucial role.


# SentencePiece Tokenization

SentencePiece is a data-driven, unsupervised text tokenizer and detokenizer intended mainly for neural text generation systems in which the vocabulary size is fixed before the neural model is trained. It implements subword units (e.g., byte-pair encoding (BPE) and the unigram language model) and extends them with direct training from raw sentences.

## How SentencePiece Works

1. **Training**: SentencePiece trains its tokenization model directly from raw sentences and does not require any preliminary tokenization such as whitespace splitting.

2. **Vocabulary Management**: SentencePiece treats the text as a raw stream of characters in which whitespace is handled as an ordinary symbol (rendered as `▁`), allowing consistent and reversible tokenization for any input.

3. **Subword Regularization**: SentencePiece can sample among alternative segmentations during training, which introduces useful randomness and improves the robustness of the downstream model.

## Example

Consider training SentencePiece on a dataset with sentences like `['I love machine learning', 'Machines are the future']`.

1. **Initialization**:
   The text is treated as raw input, so spaces are also considered symbols.

2. **Iterative Process**:
   Using an algorithm such as BPE or the unigram model, frequent subwords or characters are merged or kept as candidate tokens.

3. **Result**:
   After training, a sentence like `I love machines` might be tokenized as `['▁I', '▁love', '▁machines']`.
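For concreteness, here is a minimal usage sketch with the `sentencepiece` Python package. The corpus file name, model prefix, and vocabulary size are illustrative placeholders rather than values from this repository, and the printed segmentation depends on the training data.

```python
import sentencepiece as spm

# Train a small model directly on raw text (one sentence per line);
# the file name and vocab_size are placeholders for illustration.
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # raw sentences, no pre-tokenization required
    model_prefix="toy_sp",    # writes toy_sp.model and toy_sp.vocab
    vocab_size=200,
    model_type="unigram",     # or "bpe"
)

# Load the trained model, then tokenize and detokenize
sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
pieces = sp.encode("I love machines", out_type=str)
print(pieces)                 # e.g. ['▁I', '▁love', '▁machine', 's']
print(sp.decode(pieces))      # 'I love machines'
```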
## Language Models Using SentencePiece

SentencePiece has been adopted by several models and platforms:

- **ALBERT**: A lite version of BERT, ALBERT uses SentencePiece for its tokenization.
- **T2T (Tensor2Tensor)**: Google's Tensor2Tensor library uses SentencePiece for some of its tokenization.
- **OpenNMT**: This open-source neural machine translation framework supports SentencePiece.
- **LLaMA 2**: This open-source model also uses a SentencePiece tokenizer.

The advantage of SentencePiece is its flexibility in handling multiple languages and scripts without the need for pre-tokenization, making it suitable for multilingual models and systems.

## Conclusion

SentencePiece provides a versatile and efficient tokenization method, especially for languages with complex scripts or for multilingual models. Its ability to train directly on raw text with a predetermined vocabulary size makes it a popular choice for modern NLP tasks.

For more details and the implementation, refer to the [SentencePiece GitHub repository](https://github.com/google/sentencepiece).

diff --git a/docs/Vector Embeddings.md b/docs/Vector Embeddings.md
new file mode 100644
index 00000000..87c1d1d7
--- /dev/null
+++ b/docs/Vector Embeddings.md
@@ -0,0 +1,155 @@

# Vector Embeddings

Vector embeddings are mathematical representations of objects, such as words, documents, images, or videos, in a continuous vector space. These embeddings capture the semantic meaning or inherent properties of the objects, making them useful for a wide range of machine learning and deep learning tasks.

## What are Vector Embeddings?

At a high level, vector embeddings transform objects into fixed-size vectors such that similar objects are close to each other in the vector space, while dissimilar ones are farther apart. This transformation allows algorithms to process and compare complex objects in a structured and meaningful way.

## Types of Embeddings

### 1. Text Embeddings

- **Word Embeddings**: Represent individual words as vectors. Examples include Word2Vec, GloVe, and FastText. (A small lookup example appears after this list of embedding types.)
- **Sentence/Document Embeddings**: Represent entire sentences or documents as vectors. Examples include Doc2Vec, BERT embeddings, and the Universal Sentence Encoder.

### 2. Audio Embeddings

Audio embeddings capture the inherent properties of audio signals, such as pitch, tempo, and timbre. They are used in tasks like audio classification, speaker recognition, and music recommendation.

### 3. Image Embeddings

Image embeddings transform visual data into a vector space, capturing the content and context of images. Popular methods include embeddings from pre-trained models such as VGG, ResNet, and Inception.

### 4. Video Embeddings

Video embeddings represent videos in a continuous vector space by capturing both spatial (image frames) and temporal (sequence of frames) information. They are crucial for tasks like video classification, recommendation, and anomaly detection.

### 5. Document Embeddings

Document embeddings represent entire documents, capturing their overall semantic meaning. They are used in tasks like document clustering, classification, and information retrieval.
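As a quick illustration of the first category, the sketch below looks up pre-trained GloVe word vectors through gensim's downloader API. The model name is one of gensim's published downloads, the word pairs are arbitrary examples, and running the snippet fetches the vectors over the network on first use.

```python
import gensim.downloader as api

# Download a small set of pre-trained GloVe vectors (fetched on first use).
word_vectors = api.load("glove-wiki-gigaword-50")

# Semantically related words end up close together in the vector space.
print(word_vectors.similarity("king", "queen"))    # relatively high
print(word_vectors.similarity("king", "banana"))   # much lower
print(word_vectors.most_similar("machine", topn=3))
```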
## Point of Interest: Document Embeddings

Document embeddings are vector representations of entire documents or paragraphs. Unlike word embeddings, which represent individual words, document embeddings capture the overall semantic content and context of a document.

## Why Document Embeddings?

While word embeddings provide representations for individual words, many tasks such as document classification, similarity computation, and information retrieval require understanding the document as a whole. Document embeddings provide a holistic view of the entire document, making them suitable for such tasks.

## Methods for Generating Document Embeddings

### 1. Averaging Word Embeddings

One of the simplest methods is to average the word embeddings of all words in a document. This method, while straightforward, can be surprisingly effective for many tasks.

```python
import numpy as np
from gensim.models import Word2Vec

# Load a pre-trained Word2Vec model (the path is a placeholder)
model = Word2Vec.load("path_to_pretrained_model")

def document_embedding(doc):
    """Average the vectors of all in-vocabulary tokens in `doc` (a list of tokens)."""
    embeddings = [model.wv[word] for word in doc if word in model.wv.key_to_index]
    return np.mean(embeddings, axis=0)
```

### 2. Doc2Vec

Doc2Vec, an extension of Word2Vec, is specifically designed to produce document embeddings. It considers the context of words along with a unique identifier for each document.

```python
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Prepare the data: `corpus` is a list of tokenized documents,
# e.g. [["i", "love", "machine", "learning"], ...]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]

# Train a Doc2Vec model
model = Doc2Vec(documents, vector_size=100, window=2, min_count=1, workers=4)

# Infer a vector for a new (tokenized) document
vector = model.infer_vector(["word1", "word2", "word3"])
```

### 3. Pre-trained Transformers

Benchmark: [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard)

[Medium - Helper Code](https://medium.com/@ryanntk/choosing-the-right-embedding-model-a-guide-for-llm-applications-7a60180d28e3)

Modern transformer models such as BERT, RoBERTa, and DistilBERT can be used to obtain document embeddings by averaging the embeddings of all tokens or by using the embedding of a special token (e.g., `[CLS]`).

1. BERT

```python
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text):
    # Mean-pool the token embeddings of the last hidden layer
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0].mean(dim=0).numpy()
```
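As a usage sketch (reusing the `tokenizer`, `model`, `torch`, and `get_bert_embedding` defined above), the variant below takes the `[CLS]` token's hidden state instead of the mean and compares two short documents with cosine similarity; the example sentences are made up.

```python
import numpy as np

def get_cls_embedding(text):
    # Alternative pooling: use the hidden state of the [CLS] token (position 0)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0].numpy()

a = get_bert_embedding("The cat sat on the mat.")
b = get_bert_embedding("A cat is sitting on a rug.")
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
```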
2. SentenceTransformers

SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings. (https://partee.io/2022/08/11/vector-embeddings/)

Install the Sentence Transformers library (`pip install sentence-transformers`), then:

```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

# Define the model we want to use (it downloads itself on first use)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

sentences = [
    "That is a very happy person",
    "That is a happy dog",
    "Today is a sunny day"
]

# vector embeddings created from the dataset
embeddings = model.encode(sentences)

# query vector embedding
query_embedding = model.encode("That is a happy person")

# define our distance metric
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

# run a semantic similarity search
print("Query: That is a happy person")
for e, s in zip(embeddings, sentences):
    print(s, " -> similarity score = ",
          cosine_similarity(e, query_embedding))
```

Example output:

```
Query: That is a happy person

That is a very happy person -> similarity score = 0.94291496
That is a happy dog -> similarity score = 0.69457746
Today is a sunny day -> similarity score = 0.25687605
```

## Conclusion

Vector embeddings play a pivotal role in modern machine learning and artificial intelligence systems. By converting complex objects into continuous vector representations, they enable algorithms to process and understand data in more nuanced and sophisticated ways.

For a deeper dive into each type of embedding and its applications, numerous research papers, tutorials, and courses are available online.