# Tokenizer Summary

Tokenization is the process of converting input text into a list of tokens. In the context of transformers, this means splitting the input text into words, subwords, or symbols (such as punctuation) that the model is trained on (see the [Hugging Face tokenizer summary](https://huggingface.co/docs/transformers/tokenizer_summary)).
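
As a quick, hedged illustration (not taken from the linked docs), the snippet below loads a pretrained tokenizer through the `transformers` library and splits a sentence into subword tokens; the checkpoint name `bert-base-uncased` is just one arbitrary choice.

```python
# Minimal illustration with the Hugging Face `transformers` library.
# The checkpoint "bert-base-uncased" is one arbitrary choice of pretrained tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Tokenization converts raw text into tokens."))
# Prints a list of subword tokens; pieces that continue a word carry a '##' prefix.
```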

This summary covers three types of tokenizers:

- Byte-Pair Encoding (BPE)
- WordPiece
- SentencePiece

# Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) is a subword tokenization method used to represent open vocabularies effectively. It was originally introduced as a byte-level compression algorithm and has since been adapted for tokenization in natural language processing, especially in neural machine translation.

## How BPE Works

1. **Initialization**: Represent each word as a sequence of characters, plus a special end-of-word symbol (e.g., `</w>`).

2. **Iterative process**: Repeatedly merge the most frequent pair of consecutive symbols into a new symbol (see the sketch below).

3. **Stop**: Stop after a fixed number of merges or once the desired vocabulary size is reached.
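
As referenced in step 2, here is a minimal sketch of the two core operations, counting adjacent symbol pairs and merging the chosen pair, written in plain Python for illustration rather than taken from any particular library. Words are assumed to be stored as space-separated symbols ending in `</w>`.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Return a new vocabulary in which every occurrence of `pair` is fused into one symbol."""
    merged_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_vocab[" ".join(out)] = freq
    return merged_vocab
```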

## Example

Consider applying BPE to the vocabulary `['low', 'lower', 'newest', 'widest']`.

1. **Initialization**:

```
low</w>    -> l o w </w>
lower</w>  -> l o w e r </w>
newest</w> -> n e w e s t </w>
widest</w> -> w i d e s t </w>
```

2. **Iterative process**:
- First merge: `e` and `s` form the most frequent pair, so they are merged into `es`.
- Second merge: `es` and `t` are now the most frequent pair, so they are merged into `est`.
- This continues until the desired number of merges is reached.

3. **Result**:
After several merges, the vocabulary might contain subwords such as `l`, `o`, `w`, `e`, `r`, `n`, `d`, `i`, `t`, `es`, `est`, and `</w>`.
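
Putting the two helpers from the sketch above together, the loop below replays the merges described in this example; the word frequencies are made up purely for illustration.

```python
# Words from the example, with made-up frequencies; each word ends in '</w>'.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

merges = []
for _ in range(10):  # the number of merges is a free parameter
    pair_counts = get_pair_counts(vocab)
    if not pair_counts:
        break
    best = max(pair_counts, key=pair_counts.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges[:2])  # with these frequencies: [('e', 's'), ('es', 't')]
```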

## Language Models Using BPE

BPE has been used in many state-of-the-art language models and neural machine translation systems. Notable examples include:

- **OpenAI's GPT-2**: uses a byte-level variant of BPE for its tokenization.
- **BERT**: uses WordPiece rather than BPE, but the two approaches are conceptually similar.
- **Transformer-based neural machine translation models**: such as the models in the "Attention Is All You Need" paper.

BPE lets these models handle rare and out-of-vocabulary words by breaking them down into known subwords, enabling more flexible and robust tokenization.
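
To see this behaviour in a real model, the snippet below runs GPT-2's byte-level BPE tokenizer from the `transformers` library on a rare word; the exact pieces it returns depend on GPT-2's learned merges, so the comment is only indicative.

```python
# GPT-2's byte-level BPE tokenizer, loaded via Hugging Face `transformers`.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("unbelievability"))
# The word is unlikely to be a single vocabulary entry, so it comes back as
# several subword pieces rather than an out-of-vocabulary failure.
```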

## Conclusion

Byte-Pair Encoding is a powerful tokenization technique that bridges the gap between character-level and word-level tokenization. It is especially useful for languages with large vocabularies or for tasks where out-of-vocabulary words are common.

For more in-depth information, refer to the original [BPE paper](https://arxiv.org/abs/1508.07909).

# WordPiece Tokenization

WordPiece is a subword tokenization method that is widely used in state-of-the-art natural language processing models. It is designed to represent large vocabularies efficiently.

## How WordPiece Works

1. **Initialization**: Begin with the character vocabulary of the training data.

2. **Subword creation**: Iteratively merge pairs of existing symbols into new subwords, choosing at each step the pair whose merge most increases the likelihood of the training data, rather than simply the most frequent pair as in BPE (see the scoring sketch below).

3. **Stop**: The process is usually stopped when the desired vocabulary size is reached.
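
A common way to state the selection rule from step 2 is the pair score freq(ab) / (freq(a) × freq(b)): a pair is merged when it occurs together much more often than its parts would suggest. The sketch below illustrates that scoring with made-up counts for the words used in the example that follows; it is not the implementation of any particular library.

```python
from collections import Counter

def best_wordpiece_pair(vocab):
    """Pick the adjacent pair with the highest score freq(ab) / (freq(a) * freq(b))."""
    pair_counts, symbol_counts = Counter(), Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return max(
        pair_counts,
        key=lambda p: pair_counts[p] / (symbol_counts[p[0]] * symbol_counts[p[1]]),
    )

# Made-up frequencies for the words used in the example below.
vocab = {
    "u n w a n t e d": 3,
    "u n w a r r a n t e d": 2,
    "u n d e r": 4,
}
print(best_wordpiece_pair(vocab))  # with these counts: ('w', 'a')
```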

## Example

Consider applying WordPiece to the vocabulary `['unwanted', 'unwarranted', 'under']`.

1. **Initialization**:
```
unwanted    -> u n w a n t e d
unwarranted -> u n w a r r a n t e d
under       -> u n d e r
```

2. **Iterative process**:
- First merge: `u` and `n` might form the highest-scoring pair, so they are merged into the subword `un`.
- Second merge: `wa` or `ed` might score highest next, and so on.
- This continues until the desired vocabulary size is reached.

3. **Result**:
After several iterations, we might end up with subwords like `un`, `wa`, `rr`, `ed`, `d`, `e`, and so on.
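
At inference time, BERT-style WordPiece segments each word greedily, always taking the longest piece in the vocabulary that matches the remaining text and prefixing word-internal pieces with `##`. Below is a minimal sketch of that matching loop with a made-up toy vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation in the style of BERT's WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # word-internal pieces carry '##'
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# A made-up vocabulary, just to exercise the function.
toy_vocab = {"un", "want", "##want", "##ed", "##er", "d"}
print(wordpiece_tokenize("unwanted", toy_vocab))  # ['un', '##want', '##ed']
```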

## Language Models Using WordPiece

WordPiece has been adopted by several prominent models in the NLP community:

- **BERT**: uses WordPiece for its tokenization, which is one of the reasons it handles such a wide range of NLP tasks well.
- **DistilBERT**: a distilled version of BERT that also uses WordPiece.
- **MobileBERT**: optimized for mobile devices, this model also employs WordPiece tokenization.

The advantage of WordPiece is its ability to break down out-of-vocabulary words into subwords that are present in its vocabulary, allowing for better generalization and handling of rare words.
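
The same behaviour can be observed with BERT's actual vocabulary through the `transformers` library; the exact split in the comment is indicative and depends on the checkpoint.

```python
# BERT's real WordPiece vocabulary, loaded via Hugging Face `transformers`.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("unwarranted"))
# If the word is not a single vocabulary entry, it is split into pieces such as
# ['un', '##war', '##ranted'] (the exact pieces depend on the checkpoint's vocabulary).
```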

## Conclusion

WordPiece tokenization strikes a balance between character-level and word-level representations, making it a popular choice for models that need to handle diverse vocabularies without significantly increasing computational requirements.

For more details, refer to the [BERT paper](https://arxiv.org/abs/1810.04805), in which WordPiece tokenization plays a crucial role.

# SentencePiece Tokenization

SentencePiece is a data-driven, unsupervised text tokenizer and detokenizer aimed mainly at neural text generation systems where the vocabulary size is fixed before model training. It implements subword units (e.g., byte-pair encoding (BPE) and the unigram language model) with the extension of training directly from raw sentences.

## How SentencePiece Works

1. **Training**: SentencePiece trains its tokenization model directly from raw sentences and does not require any preliminary word segmentation.

2. **Vocabulary management**: Rather than operating on pre-split words, SentencePiece treats the text as a raw character stream in which whitespace is encoded as a special symbol (`▁`), allowing consistent and reversible tokenization for any input.

3. **Subword regularization**: Optionally introduces randomness into the segmentation (sampling among alternative tokenizations) to improve the robustness of the downstream model (see the training sketch below).
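
The sketch below shows how a model along these lines might be trained and used with the `sentencepiece` Python package; the corpus file, model prefix, vocabulary size, and model type are all arbitrary choices for illustration.

```python
# A hedged sketch using the `sentencepiece` Python package.
import sentencepiece as spm

# Train directly on raw text (one sentence per line), with no pre-tokenization.
# "corpus.txt" is a hypothetical file; the other settings are arbitrary examples.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="toy_sp",   # writes toy_sp.model and toy_sp.vocab
    vocab_size=8000,
    model_type="unigram",    # "bpe" is the other common choice
)

sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("I love machine learning", out_type=str))
# Whitespace is carried inside the tokens as the '▁' symbol.
```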

## Example

Consider training SentencePiece on a dataset with sentences like `['I love machine learning', 'Machines are the future']`.

1. **Initialization**:
   The text is treated as a raw stream, so spaces are also handled as symbols.

2. **Iterative process**:
   Using an algorithm such as BPE or the unigram language model, frequent subwords or characters are merged or kept as candidate tokens.

3. **Result**:
   After training, a sentence like `I love machines` might be tokenized as `['▁I', '▁love', '▁machines']`, where `▁` marks the preceding whitespace.
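
Subword regularization (step 3 above) is exercised at encoding time by sampling a segmentation instead of always taking the single best one. The sketch below assumes the hypothetical `toy_sp.model` trained in the earlier sketch.

```python
# Sampling-based encoding (subword regularization) with the model trained above.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="toy_sp.model")

# With sampling enabled, repeated calls may return different, equally valid segmentations.
for _ in range(3):
    print(sp.encode("I love machines", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```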

## Language Models Using SentencePiece

SentencePiece has been adopted by several models and platforms:

- **ALBERT**: a lite version of BERT, ALBERT uses SentencePiece for its tokenization.
- **T2T (Tensor2Tensor)**: the Tensor2Tensor library from Google uses SentencePiece for some of its tokenization.
- **OpenNMT**: this open-source neural machine translation framework supports SentencePiece.
- **LLaMA 2**: Meta's openly released model family also uses a SentencePiece tokenizer.

The advantage of SentencePiece is its flexibility in handling multiple languages and scripts without the need for pre-tokenization, making it well suited to multilingual models and systems.
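
As a quick check of the `▁` convention in a released model, the snippet below loads ALBERT's pretrained SentencePiece tokenizer through the `transformers` library; the exact pieces in the comment are indicative only.

```python
# ALBERT's pretrained SentencePiece tokenizer, loaded via Hugging Face `transformers`.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
print(tokenizer.tokenize("I love machines"))
# Expect pieces carrying the '▁' word-boundary marker, e.g. ['▁i', '▁love', '▁machines'].
```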

## Conclusion

SentencePiece provides a versatile and efficient tokenization method, especially for languages with complex scripts or for multilingual models. Its ability to train directly on raw text and to work with a predetermined vocabulary size makes it a popular choice for modern NLP tasks.

For more details and an implementation, refer to the [SentencePiece GitHub repository](https://github.com/google/sentencepiece).