OpenAI's Large Language Models (the GPT family) use a process called tokenization to convert text into numbers, since neural networks operate only on numbers. This repo is a fun project that shows how text is actually converted into tokens, and how many tokens it produces under various encodings:
- gpt2 - used in the GPT-2 model.
- gpt-3.5 - used in the GPT-3.5 and GPT-4 models.
- gpt-4o - used in the latest GPT-4o model.
The dashboard interactively shows how words are tokenized by each tokenizer in real time, as a side-by-side comparison.
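Under the hood, each encoding maps text to a list of integer token IDs. Here is a minimal sketch of the comparison the dashboard performs, assuming the app uses OpenAI's tiktoken library (which these encoding names suggest):

```python
# Side-by-side token comparison -- assumes tiktoken is the tokenizer backend.
import tiktoken

text = "Tokenization converts text into numbers!"

for model in ("gpt2", "gpt-3.5-turbo", "gpt-4o"):
    enc = tiktoken.encoding_for_model(model)    # resolve the encoding for this model
    tokens = enc.encode(text)                   # text -> list of integer token IDs
    pieces = [enc.decode([t]) for t in tokens]  # decode each ID back to its text piece
    print(f"{model}: {len(tokens)} tokens -> {pieces}")
```

Newer encodings have larger vocabularies and so generally produce fewer tokens for the same text, which is what the side-by-side view makes visible.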
Live -> On Render
- Clone this repository
git clone https://github.com/gdevakumar/Illustrative-Tokenizers.git
cd Illustrative-Tokenizers
- Install Python from here, then install the project dependencies
pip install -r requirements.txt
- Launch the web UI (a Flask application)
python3 app.py
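The command above starts the Flask server. As a rough illustration only, an app of this kind could look like the sketch below; the /tokenize route, parameter names, and response shape are hypothetical, not taken from this repo's app.py:

```python
# Hypothetical minimal Flask tokenizer endpoint -- route and JSON shape are
# assumptions for illustration, not this repo's actual API.
from flask import Flask, jsonify, request
import tiktoken

app = Flask(__name__)

@app.route("/tokenize", methods=["POST"])
def tokenize():
    data = request.get_json(force=True)
    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4
    tokens = enc.encode(data.get("text", ""))
    return jsonify({"count": len(tokens), "tokens": tokens})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)  # same port mapped in the Docker steps below
```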
Use this method if you have Docker/Docker Desktop installed.
- Clone this repository
git clone https://github.com/gdevakumar/Illustrative-Tokenizers.git
cd Illustrative-Tokenizers
- Build the Docker image (note the dot (.) at the end of the command). This may take some time on the first build
docker build -t tokenizers .
- Run the Docker image
docker run -p 5000:5000 tokenizers
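Once the container is up, the dashboard should be reachable at http://localhost:5000, the host port mapped by the -p flag above.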