A tokenizer is the translator for an LLM; it consists of encode and decode functions.
- the original code : https://github.com/karpathy/minbpe/
- try various tokenizers online : https://tiktokenizer.vercel.app/
- tiktoken, a BPE (Byte Pair Encoding) tokenizer library : https://github.com/openai/tiktoken/
- train your own tokenizer with https://github.com/google/sentencepiece/
- article about anomalous ("glitch") tokens in GPT tokenizers : https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation#
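
To make the encode/decode idea concrete, here is a minimal byte-level BPE sketch. It is an illustration under stated assumptions, not the code from any of the repositories linked above: the function names (`train`, `encode`, `decode`, `merge`, `get_pair_counts`) and the choice to start from the 256 raw byte values are assumptions made for this example.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (ids 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, kept in the order the merges were learned
    for _ in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + len(merges)
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Text -> token ids: apply learned merges, earliest-learned first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        counts = get_pair_counts(ids)
        # Pick the present pair that was learned earliest during training.
        pair = min(counts, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies any more
        ids = merge(ids, pair, merges[pair])
    return ids


def decode(ids, merges):
    """Token ids -> text: expand each id back to its bytes, then decode UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dicts preserve insertion order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

With this sketch, `decode(encode(text, merges), merges)` should return `text` for any string, since unmerged bytes always fall back to their own ids. Real tokenizers add more on top, such as regex pre-splitting and special tokens, which are left out here.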
MIT