A simple Python BPE implementation
- vocab is sorted by token length and frequency
- py_wiki.vocab is a pretrained tokenizer, trained on 1k Wikipedia articles and 4k Python scripts; it has a vocab size of 10_000
- hard_stop=False splits a sequence over max_len into batches to preserve the original text
- with hard_stop=False, max_len=512, it encodes ~100 Wikipedia articles per second
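The hard_stop=False behavior described above can be pictured as simple chunking: a token sequence longer than max_len is split into consecutive batches so no text is dropped. This is a hypothetical sketch of that idea, not this library's internal code; the function name is made up for illustration.

```python
# Hypothetical sketch: split a long token sequence into batches of at
# most max_len tokens, keeping every token (nothing is truncated).
def split_into_batches(tokens, max_len):
    """Return consecutive chunks of `tokens`, each at most `max_len` long."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

batches = split_into_batches(list(range(10)), 4)
# the last batch may be shorter than max_len
```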
Train
text = """
Byte pair encoding[1][2] or digram coding[3] is a simple
form of data compression in which the most common pair of consecutive
bytes of data is replaced with a byte that does not occur within that data."""
from BPE import BPEtokenizer
bpe = BPEtokenizer()
bpe.train(text,1000,75,'test.vocab')
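What train does under the hood is the standard BPE loop: repeatedly count adjacent token pairs and merge the most frequent pair into a new token. This is a minimal sketch of one merge step, not this library's actual implementation; the helper names are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair, new_token):
    """Replace every left-to-right occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("aaabdaaabac")
pair = most_frequent_pair(tokens)        # ('a', 'a') is the most common pair
tokens = merge_pair(tokens, pair, "aa")  # training repeats this until the vocab is full
```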
Load
bpe = bpe.load('test.vocab')
Encode and Decode
token_list = bpe.encode('Hello World this is the BPE tokenizer',4,False,True,True)
print(token_list)
for token in token_list:
    print(bpe.decode(token))
Print Vocab
print(bpe.tokens)