BytePairEncoding-for-NLP

A simple Python BPE implementation

  • vocab is sorted by token length and frequency
  • py_wiki.vocab is a pretrained tokenizer trained on 1k Wikipedia articles and 4k Python scripts; it has a vocab size of 10_000
  • hard_stop=False splits a sequence longer than max_len into batches to preserve the original text (see the sketch after this list)
  • with hard_stop=False and max_len=512, it encodes ~100 Wikipedia articles per second
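
A minimal sketch of that batching idea (illustrative only; chunk_tokens is a hypothetical helper, not part of this repo's API):

# Hypothetical sketch: a token sequence longer than max_len is split
# into consecutive max_len-sized batches instead of being truncated.
def chunk_tokens(tokens, max_len=512):
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

ids = list(range(1300))                      # stand-in for an encoded article
print([len(b) for b in chunk_tokens(ids)])   # [512, 512, 276] -- nothing is lost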

Train

text = """
Byte pair encoding[1][2] or digram coding[3] is a simple
form of data compression in which the most common pair of consecutive
bytes of data is replaced with a byte that does not occur within that data."""

from BPE import BPEtokenizer
bpe = BPEtokenizer()
bpe.train(text,1000,75,'test.vocab')
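
For intuition, the merge rule described in the quoted text (replace the most frequent adjacent pair with a new token) can be sketched in a few lines. This is an illustration of one BPE training step under that definition, not the implementation behind bpe.train:

from collections import Counter

def merge_once(ids, new_id):
    # count adjacent pairs and pick the most frequent one
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return ids
    top = pairs.most_common(1)[0][0]
    # rewrite the sequence, replacing every occurrence of that pair
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == top:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged

ids = list(b"aaabdaaabac")     # raw bytes as integer ids
print(merge_once(ids, 256))    # (97, 97) -> 256: [256, 97, 98, 100, 256, 97, 98, 97, 99]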

Load

bpe = bpe.load('test.vocab')  # load a previously trained vocab (e.g. 'test.vocab' or 'py_wiki.vocab')

Encode and Decode

token_list = bpe.encode('Hello World this is the BPE tokenizer', 4, False, True, True)
print(token_list)

for token in token_list:
    print(bpe.decode(token))
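
Assuming the False above is the hard_stop flag described earlier, token_list holds one entry per max_len batch, so decoding the entries in order reconstructs the full input string.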

Print Vocab

print(bpe.tokens)
