Byte Pair Encoding

Overview

Byte-Pair Encoding (BPE) is a subword tokenization method widely used in natural language processing. It builds a vocabulary by repeatedly merging the most frequent pair of adjacent symbols in a corpus. This Python implementation is inspired by the paper "Neural Machine Translation of Rare Words with Subword Units" (Sennrich, Haddow, and Birch) and guided by Lei Mao's educational tutorial.

Features

Tokenization: Efficient tokenization using Byte-Pair Encoding.
Vocabulary Management: Tools for managing and analyzing vocabulary.
Token Pair Frequency: Calculate token pair frequencies for subword units.
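
To illustrate the token pair frequency feature listed above, here is a minimal sketch of one BPE training step in the style of the Sennrich et al. paper. It is not this package's actual code: the function names `get_pair_counts` and `merge_pair`, the space-separated symbol format, and the `</w>` end-of-word marker are all assumptions made for illustration.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by each word's frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Return a new vocabulary with every occurrence of `pair` fused."""
    merged = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = " ".join(out)
        # Accumulate in case two words collapse to the same symbol sequence.
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus (hypothetical): words split into characters, with counts.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

pairs = get_pair_counts(vocab)
best = max(pairs, key=pairs.get)  # ties resolve to the first pair counted
vocab = merge_pair(best, vocab)
```

Repeating the count-and-merge step a fixed number of times yields the learned merge list, which is presumably what the --n_merges option in the command below controls.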

Getting Started

To get started with Byte-Pair Encoder, follow these simple steps:

Clone the Repository

git clone https://github.com/teleprint-me/byte-pair.git

Install Dependencies

virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run the Code

python -m byte_pair.encode --input_file samples/taming_shrew.md --output_file local/vocab.json --n_merges 5000

Usage

For the full list of options and usage instructions, run the built-in help:

python -m byte_pair.encode --help

Documentation

Detailed information on how to use and contribute to the project is available in the documentation.

Contributing

Contributions are welcome! If you have suggestions, bug reports, or improvements, please don't hesitate to submit issues or pull requests.

License

This project is licensed under the AGPL (GNU Affero General Public License). For detailed information, see the LICENSE file.

Acknowledgments

Special thanks to Lei Mao for the blog tutorial that inspired this implementation.

Additional Resources

Original Paper: A New Algorithm for Data Compression Optimization
Johns Hopkins Paper: A Formal Perspective on Byte-Pair Encoding
Amazon Research: A Statistical Extension of Byte-Pair Encoding
