[Paper] [Companion website] [🤗 TSD 20k] [🤗 REMI 20k]
Byte Pair Encoding (BPE) is a compression technique that reduces the sequence length of a corpus by iteratively replacing its most recurrent successions of bytes (or symbols) with newly created symbols. It is widely used in NLP, where it automatically builds vocabularies made of words or subwords.
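To make the merging process concrete, here is a minimal, self-contained sketch of BPE learning on a sequence of integer token ids. It is an illustration only, not the implementation used in the paper or in MidiTok:

```python
from collections import Counter

def bpe_learn(ids: list[int], num_merges: int) -> list[int]:
    """Iteratively replace the most frequent pair of symbols with a new symbol."""
    next_id = max(ids) + 1
    for _ in range(num_merges):
        # Count every adjacent pair of symbols in the sequence.
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        # Replace every occurrence of the most frequent pair with a new symbol.
        merged, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids, next_id = merged, next_id + 1
    return ids

# Example: the pair (1, 2) is the most frequent and gets merged first.
print(bpe_learn([1, 2, 3, 1, 2, 4, 1, 2], num_merges=2))
```

Each merge shortens the sequence and adds one symbol to the vocabulary; decoding simply expands the learned merges in reverse order.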
In this paper, we show that BPE can address two main concerns with the way symbolic music has previously been tokenized:
- The fairly long sequences resulting from using one token per note attribute (e.g. pitch, duration) and per time event. Long sequences are problematic because the time and space complexity of Transformer models grows quadratically with the input sequence length.
- The poor usage of the model's embedding space. Language models first project tokens into a learned embedding space, in which the embeddings (continuous representations of the tokens) are trained to capture their semantic information. This is an essential feature of such models, as it allows them to capture the meaning of the tokens and of the data. In symbolic music, tokens usually represent only note attribute values or time values, which carry little information beyond their absolute value. Moreover, vocabularies often contain between 200 and 500 tokens, which are then represented in 512 to 1024 dimensions. In such conditions, the embedding space is underused and the model's potential is poorly exploited.
Applied to symbolic music, BPE drastically reduces the sequence length while creating new tokens that can represent whole notes or even sequences of notes. The model's efficiency is greatly improved, and each token carries more information. This substantially improves the quality of generation, while speeding up inference by up to three times.
BPE is fully implemented within MidiTok, allowing you to easily benefit from this method on top of most existing tokenizations.
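As an example, here is a hedged sketch of learning BPE with MidiTok. The method names follow MidiTok ~2.x (where this step is called `learn_bpe`), and the `data/tokens` path and vocabulary size are illustrative assumptions; check the documentation of the version you install, as the API has changed across releases:

```python
from pathlib import Path
from miditok import REMI  # other MidiTok tokenizers supporting BPE work similarly

# Build a REMI tokenizer with its default configuration.
tokenizer = REMI()

# Learn BPE from a corpus of token files previously dumped by the tokenizer.
# `vocab_size` and the tokens location are illustrative values.
tokenizer.learn_bpe(
    vocab_size=20000,
    tokens_paths=list(Path("data/tokens").glob("**/*.json")),
)

# Save the tokenizer (vocabulary + learned BPE merges) for later reuse.
tokenizer.save_params(Path("tokenizer.json"))
```

The 20k vocabulary size mirrors the shared TSD 20k and REMI 20k models.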
We invite you to read the paper, and to check our companion website to listen to generated results!
Finally, the best models are shared on Hugging Face: TSD 20k and REMI 20k
1. `pip install -r requirements` to install requirements
2. Download the Maestro and MMD datasets and put them in `data/`
3. `python scripts/preprocess_maestro.py` and `python scripts/preprocess_for_octuple.py`
4. `python scripts/tokenize_datasets.py` to tokenize data and learn BPE
5. `python exp_generation.py` to train generative models and generate results
6. `python exp_pretrain.py` to pretrain classification models
7. `python exp_cla.py` to train classification models and test them
Scripts can be run to reproduce the analysis.
@inproceedings{bpe-symbolic-music,
title = "Byte Pair Encoding for Symbolic Music",
author = "Fradet, Nathan and
Gutowski, Nicolas and
Chhel, Fabien and
Briot, Jean-Pierre",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.123",
doi = "10.18653/v1/2023.emnlp-main.123",
pages = "2001--2020",
}