MIDI-LLM-tokenizer

Tools for converting .mid files into text for training large language models. The exact format of the intermediate text is not critical. The primary goals are to (1) create a "legible" text format that can be used easily with existing LLM tooling, and (2) encode MIDI data into a format that is efficient for training.

The expected workflow is:

  1. Convert MIDI datasets into jsonl files
  2. Tokenize the jsonl files into binidx format
  3. Train the LLM
  4. Sample from the LLM
  5. Convert the sampled text back into MIDI

For converting jsonl files to binidx format for training, see https://github.com/Abel2076/json2binidx_tool or https://github.com/EleutherAI/gpt-neox

Vocabulary

MIDI files contain a lot of data, and only some of it can reasonably be learned by a language model. Inspired by OpenAI's MuseNet and Oore et al., 2018, the vocabulary uses two main token types plus a few special tokens:

  • Wait tokens for timing (125 of them, representing real time)
  • Combined note+velocity+instrument tokens (128 notes * 16 quantized velocities * 16 binned instruments = 32768 tokens)
  • Pad/start/end special tokens

Notes and quantized velocities are encoded as hex, while instruments are encoded as the shortest unique string.
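As a minimal sketch of this scheme, the parser below is inferred from the example string shown later in this README (e.g. p:3c:f and t10); the helper and field names are illustrative rather than the repository's actual API, and treating the wait amount as decimal is an assumption:

def parse_token(tok: str) -> dict:
    """Illustrative parser for the token format inferred from this README."""
    if tok.startswith("<") and tok.endswith(">"):
        return {"type": "special", "name": tok}          # e.g. <start>, <end>
    if tok.startswith("t"):
        return {"type": "wait", "amount": int(tok[1:])}  # assumed decimal
    instrument, note_hex, velocity_hex = tok.split(":")
    return {
        "type": "note",
        "instrument": instrument,               # shortest unique instrument string
        "note": int(note_hex, 16),              # 0-127
        "velocity_bin": int(velocity_hex, 16),  # 0-15; 0 appears to release the note
    }

print(parse_token("p:3c:f"))  # note 0x3c = 60 (middle C), loudest velocity bin
print(parse_token("t10"))     # advance time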

We (knowingly) discard the following information:

  • Panning
  • Pitch bend
  • Modulation
  • Key signature
  • Time signature
  • Track names
  • Instrument names (we assume the standard GM instruments)

Simultaneous tokens (e.g. chords) are sorted by instrument, note, then MIDI event order. This reduces unnecessary randomness in the data.
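As a rough sketch of that ordering rule (the tuple layout and the lexicographic ordering of instrument strings are illustrative assumptions, not the repository's internal representation):

# Simultaneous events sort by instrument, then note, then original MIDI event order.
chord = [
    ("g", 0x40, 2),   # guitar-bin note, arrived third
    ("p", 0x3c, 0),   # piano-bin note, arrived first
    ("p", 0x40, 1),   # piano-bin note, arrived second
]
chord.sort(key=lambda e: (e[0], e[1], e[2]))
print(chord)  # [('g', 64, 2), ('p', 60, 0), ('p', 64, 1)]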

In the future, instrument order could be uniformly randomized to enable constrained sampling, where you provide a preexisting track and the model generates a melody over it.

Scripts

build_tokenizer.py

Builds a new Hugging Face tokenizer and vocabulary using vocab_config.json

python ./build_tokenizer.py
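If the script writes a standard tokenizer.json (an assumption; check the script's actual output path), the result can be loaded like any other fast tokenizer:

# Load the generated tokenizer with Hugging Face transformers.
# The file name "tokenizer.json" and the sample string are assumptions.
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
ids = tokenizer.encode("<start> p:3c:f t10 p:3c:0 <end>")
print(ids)
print(tokenizer.decode(ids))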

midi_to_jsonl.py

Converts a directory or archive of .mid/.midi files into a jsonl file of note sequences.

python ./midi_to_jsonl.py --path ~/lmd_full.tar.gz --output ~/lmd_full.jsonl --workers 4
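A quick way to spot-check the output; the field name inside each JSON record is an assumption here, so inspect a raw line first:

# Peek at the first few records of the generated jsonl file.
# The "text" field name is an assumption; fall back to printing the whole record.
import json

with open("lmd_full.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record.get("text", record))
        if i >= 2:
            break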

midi/text conversion

python ./midi_to_str.py test.mid
python ./str_to_midi.py "<start> p:3c:f t10 p:3e:f t10 p:40:f t64 p:3c:0 p:3e:0 p:40:0 <end>" --output test.mid
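In the example string, p:3c:f starts note 0x3c (60, middle C) at the loudest velocity bin on the "p" instrument (presumably the piano bin), each tN token advances time, and p:3c:0 releases the note, mirroring the MIDI convention that velocity 0 means note-off. To sanity-check the round trip, the generated file can be inspected with mido (an assumption: mido is a common MIDI library, not necessarily what these scripts use internally):

# Print the note events of the MIDI file produced by str_to_midi.py.
# Using mido here is an assumption about tooling, not the repo's own API.
import mido

mid = mido.MidiFile("test.mid")
for msg in mid:
    if msg.type in ("note_on", "note_off"):
        print(msg)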
