This repo contains PyTorch model definitions, pre-trained weights, and training/sampling code for the paper *Flux that Plays Music*. It explores a simple extension of diffusion-based rectified flow Transformers to text-to-music generation. The model architecture is shown below:
You can refer to the link to set up the running environment.
To launch latent-space training of the small version with `N` GPUs on one node using PyTorch DDP:
```bash
torchrun --nnodes=1 --nproc_per_node=N train.py \
  --version small \
  --data-path xxx \
  --global_batch_size 128
```
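Under the hood, `torchrun` launches `N` processes and sets the distributed environment variables for each. As a minimal sketch (illustrative, not the exact code in `train.py`), such a script typically initializes DDP like this:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model: torch.nn.Module) -> DDP:
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    # Wrap the model so gradients are synchronized across all N processes.
    return DDP(model, device_ids=[local_rank])
```

With this launch command, the global batch size of 128 is typically split evenly across the `N` processes.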
More scripts for different model sizes can be found in the `scripts` directory.
We include a `sample.py` script, which samples music clips from a FluxMusic model according to text conditions:
```bash
python sample.py \
  --version small \
  --ckpt_path /path/to/model \
  --prompt_file config/example.txt
```
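For reference, here is a minimal sketch of how such a prompt file could be read before sampling; it assumes one prompt per line, which this README does not guarantee:

```python
from pathlib import Path

def load_prompts(prompt_file: str) -> list[str]:
    # Read one prompt per line, skipping blank lines.
    lines = Path(prompt_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]

prompts = load_prompts("config/example.txt")
for prompt in prompts:
    print(prompt)  # each prompt would be passed to the sampler in turn
```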
All prompts used in the paper are listed in `config/example.txt`.
We use the VAE and vocoder from AudioLDM2, together with CLAP-L and T5-XXL. You can download them directly from the table below; we also provide the training scripts used in our experiments.
Note that some training runs were restarted due to machine malfunctions, so some scripts include resume options.
| Model | URL | Training scripts |
|---|---|---|
| VAE | link | - |
| Vocoder | link | - |
| T5-XXL | link | - |
| CLAP-L | link | - |
| FluxMusic-Small | link | link |
| FluxMusic-Base | link | link |
| FluxMusic-Large | link | link |
| FluxMusic-Giant | link | link |
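As a rough illustration of how the T5-XXL and CLAP text encoders listed above could be loaded with Hugging Face `transformers` (not the exact loading code in this repo; the checkpoint names below are assumptions):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel, ClapModel, ClapProcessor

# T5-XXL text encoder (checkpoint name is an assumption, not from this repo).
t5_tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
t5_encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

# CLAP encoder (checkpoint name is an assumption, not from this repo).
clap_processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
clap_model = ClapModel.from_pretrained("laion/clap-htsat-unfused")

prompt = "An upbeat electronic track with a driving bassline"
with torch.no_grad():
    # Per-token embeddings from T5 and a pooled text embedding from CLAP.
    t5_inputs = t5_tokenizer(prompt, return_tensors="pt", padding=True)
    t5_embeddings = t5_encoder(**t5_inputs).last_hidden_state
    clap_inputs = clap_processor(text=[prompt], return_tensors="pt", padding=True)
    clap_embedding = clap_model.get_text_features(**clap_inputs)
```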
For constructing the training data, refer to the `test.py` file, which shows a simple example of combining different datasets into a JSON file.
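As a minimal sketch of what such a combined JSON file could look like (the field names here are hypothetical, not the exact schema used by `test.py`):

```python
import json

def combine_datasets(datasets: dict[str, list[dict]], out_path: str) -> None:
    """Merge audio/caption pairs from several datasets into one JSON list."""
    combined = []
    for dataset_name, entries in datasets.items():
        for entry in entries:
            combined.append({
                "dataset": dataset_name,            # source dataset tag (hypothetical field)
                "audio_path": entry["audio_path"],  # path to the audio clip (hypothetical field)
                "caption": entry["caption"],        # text description (hypothetical field)
            })
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(combined, f, indent=2, ensure_ascii=False)

# Hypothetical usage with two datasets:
combine_datasets(
    {
        "dataset_a": [{"audio_path": "dataset_a/0001.wav", "caption": "calm piano"}],
        "dataset_b": [{"audio_path": "dataset_b/0042.wav", "caption": "lo-fi hip hop beat"}],
    },
    "train_data.json",
)
```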
Due to copyright restrictions, the data used in the paper needs to be downloaded by yourself. A quick download link can be found on Hugging Face :).
The codebase is based on the awesome Flux and AudioLDM2 repos.