This is a repository for generating short audio samples and training waveform diffusion models on a consumer-grade GPU with less than 2GB VRAM.
The purpose of this project is to provide access to stereo, high-resolution (44.1kHz) conditional and unconditional audio waveform (1D U-Net) diffusion code for those interested in exploration but who have limited resources. There are many methods for audio generation on modest hardware, but far fewer specifically for waveform-based diffusion.
The repository is built by heavily adapting code from Archinet's audio-diffusion-pytorch library. A huge thank you to Flavio Schneider for his incredible open-source work in this field!
Direct waveform diffusion is inherently computationally intensive. For example, audio at the industry-standard 44.1kHz sampling rate requires 44,100 samples for just 1 second of audio, and double that for a stereo file. However, it has a significant advantage over the many methods that reduce audio to spectrograms or downsample it: the network retains and learns from phase information. Phase is challenging to represent on its own in visual methods such as spectrograms, as it appears similar to random noise. Because of this, many generative methods discard phase information and then implement ways of estimating and regenerating it. However, phase plays a key role in defining the timbral qualities of sounds and should not be dispensed with so easily.
Waveform diffusion is able to retain this important feature as it does not perform any transforms on the audio before feeding it into the network. This is how humans perceive sounds, with both amplitude and phase information bundled together in a single signal. As mentioned previously, this comes at the expense of computational requirements and is often reserved for training on a cluster of GPUs with high speeds and lots of memory. Because of this, it is hard to begin to experiment with waveform diffusion with limited resources.
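As a quick illustration of this point (plain PyTorch, not code from this repo), reconstructing audio from an STFT magnitude alone fails badly, while keeping the phase recovers the signal almost exactly:

```python
import torch

sample_rate = 44100
waveform = torch.randn(2, sample_rate)  # 1 second of stereo "audio" (placeholder noise)

n_fft, hop = 1024, 256
window = torch.hann_window(n_fft)
spec = torch.stft(waveform, n_fft, hop_length=hop, window=window, return_complex=True)

magnitude = spec.abs()   # what spectrogram-based models typically see
phase = spec.angle()     # what is often discarded and later re-estimated

# Invert with the true phase vs. with the phase thrown away (set to zero)
recon_true = torch.istft(torch.polar(magnitude, phase), n_fft, hop_length=hop, window=window)
recon_zero = torch.istft(torch.polar(magnitude, torch.zeros_like(phase)), n_fft, hop_length=hop, window=window)

n = recon_true.shape[-1]
print((recon_true - waveform[..., :n]).abs().mean())  # ~0: phase kept, signal recovered
print((recon_zero - waveform[..., :n]).abs().mean())  # large: the waveform cannot be recovered from magnitude alone
```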
This repository seeks to offer some base code to those looking to experiment with and learn more about waveform diffusion on their own computer without having to purchase cloud resources or upgrade hardware. This goes not only for inference, but for training your own models as well!
To make this feasible, however, there must be tradeoffs between quality, speed, and sample length. Because of this, I have focused on training base models for one-shot drum samples, as they are inherently short.
The current configuration is set up to be able to train ~0.75 second stereo samples at 44.1kHz, allowing for the generation of high-quality one-shot audio samples. The network configuration can be adjusted to improve the resolution, sample rate, training and inference speed, sample length, etc. but, of course, more hardware resources will be required.
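For a rough sense of the numbers involved (the exact sample length is set in the model config; the power-of-two choice below is an assumption for illustration, not a value read from this repo's files):

```python
sample_rate = 44100
length = 2 ** 15                          # 32,768 samples; 1D U-Nets typically want a power of two
print(f"{length / sample_rate:.3f} s")    # ~0.743 s, i.e. the "~0.75 second" figure above
print(f"{2 * length * 4 / 1e6:.2f} MB")   # one stereo float32 training example is roughly 0.26 MB
```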
Other diffusion methods, such as diffusion in the latent space (Stable Diffusion's secret sauce), offer different tradeoffs between quality, memory requirements, and speed compared to this repo's raw waveform diffusion. To stay up to date with the latest research in generative audio, I recommend this repo: https://github.com/archinetai/audio-ai-timeline
Also recommended is Harmonai's community project, Dance Diffusion, which implements similar functionality to this repo on a larger scale with several pre-trained models. Colab notebook available.
April 2024 update:
Some additional useful generative audio tools/repos:
- Stable Audio Tools (used in Stable Audio) - Useful audio tools for building and training models.
- audiocraft (used in MusicGen & AudioGen) - Useful audio tools for building and training models.
- audiomentations - Good library for implementing audio augmentations on CPU for training. See torch-audiomentations for GPU implementation.
Follow these steps to set up an environment for both generating audio samples and training models.
NOTE: To use this repo with a GPU, you must have a CUDA-capable GPU and have the CUDA toolkit installed specific to your system (ex. Linux, x86_64, WSL-Ubuntu). More information can be found here.
Ensure that Anaconda (or Miniconda) is installed and activated. From the command line, `cd` into the `setup/` folder and run the following lines:

```bash
conda env create -f environment.yml
conda activate tiny-audio-diffusion
```
This will create and activate a conda environment from the `setup/environment.yml` file and install the dependencies in `setup/requirements.txt`.
Run the following line to create a Jupyter kernel for the current environment, which is needed to run the inference notebook:

```bash
python -m ipykernel install --user --name tiny-audio-diffusion --display-name "tiny-audio-diffusion (Python 3.10)"
```
Rename `.env.tmp` to `.env` and replace the entries with your own variables (example values are random):
```
DIR_LOGS=/logs
DIR_DATA=/data

# Required if using Weights & Biases (W&B) logger
WANDB_PROJECT=tiny_drum_diffusion # Custom W&B name for current project
WANDB_ENTITY=johnsmith # W&B username
WANDB_API_KEY=a21dzbqlybbzccqla4txa21dzbqlybbzccqla4tx # W&B API key
```
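To confirm the variables resolve before training, a quick check can be run from the repo root. This sketch assumes the python-dotenv package (a common way such `.env` files are read) and is an illustration, not part of the repo's scripts:

```python
import os
from dotenv import load_dotenv  # python-dotenv (assumed to be available in the environment)

load_dotenv()  # reads .env from the current working directory
for key in ("DIR_LOGS", "DIR_DATA", "WANDB_PROJECT", "WANDB_ENTITY"):
    print(key, "=", os.getenv(key))
```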
NOTE: Sign up for a Weights & Biases account to log audio samples, spectrograms, and other metrics while training (it's free!).
W&B logging example for this repo here.
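For reference, logging audio to W&B looks roughly like the sketch below. This is a generic illustration, not the repo's logging code; the project name simply reuses the `.env` example above:

```python
import numpy as np
import wandb

wandb.init(project="tiny_drum_diffusion")  # WANDB_ENTITY / WANDB_API_KEY are picked up from the environment
fake_audio = 0.1 * np.random.randn(44100).astype(np.float32)  # stand-in for a generated sample
wandb.log({"sample": wandb.Audio(fake_audio, sample_rate=44100, caption="generated sample")})
wandb.finish()
```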
Pretrained models can be found on Hugging Face (each model contains a `.ckpt` and `.yaml` file):
| Model | Link |
|---|---|
| Kicks | crlandsc/tiny-audio-diffusion-kicks |
| Snares | crlandsc/tiny-audio-diffusion-snares |
| Hi-hats | crlandsc/tiny-audio-diffusion-hihats |
| Percussion (all drum types) | crlandsc/tiny-audio-diffusion-percussion |
See W&B model training metrics here.
Pre-trained models can be downloaded to generate samples via the inference notebook. They can also be used as a base model to fine-tune on custom data. It is recommended to create subfolders within the `saved_models` folder to store each model's `.ckpt` and `.yaml` files.
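If you prefer to script the download rather than use the website, the huggingface_hub package can pull a full model repo into a subfolder (the folder name below is just a suggestion following the recommendation above):

```python
from huggingface_hub import snapshot_download

# Downloads the .ckpt and .yaml files for the kicks model into saved_models/kicks
snapshot_download(
    repo_id="crlandsc/tiny-audio-diffusion-kicks",
    local_dir="saved_models/kicks",
)
```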
Generate samples without code on 🤗 Hugging Face Spaces!
Current Capabilities:
- Unconditional Generation
- Conditional "Style-transfer" Generation
Open `Inference.ipynb` in Jupyter Notebook and follow the instructions to generate new audio samples. Ensure that the `tiny-audio-diffusion (Python 3.10)` kernel is active in Jupyter and that you have downloaded the pre-trained model of interest from Hugging Face.
The model architecture has been constructed with the PyTorch Lightning and Hydra frameworks. All configurations for the model are contained within `.yaml` files and should be edited there rather than hardcoded. `exp/drum_diffusion.yaml` contains the default model configuration. Additional custom model configurations can be added to the `exp` folder.
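To inspect a configuration before training, the `.yaml` files can be loaded directly with OmegaConf (which Hydra uses under the hood). This is just a convenience sketch, not part of the repo's workflow:

```python
from omegaconf import OmegaConf

cfg = OmegaConf.load("exp/drum_diffusion.yaml")  # default model configuration
print(OmegaConf.to_yaml(cfg))  # keys here correspond to the dot-path overrides used on the CLI below
```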
Custom models can be trained or fine-tuned on custom datasets. Datasets should consist of a folder of `.wav` audio files with a 44.1kHz sampling rate.
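A small sanity check on a dataset folder can catch files with the wrong sampling rate before training. The sketch below uses torchaudio and resamples anything that is not 44.1kHz in place (the folder path is illustrative):

```python
from pathlib import Path
import torchaudio

TARGET_SR = 44100

for wav_path in Path("data/wav_dataset").glob("*.wav"):  # adjust to your dataset folder
    waveform, sr = torchaudio.load(str(wav_path))
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)
        torchaudio.save(str(wav_path), waveform, TARGET_SR)
        print(f"resampled {wav_path.name} from {sr} Hz to {TARGET_SR} Hz")
    if waveform.shape[0] == 1:
        print(f"{wav_path.name} is mono; how mono files are handled depends on the dataloader")
```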
To train or fine-tune models, run one of the following commands in the terminal from the repo's root folder, replacing `<path/to/your/train/data>` with the path to your custom training set.
Train a model from scratch on CPU (not recommended):

```bash
python train.py exp=drum_diffusion datamodule.dataset.path=<path/to/your/train/data>
```
Train a model from scratch on GPU:

```bash
python train.py exp=drum_diffusion trainer.gpus=1 datamodule.dataset.path=<path/to/your/train/data>
```
NOTE: To use this repo with a GPU, you must have a CUDA-capable GPU and have the CUDA toolkit installed specific to your system (ex. Linux, x86_64, WSL-Ubuntu). More information can be found here.
Resume a run from a checkpoint (with GPU):

```bash
python train.py exp=drum_diffusion trainer.gpus=1 +ckpt=</path/to/checkpoint.ckpt> datamodule.dataset.path=<path/to/your/train/data>
```
The data used to train the checkpoints listed above can be found on 🤗 Hugging Face.
Note: This is a small and unbalanced dataset consisting of free samples that I had from my music production. These samples are not covered under the MIT license of this repository and cannot be used to train any commercial models, but can be used in personal and research contexts.
Note: For appropriately diverse models, larger datasets should be used to avoid memorization of training data.
The structure of this repository is as follows:
```
├── main
│   ├── diffusion_module.py - contains pl model, data loading, and logging functionalities for training
│   └── utils.py            - contains utility functions for training
├── exp
│   └── *.yaml              - Hydra configuration files
├── setup
│   ├── environment.yml     - file to set up conda environment
│   └── requirements.txt    - contains repo dependencies
├── images                  - directory containing images for README.md
│   └── *.png
├── samples                 - directory containing sample outputs from tiny-audio-diffusion models
│   └── *.wav
├── .env.tmp                - temporary environment variables (rename to .env)
├── .gitignore
├── README.md
├── Inference.ipynb         - Jupyter notebook for running inference to generate new samples
├── config.yaml             - Hydra base configs
├── train.py                - script for training
├── data                    - directory to host custom training data
│   └── wav_dataset
│       └── (*.wav)
└── saved_models            - directory to host model checkpoints and hyper-parameters for inference
    └── (kicks/snare/etc.)
        ├── (*.ckpt)        - pl model checkpoint file
        └── (config.yaml)   - pl model hydra hyperparameters (required for inference)
```