Paper | Model | Website and Examples | Audio Manipulation Notebooks | Hugging Face Models | Google Colab
Auffusion is a latent diffusion model (LDM) for text-to-audio (TTA) generation. It generates realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. Auffusion adapts text-to-image (T2I) model frameworks to the TTA task, leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.
- 2024/01/02: 📣 Colab notebooks for audio manipulation are released. Feel free to try!
- 2023/12/31: 📣 Auffusion is released. The demo website and 3 models are available on Hugging Face.
Model Name | Model Path |
---|---|
Auffusion | https://huggingface.co/auffusion/auffusion |
Auffusion-Full | https://huggingface.co/auffusion/auffusion-full |
Auffusion-Full-no-adapter | https://huggingface.co/auffusion/auffusion-full-no-adapter |
Our code is built on PyTorch 2.0.1. The requirements file pins torch==2.0.1, but you might need to install a specific CUDA build of torch depending on your GPU device type. We also depend on diffusers==0.18.2.
Clone the repository and install the packages listed in requirements.txt:
git clone https://github.com/happylittlecat2333/Auffusion/
cd Auffusion
pip install -r requirements.txt
You might also need to install libsndfile1 for soundfile to work properly on Linux:
(sudo) apt-get install libsndfile1
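After installation, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch rather than part of the repository: `check_pins` and `EXPECTED` are hypothetical names, and the pins are copied from the text above. It relies only on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Pins quoted from the text above; adjust if your requirements.txt differs.
EXPECTED = {"torch": "2.0.1", "diffusers": "0.18.2"}


def check_pins(expected=EXPECTED, get_version=metadata.version):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # CUDA builds of torch report versions such as "2.0.1+cu118",
        # so compare only the part before the local "+..." suffix.
        base = installed.split("+")[0] if installed else None
        report[name] = (wanted, installed, base == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_pins().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected {wanted}, found {installed} [{status}]")
```

A mismatch is not necessarily an error, but it is a reasonable first thing to check if the pipeline fails to load.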
Download the Auffusion model and generate audio from a text prompt:
import IPython, torch
import soundfile as sf
from auffusion_pipeline import AuffusionPipeline
pipeline = AuffusionPipeline.from_pretrained("auffusion/auffusion")
prompt = "Birds singing sweetly in a blooming garden"
output = pipeline(prompt=prompt)
audio = output.audios[0]
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
The auffusion model will be automatically downloaded from Hugging Face and saved in cache. Subsequent runs will load the model directly from cache.
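The quick-start example above names the output file after the prompt. That works for simple prompts, but a prompt containing characters such as `/`, `:` or `?` may produce an invalid path on some systems. A small helper along these lines could be used instead; `prompt_to_filename` is a hypothetical name and not part of the Auffusion code.

```python
import re


def prompt_to_filename(prompt, max_len=80, ext=".wav"):
    """Turn a free-form prompt into a filesystem-friendly file name."""
    # Collapse every run of characters other than letters, digits,
    # '-' and '_' into a single underscore.
    slug = re.sub(r"[^A-Za-z0-9_-]+", "_", prompt).strip("_")
    # Cap the length, and fall back to a fixed name if nothing is left.
    slug = slug[:max_len].rstrip("_") or "audio"
    return slug + ext
```

With this helper, the save step becomes `sf.write(prompt_to_filename(prompt), audio, samplerate=16000)`.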
By default, the pipeline samples from the latent diffusion model with 100 inference steps and a guidance_scale of 7.5. You can vary these parameters to get different results.
prompt = "Rolling thunder with lightning strikes"
output = pipeline(prompt=prompt, num_inference_steps=100, guidance_scale=7.5)
audio = output.audios[0]
IPython.display.Audio(data=audio, rate=16000)
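To compare settings side by side, the call above can be wrapped in a small loop. This is a sketch under stated assumptions: `sweep` is a hypothetical helper, the step counts and guidance scales are arbitrary illustrative values, and the only pipeline arguments used are the ones already shown in the example above.

```python
from itertools import product


def sweep(pipeline, prompt, steps=(50, 100), scales=(3.0, 7.5)):
    """Run `pipeline` once per (steps, guidance_scale) pair.

    `pipeline` is any callable that accepts the keyword arguments used in
    the example above and returns an object with an `.audios` sequence.
    """
    results = {}
    for n_steps, scale in product(steps, scales):
        output = pipeline(
            prompt=prompt,
            num_inference_steps=n_steps,
            guidance_scale=scale,
        )
        # Keep the first generated clip for each setting.
        results[(n_steps, scale)] = output.audios[0]
    return results
```

Calling `sweep(pipeline, "Rolling thunder with lightning strikes")` returns a dictionary keyed by `(num_inference_steps, guidance_scale)`, so each clip can be played or saved individually.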
More generated samples are shown here. You can also try out the colab notebook to generate your own audio samples.
To generate audio for the AudioCaps test set using our Hugging Face checkpoints:
python inference.py \
--pretrained_model_name_or_path="auffusion/auffusion" \
--test_data_dir="./data/test_audiocaps.raw.json" \
--output_dir="./output/auffusion_hf" \
--enable_xformers_memory_efficient_attention
We use the evaluation tools from https://github.com/haoheliu/audioldm_eval to evaluate our models, and we adopt https://huggingface.co/laion/clap-htsat-unfused to compute the CLAP score.
Some data instances originally released in AudioCaps have since been removed from YouTube and are no longer available. We therefore evaluated our models on all instances that were available as of June 2023.
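For readers unfamiliar with the CLAP score mentioned above, it is commonly computed as the cosine similarity between a CLAP audio embedding and the CLAP text embedding of the caption, averaged over the test set. The sketch below illustrates only that scoring step, assuming the embeddings have already been extracted; the function names are hypothetical, and the exact scoring script used for the reported results is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_clap_score(audio_embeddings, text_embeddings):
    """Average cosine similarity over paired (audio, caption) embeddings."""
    if not audio_embeddings or len(audio_embeddings) != len(text_embeddings):
        raise ValueError("need equally sized, non-empty embedding lists")
    scores = [
        cosine_similarity(a, t)
        for a, t in zip(audio_embeddings, text_embeddings)
    ]
    return sum(scores) / len(scores)
```

Higher values indicate that the generated audio is closer to its caption in the CLAP embedding space.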
We show some examples of audio manipulation using Auffusion. Current audio manipulation methods include:
- Text-to-audio generation: notebook or colab
- Text-guided style transfer: notebook or colab
- Audio inpainting: notebook or colab
- Attention-based word swap control: notebook or colab
- Attention-based reweight control: notebook or colab
The audio manipulation code examples can all be found in notebooks.
- Publish demo website and arxiv link.
- Publish Auffusion and Auffusion-Full checkpoints.
- Add text-guided style transfer.
- Add audio-to-audio generation.
- Add audio inpainting.
- Add word_swap and reweight prompt2prompt-based control.
- Add audio super-resolution.
- Build Gradio web application.
- Add audio-to-audio, inpainting into Gradio web application.
- Add style-transfer into Gradio web application.
- Add audio super-resolution into Gradio web application.
- Add prompt2prompt-based control into Gradio web application.
- Add data preprocess and training code.
Please consider citing the following article if you find our work useful:
@article{xue2024auffusion,
title={Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation},
author={Jinlong Xue and Yayue Deng and Yingming Gao and Ya Li},
journal={arXiv preprint arXiv:2401.01044},
year={2024}
}
Part of the code is borrowed from the following repos. We would like to thank the authors of these repos for their contribution.
- https://github.com/huggingface/diffusers
- https://github.com/huggingface/transformers
- https://github.com/google/prompt-to-prompt
- https://github.com/riffusion/riffusion
- https://github.com/haoheliu/audioldm_eval
If you have any problems regarding the paper, code, models, or the project itself, please feel free to open an issue or contact Jinlong Xue directly :)