
Make-An-Audio 3: Transforming Text/Video into Audio via Flow-based Large Diffusion Transformers

PyTorch Implementation of Lumina-t2x, Lumina-Next

We will open-source our implementation and pre-trained models in this repository soon.



Install dependencies

Note: You may want to adjust the CUDA version according to your driver version.

conda create -n Make_An_Audio_3 -y
conda activate Make_An_Audio_3
conda install python=3.11 pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Optionally, install NVIDIA apex (https://github.com/nvidia/apex).
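To confirm the environment was set up correctly, a minimal sanity check (the expected versions match the install commands above; flash-attn is checked separately since it is installed in its own step):

# check_env.py -- minimal sanity check for the conda environment above
import torch

print(f"PyTorch: {torch.__version__}")            # expect 2.1.0
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")      # expect 12.1

try:
    import flash_attn  # installed via `pip install flash-attn`
    print(f"flash-attn: {flash_attn.__version__}")
except ImportError:
    print("flash-attn not importable; re-check the pip install step")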

Quick Start

Pretrained Models

Download the pretrained weights from Hugging Face (sizes range from 160M to 3B parameters):

| Model     | Config                  | Pretraining Data | Path  |
|-----------|-------------------------|------------------|-------|
| M (160M)  | txt2audio-cfm-cfg       | AudioCaption     | [TBD] |
| L (520M)  | /                       | AudioCaption     | [TBD] |
| XL (750M) | txt2audio-cfm-cfg-XL    | AudioCaption     | Here  |
| 3B        | /                       | AudioCaption     | [TBD] |
| M (160M)  | txt2music-cfm-cfg       | Music            | Here  |
| L (520M)  | /                       | Music            | [TBD] |
| XL (750M) | /                       | Music            | [TBD] |
| 3B        | /                       | Music            | [TBD] |
| M (160M)  | video2audio-cfm-cfg-moe | VGGSound         | Here  |
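For scripted downloads, a sketch using the huggingface_hub client; the repo id and filename below are placeholders, so substitute the actual Hugging Face paths linked in the table:

# download_weights.py -- sketch only; repo_id and filename are placeholders
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="ORG/Make-An-Audio-3",      # placeholder: the repo linked above
    filename="checkpoints_last.ckpt",   # checkpoint name used by the scripts below
)
print(ckpt_path)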

Generate audio/music from text

python3 scripts/txt2audio_for_2cap_flow.py --prompt {TEXT} \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/txt2audio-cfm-cfg.yaml \
  --scale 3.0 --vocoder-ckpt useful_ckpts/bigvnat
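Here --scale is the classifier-free guidance scale. Conceptually, each sampling step extrapolates from the unconditional prediction toward the text-conditional one; a minimal sketch of the idea (not the repository's exact sampler code):

import torch

def cfg_prediction(model, x_t, t, cond, uncond, scale=3.0):
    """Classifier-free guidance: push the model output away from the
    unconditional prediction toward the conditional one by `scale`."""
    v_cond = model(x_t, t, cond)      # hypothetical model signature
    v_uncond = model(x_t, t, uncond)
    return v_uncond + scale * (v_cond - v_uncond)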

Add --test-dataset structure for text-to-audio generation on the corresponding test set.

Generate audio/music from the AudioCaps or MusiCaps test dataset

  • Remember to set config["test_dataset"] in the config file first (see the sketch after the command).

python3 scripts/txt2audio_for_2cap_flow.py \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/txt2audio-cfm-cfg.yaml \
  --scale 3.0 --vocoder-ckpt useful_ckpts/bigvnat --test-dataset testset
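A sketch of the config["test_dataset"] edit mentioned above, assuming the YAML config loads with OmegaConf; the key name follows the note above, but the exact nesting in the shipped configs may differ:

from omegaconf import OmegaConf

config = OmegaConf.load("configs/txt2audio-cfm-cfg.yaml")
config["test_dataset"] = "testset"   # key per the note above; nesting may differ
# Save to a copy and pass it to the script with -b.
OmegaConf.save(config, "configs/txt2audio-cfm-cfg-testset.yaml")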

Generate audio from video

python3 scripts/video2audio_flow.py \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/video2audio-cfm-cfg-moe.yaml \
  --scale 3.0 --vocoder-ckpt useful_ckpts/bigvnat --test-dataset vggsound

Train flow-matching DiT

After training the VAE, replace model.params.first_stage_config.params.ckpt_path in the config file with the path to your trained VAE checkpoint (a sketch of this edit follows). Then run the following command to train the diffusion model:
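A sketch of that config edit with OmegaConf; the checkpoint path below is a placeholder for your own VAE checkpoint:

from omegaconf import OmegaConf

config = OmegaConf.load("configs/txt2audio-cfm-cfg.yaml")
OmegaConf.update(
    config,
    "model.params.first_stage_config.params.ckpt_path",
    "logs/vae_run/checkpoints/last.ckpt",  # placeholder: your trained VAE checkpoint
)
OmegaConf.save(config, "configs/txt2audio-cfm-cfg.yaml")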

python main.py --base configs/txt2audio-cfm-cfg.yaml -t  --gpus 0,1,2,3,4,5,6,7
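For intuition, the "cfm" in the config names refers to conditional flow matching. A minimal sketch of the standard flow-matching training objective with a linear interpolation path (not the repository's exact loss code; the model signature is an assumption):

import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """x1: clean latents from the VAE; cond: conditioning embedding."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over latent dims
    x_t = (1 - t_) * x0 + t_ * x1                  # linear probability path
    v_target = x1 - x0                             # constant target velocity
    v_pred = model(x_t, t, cond)                   # hypothetical model signature
    return F.mse_loss(v_pred, v_target)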

Others

For data preparation, variational autoencoder training, and evaluation, please refer to Make-An-Audio.

Acknowledgements

This implementation uses parts of the code from the following GitHub repositories: Make-An-Audio, AudioLCM, and CLAP, as noted in our code.

Citations

If you find this code useful in your research, please consider citing:

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this provision may violate copyright laws.
