This repository is the MindSpore implementation of AnimateDiff.
- Text-to-video generation with AnimateDiff v2, supporting 16 frames @ 512x512 resolution on Ascend 910*
- MotionLoRA inference
- Motion Module Training
- Motion LoRA Training
- AnimateDiff v3 Inference
Supported MindSpore and Ascend software versions:

mindspore | ascend driver | firmware | cann toolkit/kernel |
---|---|---|---|
2.3.1 | 24.1.RC2 | 7.3.0.1.231 | 8.0.RC2.beta1 |
2.2.10 | 23.0.3 | 7.1.0.5.220 | 7.0.0.beta1 |
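As an optional sanity check that MindSpore is installed correctly and can reach the Ascend device, MindSpore's built-in installation check can be run:

python -c "import mindspore; mindspore.run_check()"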
To install other dependent packages:
pip install -r requirements.txt
If the decord package is not available on your platform, try pip install eva-decord instead.
For EulerOS, install ffmpeg and decord from source as follows.
1. Install ffmpeg 4, referring to https://ffmpeg.org/releases
wget https://ffmpeg.org/releases/ffmpeg-4.0.1.tar.bz2 --no-check-certificate
tar -xvf ffmpeg-4.0.1.tar.bz2
mv ffmpeg-4.0.1 ffmpeg
cd ffmpeg
./configure --enable-shared # --enable-shared is needed for sharing libavcodec with decord
make -j 64
make install
2. Install decord, referring to https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source
git clone --recursive https://github.com/dmlc/decord
cd decord
rm -rf build && mkdir build && cd build
cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
make -j 64
make install
cd ../python
python3 setup.py install --user
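A quick import check can confirm that decord was built and installed correctly; it should print the installed version:

python3 -c "import decord; print(decord.__version__)"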
Please download the following weights from HuggingFace.
The full tree of expected checkpoints is shown below:
models
├── domain_adapter_lora
│ └── v3_sd15_adapter.ckpt
├── dreambooth_lora
│ ├── realisticVisionV51_v51VAE.ckpt
│ └── toonyou_beta3.ckpt
├── motion_lora
│ └── v2_lora_ZoomIn.ckpt
├── motion_module
│ ├── mm_sd_v15.ckpt
│ ├── mm_sd_v15_v2.ckpt
│ └── v3_sd15_mm.ckpt
├── sparsectrl_encoder
│ ├── v3_sd15_sparsectrl_rgb.ckpt
│ └── v3_sd15_sparsectrl_scribble.ckpt
└── stable_diffusion
└── sd_v1.5-d0ab7146.ckpt
Then, put all the weights under animatediff/models/torch_ckpts/
and convert them by running the following command.
sh scripts/convert_weights.sh
# download demo images
bash scripts/download_demo_images.sh
# under general T2V setting
python text_to_video.py --config configs/prompts/v3/v3-1-T2V.yaml
# image animation (on RealisticVision)
python text_to_video.py --config configs/prompts/v3/v3-2-animation-RealisticVision.yaml
# sketch-to-animation and storyboarding (on RealisticVision)
python text_to_video.py --config configs/prompts/v3/v3-3-sketch-RealisticVision.yaml
Results:
[Result galleries: input image / animation pairs for image animation on RealisticVision, and input scribble(s) / output pairs for sketch-to-animation and storyboarding.]
The script uses DDIM sampling by default:
python text_to_video.py --config configs/prompts/v2/1-ToonYou.yaml --L 16 --H 512 --W 512
Results:
The script uses DDIM sampling by default:
python text_to_video.py --config configs/prompts/v2/1-ToonYou-MotionLoRA.yaml --L 16 --H 512 --W 512
Results using the Zoom-In Motion LoRA:
python train.py --config configs/training/image_finetune.yaml
Please set
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
before running the training script if using MindSpore 2.2.10.
Infer with the trained model by running:
python text_to_video.py --config configs/prompts/v2/base_video.yaml \
--pretrained_model_path {path to saved checkpoint} \
--prompt {text prompt}
python train.py --config configs/training/mmv2_train.yaml
Please set
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
before running the training script if using MindSpore 2.2.10.
You may change arguments such as the data path, output directory, and learning rate in the yaml config file. You can also override them with command-line arguments; see args_train.py or run python train.py --help
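For example, a command-line override might look like the following (flag names other than --config are illustrative assumptions here; confirm the exact names with python train.py --help):

python train.py --config configs/training/mmv2_train.yaml --output_path outputs/mmv2_train --start_learning_rate 1e-4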
Min-SNR weighting can improve diffusion training convergence. Enable it by appending --snr_gamma=5.0
to the training command.
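For reference, Min-SNR weighting rescales the per-timestep diffusion loss by min(SNR(t), gamma) / SNR(t), where SNR(t) = alpha_bar_t / (1 - alpha_bar_t). A minimal sketch of the weighting term, assuming epsilon-prediction (this illustrates the general technique, not necessarily this repository's exact implementation):

import numpy as np

def min_snr_weight(alphas_cumprod, t, gamma=5.0):
    # SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for the sampled timestep t
    snr = alphas_cumprod[t] / (1.0 - alphas_cumprod[t])
    # clip high-SNR (low-noise) timesteps so they do not dominate the loss
    return np.minimum(snr, gamma) / snr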
Infer with the trained model by running:
python text_to_video.py --config configs/prompts/v2/base_video.yaml \
--motion_module_path {path to saved checkpoint} \
--prompt {text prompt}
You can also create a new config yaml, based on configs/prompts/v2/base_video.yaml, to specify the prompts to test and the motion module path.
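For example, copy the base config and point the inference script at the new file (the keys inside the yaml define the prompts and checkpoint paths; keep the key names used in base_video.yaml):

cp configs/prompts/v2/base_video.yaml configs/prompts/v2/my_video.yaml
python text_to_video.py --config configs/prompts/v2/my_video.yaml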
Here are some generation results after MM training on 512x512 resolution and 16-frame data.
python train.py --config configs/training/mmv2_lora.yaml
Please set
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
before running the training script if using MindSpore 2.2.10.
Infer with the trained model by running:
python text_to_video.py --config configs/prompts/v2/base_video.yaml \
--motion_lora_path {path to saved checkpoint} \
--prompt {text prompt}
Here are some generation results after Motion LoRA fine-tuning on 512x512 resolution and 16-frame data.
Experiments are tested on Ascend 910* in graph mode.
- mindspore 2.3.1
model name | cards | resolution | scheduler | steps | s/step | s/video |
---|---|---|---|---|---|---|
AnimateDiff v2 | 1 | 512x512x16 | DDIM | 30 | 0.60 | 18.00 |
- mindspore 2.2.10
model name | cards | resolution | scheduler | steps | s/step | s/video |
---|---|---|---|---|---|---|
AnimateDiff v2 | 1 | 512x512x16 | DDIM | 30 | 1.20 | 25.00 |
- mindspore 2.3.1
method | cards | batch size | resolution | flash attn | jit level | s/step | img/s |
---|---|---|---|---|---|---|---|
MM training | 1 | 1 | 16x512x512 | ON | O0 | 1.320 | 0.75 |
Motion Lora | 1 | 1 | 16x512x512 | ON | O0 | 1.566 | 0.64 |
MM training w/ Embed. cached | 1 | 1 | 16x512x512 | ON | O0 | 1.004 | 0.99 |
Motion Lora w/ Embed. cached | 1 | 1 | 16x512x512 | ON | O0 | 1.009 | 0.99 |
- mindspore 2.2.10
method | cards | batch size | resolution | flash attn | jit level | s/step | img/s |
---|---|---|---|---|---|---|---|
MM training | 1 | 1 | 16x512x512 | OFF | N/A | 1.29 | 0.78 |
Motion Lora | 1 | 1 | 16x512x512 | OFF | N/A | 1.26 | 0.79 |
MM training w/ Embed. cached | 1 | 1 | 16x512x512 | ON | N/A | 0.75 | 1.33 |
Motion Lora w/ Embed. cached | 1 | 1 | 16x512x512 | ON | N/A | 0.71 | 1.49 |
MM training: Motion Module training.
Embed. cached: The video embedding (VAE-encoder outputs) and text embedding are pre-computed and stored before diffusion training.
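A minimal sketch of the embedding-caching idea (the function and file names below are illustrative, not this repository's API): run the frozen VAE encoder and text encoder over the dataset once, store the outputs, and let the diffusion training loop read the cached tensors instead of re-encoding every step.

import numpy as np

def cache_embeddings(dataset, vae_encode, text_encode, out_dir):
    # dataset yields (clip_id, frames, caption); vae_encode / text_encode wrap the frozen encoders
    for clip_id, frames, caption in dataset:
        video_latent = vae_encode(frames)   # per-frame latents, computed once
        text_emb = text_encode(caption)     # caption embedding, computed once
        np.savez(f"{out_dir}/{clip_id}.npz", video_latent=video_latent, text_emb=text_emb)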