Merge pull request #642 from THUDM/CogVideoX_dev
New Lora 20250108
Showing 36 changed files with 804 additions and 627 deletions.

```diff
@@ -22,3 +22,4 @@ venv
 **/results
 **/*.mp4
 **/validation_set
+CogVideo-1.0
```

# CogVideoX Diffusers Fine-tuning Guide

[Read in Chinese](./README_zh.md)

[Read in Japanese](./README_ja.md)

If you're looking for the fine-tuning instructions for the SAT version, please check [here](../sat/README_zh.md). Note that the SAT version uses a dataset format different from the one described here.

## Hardware Requirements

| Model               | Training Type  | Mixed Precision | Training Resolution (frames x height x width) | Hardware Requirements   |
|---------------------|----------------|-----------------|-----------------------------------------------|-------------------------|
| cogvideox-t2v-2b    | lora (rank128) | fp16            | 49x480x720                                    | 16GB VRAM (NVIDIA 4080) |
| cogvideox-t2v-5b    | lora (rank128) | bf16            | 49x480x720                                    | 24GB VRAM (NVIDIA 4090) |
| cogvideox-i2v-5b    | lora (rank128) | bf16            | 49x480x720                                    | 24GB VRAM (NVIDIA 4090) |
| cogvideox1.5-t2v-5b | lora (rank128) | bf16            | 81x768x1360                                   | 35GB VRAM (NVIDIA A100) |
| cogvideox1.5-i2v-5b | lora (rank128) | bf16            | 81x768x1360                                   | 35GB VRAM (NVIDIA A100) |

## Install Dependencies

Since the relevant code has not yet been merged into an official `diffusers` release, you need to fine-tune against the `diffusers` main branch. Follow the steps below to install the dependencies:

```shell
git clone https://github.com/huggingface/diffusers.git
cd diffusers  # Now on the main branch
pip install -e .
```
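
To confirm that the source install is the one being imported, you can print the version string. Builds from the main branch typically carry a `.dev` suffix; the exact version number will differ from the one shown in the comment:

```python
# Quick check that diffusers is importable and comes from the source install.
import diffusers

print(diffusers.__version__)  # builds from main typically end in ".dev0"
```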

## Prepare the Dataset

First, you need to prepare your dataset. Depending on your task type (T2V or I2V), the dataset format varies slightly:

```
.
├── prompts.txt
├── videos
├── videos.txt
├── images      # (Optional) For I2V; if not provided, the first frame of each video is used as the reference image
└── images.txt  # (Optional) For I2V; if not provided, the first frame of each video is used as the reference image
```

Where:
- `prompts.txt`: Contains the prompts
- `videos/`: Contains the .mp4 video files
- `videos.txt`: Contains the list of video files in the `videos/` directory
- `images/`: (Optional) Contains the .png reference image files
- `images.txt`: (Optional) Contains the list of reference image files

You can download the sample T2V dataset [Disney Steamboat Willie](https://huggingface.co/datasets/Wild-Heart/Disney-VideoGeneration-Dataset) to use as a test dataset for fine-tuning.
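
If your own clips and captions are not yet arranged in this layout, the manifest files can be generated with a few lines of Python. The following is only a rough sketch: the dataset root is a placeholder, entries are written as bare filenames, and you should verify against the training script how paths in `videos.txt` are resolved.

```python
# Sketch: generate videos.txt from the videos/ directory described above.
# DATASET_ROOT is a placeholder; whether entries need a "videos/" prefix
# depends on how the training script resolves paths, so verify against it.
from pathlib import Path

DATASET_ROOT = Path("path/to/dataset")

video_files = sorted(p.name for p in (DATASET_ROOT / "videos").glob("*.mp4"))
(DATASET_ROOT / "videos.txt").write_text("\n".join(video_files) + "\n")

# prompts.txt must pair line-for-line with videos.txt.
num_prompts = len((DATASET_ROOT / "prompts.txt").read_text().splitlines())
assert num_prompts == len(video_files), "each video needs exactly one prompt"
```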

If you need to use a validation dataset during training, make sure to provide it in the same format as the training dataset.

## Run the Script to Start Fine-tuning

Before starting training, please note the following resolution requirements:

1. The number of frames must be a multiple of 8 **plus 1** (i.e., 8N + 1), such as 49 or 81.
2. The recommended video resolutions are:
    - CogVideoX: 480x720 (height x width)
    - CogVideoX1.5: 768x1360 (height x width)
3. Samples (videos or images) that do not match the required resolution will be resized automatically by the code. This can distort the aspect ratio and hurt training results, so we recommend preprocessing the samples yourself (e.g., crop + resize to preserve the aspect ratio) before training; one possible approach is sketched below.
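
To illustrate point 3, here is one way such preprocessing could look. This is a minimal sketch rather than part of the repository: it assumes `opencv-python` is installed, targets the CogVideoX resolution of 480x720, and writes at 8 fps; adjust these to your model and data.

```python
# Sketch: trim a clip to an 8N+1 frame count, then center-crop + resize
# to 480x720 without distorting the aspect ratio. Assumes opencv-python.
import math
import cv2

def preprocess(src: str, dst: str, height: int = 480, width: int = 720) -> None:
    cap = cv2.VideoCapture(src)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    if len(frames) < 9:
        raise ValueError("clip too short for an 8N+1 frame count of at least 9")

    # Keep the largest frame count of the form 8N + 1 (e.g. 49, 81).
    frames = frames[: (len(frames) - 1) // 8 * 8 + 1]

    writer = cv2.VideoWriter(dst, cv2.VideoWriter_fourcc(*"mp4v"), 8, (width, height))
    for frame in frames:
        h, w = frame.shape[:2]
        # Scale so the frame fully covers the target size, then crop the center.
        scale = max(width / w, height / h)
        resized = cv2.resize(frame, (math.ceil(w * scale), math.ceil(h * scale)))
        y = (resized.shape[0] - height) // 2
        x = (resized.shape[1] - width) // 2
        writer.write(resized[y : y + height, x : x + width])
    writer.release()
```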

> **Important Note**: To improve training efficiency, we automatically encode videos and cache the results on disk. If you modify the data after training has begun, please delete the `latent` directory under the `videos/` folder to ensure that the latest data is used.
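
If you need to invalidate this cache after changing your data, deleting the directory is sufficient. A minimal sketch, where the dataset root is a placeholder for your own `--data_root`:

```python
# Sketch: remove cached latents so videos are re-encoded from the updated data.
import shutil

shutil.rmtree("path/to/dataset/videos/latent", ignore_errors=True)  # placeholder path
```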

### Text-to-Video (T2V) Fine-tuning

```bash
# Modify the configuration parameters in accelerate_train_t2v.sh
# The main parameters to modify are:
# --output_dir: Output directory
# --data_root: Root directory of the dataset
# --caption_column: Path to the prompt file
# --video_column: Path to the video list file
# --train_resolution: Training resolution (frames x height x width)
# Refer to the start script for other important parameters

bash accelerate_train_t2v.sh
```

### Image-to-Video (I2V) Fine-tuning

```bash
# Modify the configuration parameters in accelerate_train_i2v.sh
# In addition to the T2V parameters above, you also need to set:
# --image_column: Path to the reference image list file (omit this parameter if no reference images are provided)
# Refer to the start script for other important parameters

bash accelerate_train_i2v.sh
```

## Load the Fine-tuned Model

+ Please refer to [cli_demo.py](../inference/cli_demo.py) for instructions on how to load the fine-tuned model.
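
For orientation, below is a minimal loading sketch rather than the repository's exact flow: it assumes a `diffusers` version with CogVideoX LoRA support, and the model ID, checkpoint path, adapter name, and prompt are placeholders; `cli_demo.py` remains the authoritative reference.

```python
# Minimal sketch: load a LoRA checkpoint into the CogVideoX pipeline.
# Assumes diffusers with CogVideoX support; paths and names are placeholders.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("path/to/output_dir", adapter_name="cogvideox-lora")
pipe.set_adapters(["cogvideox-lora"], [1.0])  # adapter weight; tune as needed
pipe.enable_model_cpu_offload()  # reduces VRAM usage at some speed cost

video = pipe(
    prompt="A black-and-white cartoon mouse pilots a steamboat.",
    num_frames=49,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```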

## Best Practices

+ We included 70 training videos with a resolution of `200 x 480 x 720` (frames x height x width). Through frame skipping in the data preprocessing, we created two smaller datasets of 49 and 16 frames to speed up experiments, since the maximum frame count recommended by the CogVideoX team is 49. The 70 videos were divided into three groups of 10, 25, and 50 videos with similar conceptual content.
+ Using 25 or more videos works best when training new concepts and styles.
+ Training works better with an identifier token, which can be specified via `--id_token`. This is similar to Dreambooth training, though regular fine-tuning without such a token also works.
+ The original repository uses `lora_alpha` set to 1. We found this value ineffective across several runs, likely due to differences in the model backend and training settings. Our recommendation is to set `lora_alpha` equal to the rank or to `rank // 2` (PEFT scales the LoRA update by `lora_alpha / rank`, so `lora_alpha = rank` gives a scale of 1).
+ We recommend using a rank of 64 or higher.