Spatial distortion of the subject in generated videos after full-parameter finetuning #116

Closed
CacacaLalala opened this issue Aug 12, 2024 · 17 comments

@CacacaLalala

Hi! This is a great open-source repo!
I recently tried both LoRA and full-parameter finetuning, using the same 50 videos for 500 iterations each, with all other settings unchanged.
I found that after full-parameter finetuning, the subject of the generated videos is severely distorted; LoRA finetuning shows no such obvious distortion.
Below are results for the same prompt, "spider making a web":

After full-parameter finetuning:

000000.mp4

After LoRA finetuning:

000000.mp4

What might be causing this? Is the finetuning learning rate too high?
Looking forward to your reply, thanks!

@zRzRzRzRzRzRzR
Member

How much data did you use for finetuning? We recommend 100 similar videos. Also, did you use the default config? Could you share how the loss decreases?

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR self-assigned this Aug 13, 2024
@CacacaLalala
Author

How much data did you use for finetuning? We recommend 100 similar videos. Also, did you use the default config? Could you share how the loss decreases?

Thanks for your reply!
I want to continue training on top of your released weights with additional data, so I first randomly sampled 50 videos from my dataset.
Yes, the default config. The training config is as follows:
```yaml
args:
  checkpoint_activations: true
  model_parallel_size: 1
  experiment_name: finetune-openvid-framesmin180-max500-origin-dataset
  mode: finetune
  load: CogVideoX-2b-sat/transformer
  no_load_rng: true
  train_iters: 10000
  eval_iters: 1
  eval_interval: 10000
  eval_batch_size: 1
  save: output
  save_interval: 100
  log_interval: 20
  train_data:
    - dataset/mini_dataset/cogvideo/videos
  valid_data:
    - dataset/mini_dataset/cogvideo/videos
  split: 1,0,0
  num_workers: 8
  force_train: true
  only_log_video_latents: true

data:
  target: data_video.SFTDataset
  params:
    video_size: [480, 720]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.0

deepspeed:
  train_micro_batch_size_per_gpu: 1
  gradient_accumulation_steps: 1
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
    enabled: false
  fp16:
    enabled: true
    loss_scale: 0
    loss_scale_window: 400
    hysteresis: 2
    min_loss_scale: 1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.0002
      betas: [0.9, 0.95]
      eps: 1.0e-08
      weight_decay: 0.0001
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false

model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  log_keys:
    - txt

  denoiser_config:
    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
    params:
      num_idx: 1000
      quantize_c_noise: false
      weighting_config:
        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
      scaling_config:
        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

  network_config:
    target: dit_video_concat.DiffusionTransformer
    params:
      time_embed_dim: 512
      elementwise_affine: true
      num_frames: 49
      time_compressed_rate: 4
      latent_width: 90
      latent_height: 60
      num_layers: 30
      patch_size: 2
      in_channels: 16
      out_channels: 16
      hidden_size: 1920
      adm_in_channels: 256
      num_attention_heads: 30
      transformer_args:
        checkpoint_activations: true
        vocab_size: 1
        max_sequence_length: 64
        layernorm_order: pre
        skip_init: false
        model_parallel_size: 1
        is_decoder: false
      modules:
        pos_embed_config:
          target: dit_video_concat.Basic3DPositionEmbeddingMixin
          params:
            text_length: 226
            height_interpolation: 1.875
            width_interpolation: 1.875
        patch_embed_config:
          target: dit_video_concat.ImagePatchEmbeddingMixin
          params:
            text_hidden_size: 4096
        adaln_layer_config:
          target: dit_video_concat.AdaLNMixin
          params:
            qk_ln: true
        final_layer_config:
          target: dit_video_concat.FinalLayerMixin

  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: false
          input_key: txt
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: ckpts/cogvideo/t5-v1_1-xxl
            max_length: 226

  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
      ckpt_path: CogVideoX-2b-sat/vae/3d-vae.pt
      ignore_keys:
        - loss
      loss_config:
        target: torch.nn.Identity
      regularizer_config:
        target: vae_modules.regularizers.DiagonalGaussianRegularizer
      encoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [1, 2, 2, 4]
          attn_resolutions: []
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: true
      decoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [1, 2, 2, 4]
          attn_resolutions: []
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: false

  loss_fn_config:
    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
    params:
      offset_noise_level: 0
      sigma_sampler_config:
        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
        params:
          uniform_sampling: true
          num_idx: 1000
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 3.0

  sampler_config:
    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
    params:
      num_steps: 50
      verbose: true
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0
      guider_config:
        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
        params:
          scale: 6
          exp: 5
          num_steps: 50
```
Sorry, I haven't modified the repo much yet, so I haven't recorded the loss.
Did you observe this kind of spatial distortion when doing full-parameter finetuning? After I lowered the learning rate the problem improved, but the distortion still gets worse and worse as training progresses.

After 500 iterations:

000000.mp4

After 4000 iterations:

000000.mp4

Looking forward to your reply!

@tengjiayan20
Contributor

Yes. For LoRA, an lr of 1e-4 to 1e-3 is fine, but for full-parameter fine-tuning use around 1e-5.
We will update the config files and fine-tuning instructions soon.
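
For reference, a minimal sketch of that change against the training config quoted above (only the lr line differs from what was posted):

```yaml
optimizer:
  type: sat.ops.FusedEmaAdam
  params:
    lr: 1.0e-05   # ~1e-5 for full-parameter fine-tuning; 1e-4 to 1e-3 works for LoRA
    betas: [0.9, 0.95]
    eps: 1.0e-08
    weight_decay: 0.0001
```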

@CacacaLalala
Author

Yes. For LoRA, an lr of 1e-4 to 1e-3 is fine, but for full-parameter fine-tuning use around 1e-5. We will update the config files and fine-tuning instructions soon.

Are there other factors besides the learning rate? The learning rate I'm currently using is 1e-5, but as training progresses I still observe a gradual decline in spatial quality.
Looking forward to your reply!

@tengjiayan20
Contributor

Are there other factors besides the learning rate? The learning rate I'm currently using is 1e-5, but as training progresses I still observe a gradual decline in spatial quality. Looking forward to your reply!

Is the prompt you used, "spider making a web", too different from your SFT training data? And what is the total batch size?
Also, in theory, with a small dataset of only 50 videos, training for too long will overfit, producing nearly identical videos.

@CacacaLalala
Author

Is the prompt you used, "spider making a web", too different from your SFT training data? And what is the total batch size? Also, in theory, with a small dataset of only 50 videos, training for too long will overfit, producing nearly identical videos.

The total batch size is 24 × 2 = 48, and I'm now using a dataset of about 1M videos by swapping in my own dataset code. I'll test training again after more iterations. Thanks a lot!

@tengjiayan20 tengjiayan20 self-assigned this Aug 15, 2024
@GFENGG

GFENGG commented Aug 16, 2024

Did you observe this kind of spatial distortion when doing full-parameter finetuning? After I lowered the learning rate the problem improved, but the distortion still gets worse and worse as training progresses.
After 500 iterations: 000000.mp4
After 4000 iterations: 000000.mp4

The results at 4000 iterations also look fairly normal. What exactly is the distortion you are referring to?

@CacacaLalala
Author

The results at 4000 iterations also look fairly normal. What exactly is the distortion you are referring to?

The distortion I originally mentioned was that the spatial structure could be somewhat implausible.
I have now trained for a few more days, and having just tested again, the results look normal. Thanks!

@GFENGG

GFENGG commented Aug 16, 2024

The distortion I originally mentioned was that the spatial structure could be somewhat implausible. I have now trained for a few more days, and having just tested again, the results look normal. Thanks!

I'm also trying to finetune. So was the implausible-spatial-structure problem solved by lowering the learning rate plus training for longer?

@CacacaLalala
Author

...solved by lowering the learning rate plus training for longer?

So far, that seems to be the case.

@a-r-r-o-w

Hey everyone! I have a few questions on finetuning that I would love for you to answer:

  • Is a dataset size of 50-100 videos okay for teaching the model a single concept? Can we go lower?
  • How many total training steps are required for convergence, assuming I have 50 videos and a training batch size of 1? Do we really need 4000+ steps?
  • What initialization works best for LoRA layers? Is the default (A = Kaiming-uniform, B = 0) the best? Can we use Gaussian or other initializations supported in libraries like peft?
  • Do we need the FusedEmaAdam implementation? Do we need EMA at all? Is plain torch.optim.Adam okay for training?
  • Even after a somewhat successful training run, results for the prompts the model was finetuned on are okay, but for any other prompt I get weird-looking, artifacted outputs.
  • How much memory is required to finetune the 5B model? Is it possible on a single A100 GPU? If not, what can be optimized? I've tried VAE slicing and tiling, but it still OOMs even with a training batch size of 1.
  • Has anyone successfully trained a LoRA with a rank lower than 128 that produces good results?
  • What training batch size can you use comfortably on a single 80 GB GPU when finetuning the 2B model?
  • Any tips/techniques for speeding up training?

Thanks to everyone in advance! I might bother you with some more questions.

@rainbow979

Hey everyone! I have a few questions on finetuning that I would love for you to answer: [...]

@a-r-r-o-w Have you got any answers? I'm also very curious about this.

@a-r-r-o-w

Hey, yes I do! We worked together with Yuxuan from the CogVideoX team here: https://github.com/a-r-r-o-w/cogvideox-factory

  • 50+ videos is great for finetuning. I generally use ~200 for my experiments to get more diversity.
  • 2500+ steps is usually enough for teaching a specific style. After speaking with others using cog-factory, it looks like 6000-20000 steps is good for teaching new characters/concepts. The longest finetune I know of is 40000 steps (but not public) on movie-like high-quality data for CogVideoX-Fun using a customized cog-factory script, which turned out very promising.
  • Initialization does not seem to have much effect. The peft defaults are great.
  • Any decent optimizer works well. AdamW is my go-to, but I have also tried the recent ADOPT, which works well too.
  • The artifact issue was caused by a bug in the Diffusers training scripts, which should have been addressed in cog-factory by now.
  • We can finetune in less than 24 GB with batch_size > 1 using TorchAO low-bit optimizers/model quantization + gradient offloading, or DeepSpeed!
  • Yes, LoRA with rank 32 and above works. Make sure the LoRA alpha is at least half the rank (for the Diffusers scripts; I'm not sure about the recommendations in SAT, so you can open a separate issue if interested). See the sketch after this list.
  • On 80 GB, with memory optimizations like precomputing latents/embeddings, optimizer-state offloading, gradient checkpointing, and gradient offloading, you can go up to a batch size of 6-8 on a single GPU.
  • Torch compile with dynamic shapes helps speed up training a bit. The cog-factory scripts have not been particularly profiled for improvements yet, so they could be slow. Precomputing latents/embeddings really helps speed things up, since you only have to load tensors directly without further preprocessing and don't pay the overhead of computing the same embeddings every epoch. It also means you can drop the text encoder and VAE during training to save additional memory.

Let me know if I can help you with anything else!
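
As a concrete illustration of the rank/alpha point above, here is a minimal sketch using the peft library; the target module names are illustrative assumptions, not taken from this thread:

```python
from peft import LoraConfig

# Rank 32 with alpha >= rank/2, using the peft-default initialization
# (Kaiming-uniform A, zero B).
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    init_lora_weights=True,
    # Which attention projections to adapt; these names are illustrative.
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
```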

@rainbow979

Thanks a lot for replying! These answers are very helpful. I have one more question: so we don't need an EMA model for training?

@a-r-r-o-w

I think there was a good recent paper showing that EMA is not particularly helpful for LoRA training, and the results with and without it are not qualitatively very different. It's really hard to see any benefit on small-scale runs at least (<10k steps in my tests), given the added memory requirement.
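
For context, a minimal sketch of what keeping an EMA copy involves (hence the extra memory: one shadow tensor per trainable parameter); the function name here is hypothetical:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average of model weights.

    Each trainable tensor needs a full shadow copy, which is where
    the added memory cost comes from.
    """
    for ema_p, p in zip(ema_params, params):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```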

@crj1998

crj1998 commented Dec 5, 2024

  • The artifact issue was caused by a bug in the Diffusers training scripts, which should have been addressed in cog-factory by now.

@a-r-r-o-w Hi, can you tell us more about which bug was causing the problem?

@crj1998

crj1998 commented Dec 11, 2024

Hey, yes I do! We worked together with Yuxuan from the CogVideoX team here: https://github.com/a-r-r-o-w/cogvideox-factory [...]

@a-r-r-o-w Hi, can you tell us more about which bug was causing the problem? (Referring to: "The artifact issue was caused by a bug in the Diffusers training scripts, which should have been addressed in cog-factory by now.")
