Spatial distortion of the subject in generated videos after full-parameter finetuning #116

Closed
CacacaLalala opened this issue Aug 12, 2024 · 17 comments

@CacacaLalala

Hi! This is a great open-source repo!
I recently tried both LoRA and full-parameter finetuning, using the same 50 videos for 500 iterations each, with all other settings unchanged.
I found that after full-parameter finetuning, the subject of the generated videos is severely distorted; LoRA finetuning shows no such obvious distortion.
Below are results for the same prompt, "spider making a web":

After full-parameter finetuning:

000000.mp4

After LoRA finetuning:

000000.mp4

What might be causing this? Is the finetuning learning rate too high?
Looking forward to your reply, thanks!

@zRzRzRzRzRzRzR
Member

How much data did you use for finetuning? We recommend 100 similar videos. Also, did you use the default config? Could you share how the loss decreases?

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR self-assigned this Aug 13, 2024
@CacacaLalala
Author

How much data did you use for finetuning? We recommend 100 similar videos. Also, did you use the default config? Could you share how the loss decreases?

Thanks for your reply!
I want to continue training on top of your released weights with additional data, so I first randomly sampled 50 videos from my dataset.
Yes, the default config. The training config is as follows:
```yaml
args:
  checkpoint_activations: true
  model_parallel_size: 1
  experiment_name: finetune-openvid-framesmin180-max500-origin-dataset
  mode: finetune
  load: CogVideoX-2b-sat/transformer
  no_load_rng: true
  train_iters: 10000
  eval_iters: 1
  eval_interval: 10000
  eval_batch_size: 1
  save: output
  save_interval: 100
  log_interval: 20
  train_data:
    - dataset/mini_dataset/cogvideo/videos
  valid_data:
    - dataset/mini_dataset/cogvideo/videos
  split: 1,0,0
  num_workers: 8
  force_train: true
  only_log_video_latents: true

data:
  target: data_video.SFTDataset
  params:
    video_size: [480, 720]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.0

deepspeed:
  train_micro_batch_size_per_gpu: 1
  gradient_accumulation_steps: 1
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
    enabled: false
  fp16:
    enabled: true
    loss_scale: 0
    loss_scale_window: 400
    hysteresis: 2
    min_loss_scale: 1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.0002
      betas: [0.9, 0.95]
      eps: 1.0e-08
      weight_decay: 0.0001
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false

model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  log_keys:
    - txt

  denoiser_config:
    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
    params:
      num_idx: 1000
      quantize_c_noise: false
      weighting_config:
        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
      scaling_config:
        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

  network_config:
    target: dit_video_concat.DiffusionTransformer
    params:
      time_embed_dim: 512
      elementwise_affine: true
      num_frames: 49
      time_compressed_rate: 4
      latent_width: 90
      latent_height: 60
      num_layers: 30
      patch_size: 2
      in_channels: 16
      out_channels: 16
      hidden_size: 1920
      adm_in_channels: 256
      num_attention_heads: 30
      transformer_args:
        checkpoint_activations: true
        vocab_size: 1
        max_sequence_length: 64
        layernorm_order: pre
        skip_init: false
        model_parallel_size: 1
        is_decoder: false
      modules:
        pos_embed_config:
          target: dit_video_concat.Basic3DPositionEmbeddingMixin
          params:
            text_length: 226
            height_interpolation: 1.875
            width_interpolation: 1.875
        patch_embed_config:
          target: dit_video_concat.ImagePatchEmbeddingMixin
          params:
            text_hidden_size: 4096
        adaln_layer_config:
          target: dit_video_concat.AdaLNMixin
          params:
            qk_ln: true
        final_layer_config:
          target: dit_video_concat.FinalLayerMixin

  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: false
          input_key: txt
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: ckpts/cogvideo/t5-v1_1-xxl
            max_length: 226

  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
      ckpt_path: CogVideoX-2b-sat/vae/3d-vae.pt
      ignore_keys:
        - loss
      loss_config:
        target: torch.nn.Identity
      regularizer_config:
        target: vae_modules.regularizers.DiagonalGaussianRegularizer
      encoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [1, 2, 2, 4]
          attn_resolutions: []
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: true
      decoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [1, 2, 2, 4]
          attn_resolutions: []
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: false

  loss_fn_config:
    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
    params:
      offset_noise_level: 0
      sigma_sampler_config:
        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
        params:
          uniform_sampling: true
          num_idx: 1000
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 3.0

  sampler_config:
    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
    params:
      num_steps: 50
      verbose: true
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0
      guider_config:
        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
        params:
          scale: 6
          exp: 5
          num_steps: 50
```
Sorry, I haven't modified the repo much yet, so I haven't recorded the loss.
Did you observe this kind of spatial distortion when doing full-parameter finetuning? After I lowered the learning rate the problem improved, but the distortion still gets worse and worse as training progresses.

After 500 iterations:

000000.mp4

After 4000 iterations:

000000.mp4

Looking forward to your reply!

@tengjiayan20
Contributor

Yes. For LoRA, an lr of 1e-4 to 1e-3 is fine, but for full-parameter fine-tuning use around 1e-5.
We will update the config files and fine-tuning instructions soon.
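
For reference, a minimal sketch of that change against the training config quoted above (only the lr line differs from what was posted):

```yaml
optimizer:
  type: sat.ops.FusedEmaAdam
  params:
    lr: 1.0e-05   # ~1e-5 for full-parameter fine-tuning; 1e-4 to 1e-3 works for LoRA
    betas: [0.9, 0.95]
    eps: 1.0e-08
    weight_decay: 0.0001
```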

@CacacaLalala
Author

Yes. For LoRA, an lr of 1e-4 to 1e-3 is fine, but for full-parameter fine-tuning use around 1e-5. We will update the config files and fine-tuning instructions soon.

Are there other factors besides the learning rate? The learning rate I'm currently using is 1e-5, but as training progresses I still observe a gradual decline in spatial quality.
Looking forward to your reply!

@tengjiayan20
Contributor

Are there other factors besides the learning rate? The learning rate I'm currently using is 1e-5, but as training progresses I still observe a gradual decline in spatial quality. Looking forward to your reply!

Is the prompt you used, "spider making a web", too different from your SFT training data? And what is the total batch size?
Also, in theory, with a small dataset of only 50 videos, training for too long will overfit, producing nearly identical videos.

@CacacaLalala
Author

Is the prompt you used, "spider making a web", too different from your SFT training data? And what is the total batch size? Also, in theory, with a small dataset of only 50 videos, training for too long will overfit, producing nearly identical videos.

The total batch size is 24 × 2 = 48, and I'm now using a dataset of about 1M videos by swapping in my own dataset code. I'll test training again after more iterations. Thanks a lot!

@tengjiayan20 tengjiayan20 self-assigned this Aug 15, 2024
@GFENGG

GFENGG commented Aug 16, 2024

Did you observe this kind of spatial distortion when doing full-parameter finetuning? After I lowered the learning rate the problem improved, but the distortion still gets worse and worse as training progresses.
After 500 iterations: 000000.mp4
After 4000 iterations: 000000.mp4

The results at 4000 iterations also look fairly normal. What exactly is the distortion you are referring to?

@CacacaLalala
Author

The results at 4000 iterations also look fairly normal. What exactly is the distortion you are referring to?

The distortion I originally mentioned was that the spatial structure could be somewhat implausible.
I have now trained for a few more days, and having just tested again, the results look normal. Thanks!

@GFENGG

GFENGG commented Aug 16, 2024

The distortion I originally mentioned was that the spatial structure could be somewhat implausible. I have now trained for a few more days, and having just tested again, the results look normal. Thanks!

I'm also trying to finetune. So was the implausible-spatial-structure problem solved by lowering the learning rate plus training for longer?

@CacacaLalala
Author

...solved by lowering the learning rate plus training for longer?

So far, that seems to be the case.

@a-r-r-o-w

Hey everyone! I have a few questions on finetuning that I would love for you to answer:

  • Is a dataset size of 50-100 videos okay for teaching the model a single concept? Can we go lower?
  • How many total training steps are required for convergence, assuming I have 50 videos and a training batch size of 1? Do we really need 4000+ steps?
  • What initialization works best for LoRA layers? Is the default (A = Kaiming-uniform, B = 0) the best? Can we use Gaussian or other initializations supported in libraries like peft?
  • Do we need the FusedEmaAdam implementation? Do we need EMA at all? Is plain torch.optim.Adam okay for training?
  • Even after a somewhat successful training run, results for the prompts the model was finetuned on are okay, but for any other prompt I get weird-looking, artifacted outputs.
  • How much memory is required to finetune the 5B model? Is it possible on a single A100 GPU? If not, what can be optimized? I've tried VAE slicing and tiling, but it still OOMs even with a training batch size of 1.
  • Has anyone successfully trained a LoRA with a rank lower than 128 that produces good results?
  • What training batch size can you use comfortably on a single 80 GB GPU when finetuning the 2B model?
  • Any tips/techniques for speeding up training?

Thanks to everyone in advance! I might bother you with some more questions.

@rainbow979

Hey everyone! I have a few questions on finetuning that I would love for you to answer: [...]

@a-r-r-o-w Have you got any answers? I'm also very curious about this.

@a-r-r-o-w

Hey, yes I do! We worked together with Yuxuan from the CogVideoX team here: https://github.com/a-r-r-o-w/cogvideox-factory

  • 50+ videos is great for finetuning. I generally use ~200 for my experiments to get more diversity.
  • 2500+ steps is usually enough for teaching a specific style. After speaking with others using cog-factory, it looks like 6000-20000 steps is good for teaching new characters/concepts. The longest finetune I know of is 40000 steps (but not public) on movie-like high-quality data for CogVideoX-Fun using a customized cog-factory script, which turned out very promising.
  • Initialization does not seem to have much effect. The peft defaults are great.
  • Any decent optimizer works well. AdamW is my go-to, but I have also tried the recent ADOPT, which works well too.
  • The artifact issue was caused by a bug in the Diffusers training scripts, which should have been addressed in cog-factory by now.
  • We can finetune in less than 24 GB with batch_size > 1 using TorchAO low-bit optimizers/model quantization + gradient offloading, or DeepSpeed!
  • Yes, LoRA with rank 32 and above works. Make sure the LoRA alpha is at least half the rank (for the Diffusers scripts; I'm not sure about the recommendations in SAT, so you can open a separate issue if interested). See the sketch after this list.
  • On 80 GB, with memory optimizations like precomputing latents/embeddings, optimizer-state offloading, gradient checkpointing, and gradient offloading, you can go up to a batch size of 6-8 on a single GPU.
  • Torch compile with dynamic shapes helps speed up training a bit. The cog-factory scripts have not been particularly profiled for improvements yet, so they could be slow. Precomputing latents/embeddings really helps speed things up, since you only have to load tensors directly without further preprocessing and don't pay the overhead of computing the same embeddings every epoch. It also means you can drop the text encoder and VAE during training to save additional memory.

Let me know if I can help you with anything else!
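
As a concrete illustration of the rank/alpha point above, here is a minimal sketch using the peft library; the target module names are illustrative assumptions, not taken from this thread:

```python
from peft import LoraConfig

# Rank 32 with alpha >= rank/2, using the peft-default initialization
# (Kaiming-uniform A, zero B).
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    init_lora_weights=True,
    # Which attention projections to adapt; these names are illustrative.
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
```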

@rainbow979

Thanks a lot for replying! These answers are very helpful. I have one more question: so we don't need an EMA model for training?

@a-r-r-o-w

I think there was a good recent paper showing that EMA is not particularly helpful for LoRA training, and the results with and without it are not qualitatively very different. It's really hard to see any benefit on small-scale runs at least (<10k steps in my tests), given the added memory requirement.
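
For context, a minimal sketch of what keeping an EMA copy involves (hence the extra memory: one shadow tensor per trainable parameter); the function name here is hypothetical:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average of model weights.

    Each trainable tensor needs a full shadow copy, which is where
    the added memory cost comes from.
    """
    for ema_p, p in zip(ema_params, params):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```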

@crj1998

crj1998 commented Dec 5, 2024

  • The artifact issue was caused by a bug in the Diffusers training scripts, which should have been addressed in cog-factory by now.

@a-r-r-o-w Hi, can you tell us more about which bug was causing the problem?

@crj1998

crj1998 commented Dec 11, 2024

Hey, yes I do! We worked together with Yuxuan from the CogVideoX team here: https://github.com/a-r-r-o-w/cogvideox-factory [...]

@a-r-r-o-w Hi, can you tell us more about which bug was causing the problem? (Referring to: "The artifact issue was caused by a bug in the Diffusers training scripts, which should have been addressed in cog-factory by now.")
