support HunYuan DiT #1378

Draft · wants to merge 21 commits into base: dev

Conversation

@KohakuBlueleaf (Contributor) commented Jun 21, 2024

[WIP] This is a draft PR for contributors to check the progress and review the code.

This PR starts with my simple implementation of minimal inference plus some modifications:

  1. Modify the initialization method of HunYuanDiT to avoid the argparse requirement.
  2. Replace flash_attn with PyTorch SDP and xformers implementations.
  3. Implement gradient checkpointing to save tons of VRAM.
  4. Support the "CLIP concat" trick for long prompts (a sketch follows this list).
    • Needs review by the HunYuan team. It should behave the same as the original with max_length_clip=77.
  5. Add a test script for a quick inference check.
    • I didn't follow the xxx_minimal_inference naming style, so I called it hunyuan_test.py, but it can be seen as a minimal inference script.
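
As a reference for item 4, here is a minimal sketch of the "CLIP concat" idea, assuming a Hugging Face tokenizer and a BertModel-style CLIP text encoder already on the target device; the function name encode_clip_long is illustrative, not the actual name used in this PR. The prompt is split into chunks of at most max_length_clip tokens, each chunk is encoded separately, and the hidden states are concatenated along the sequence axis, so a short prompt with max_length_clip=77 reduces to a single chunk and should match the original single-pass encoding.

    import torch
    from transformers import AutoTokenizer, BertModel

    def encode_clip_long(prompt, tokenizer, encoder, max_length_clip=77, device="cuda"):
        # Tokenize without truncation so long prompts keep all of their tokens.
        ids = tokenizer(prompt, truncation=False)["input_ids"]
        # Split into chunks of at most max_length_clip tokens.
        chunks = [ids[i:i + max_length_clip] for i in range(0, len(ids), max_length_clip)]
        hidden_states, masks = [], []
        for chunk in chunks:
            pad = max_length_clip - len(chunk)
            input_ids = torch.tensor([chunk + [tokenizer.pad_token_id] * pad], device=device)
            mask = torch.tensor([[1] * len(chunk) + [0] * pad], device=device)
            out = encoder(input_ids=input_ids, attention_mask=mask)
            hidden_states.append(out.last_hidden_state)
            masks.append(mask)
        # Concatenate along the sequence axis; a single chunk equals the original behavior.
        return torch.cat(hidden_states, dim=1), torch.cat(masks, dim=1)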

Notes about loading the model

The directory structure I used is:

model/
  clip/
  denoiser/
  mt5/
  vae/

Basically, download the files from the t2i folder of HunYuanDiT, put the contents of clip_text_encoder and tokenizer into clip, put mt5 into mt5, put model into denoiser, and put sdxl-vae-fp16-fix into vae.

This spec can be changed if needed.
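
For illustration, loading from this layout might look like the minimal sketch below, assuming the DiT_g_2 and MT5Embedder classes from this PR's library/hunyuan_models.py and the file names from the original t2i folder; the root path and dtypes are up to you.

    import os
    import torch
    from transformers import AutoTokenizer, BertModel
    from diffusers import AutoencoderKL
    from library.hunyuan_models import DiT_g_2, MT5Embedder  # classes from this PR

    root = "model"  # the directory layout described above

    # CLIP tokenizer and text encoder
    clip_tokenizer = AutoTokenizer.from_pretrained(os.path.join(root, "clip"))
    clip_encoder = BertModel.from_pretrained(os.path.join(root, "clip")).half().cuda()

    # mT5 text encoder
    mt5_embedder = MT5Embedder(os.path.join(root, "mt5"), torch_dtype=torch.float16, max_length=256)

    # VAE (sdxl-vae-fp16-fix)
    vae = AutoencoderKL.from_pretrained(os.path.join(root, "vae")).half().cuda()

    # Denoiser (the DiT itself)
    denoiser, patch_size, head_dim = DiT_g_2(input_size=(128, 128))
    denoiser.load_state_dict(torch.load(os.path.join(root, "denoiser/pytorch_model_module.pt")))
    denoiser = denoiser.half().cuda()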

TODO List

  • Examine the current implementation of the modified parts
  • Bundle format support (if possible)
  • Training utils (if needed)
    • *Tokenizer/TE-related things for Dataset
  • Training script (modified from sdxl_train.py)
  • *LoRA/LyCORIS training script (modified from sdxl_train_network.py)
    • Initial support
    • Unique training arg region
    • Implementation check
  • *LoRA module support
    • kohya LoRA
    • LyCORIS

Low Priority TODO List

  • cache TE embeddings.

Notification to contributors

  • You can assume that the create_network method from the imported network module will work correctly.
    • Kohya and I will ensure that.
  • Check sdxl_train.py, sdxl_train_network.py, and the dataset-related code carefully before starting development. It is very likely that we only need a few modifications to make things work. Try to avoid any "full rework".
  • If you want to contribute to this PR, open another PR into this branch: https://github.com/KohakuBlueleaf/sd-scripts/tree/HunYuanDiT
    • I will check all related PRs/issues frequently this week.

@KohakuBlueleaf (Contributor Author)

For anyone who wants to try HunYuan but doesn't want to download the original 44GB files:
https://huggingface.co/KBlueLeaf/HunYuanDiT-V1.1-fp16-pruned

It is only 8GB.
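
If you prefer to fetch it programmatically, one option (assuming huggingface_hub is installed; the local_dir name is arbitrary) is:

    from huggingface_hub import snapshot_download

    # Download the ~8GB pruned HunYuanDiT V1.1 repo into ./model
    snapshot_download("KBlueLeaf/HunYuanDiT-V1.1-fp16-pruned", local_dir="model")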

@KohakuBlueleaf (Contributor Author)

After commit fb3e8a7, LyCORIS/LoRA training is usable (some implementation details still need checking, but it works).
[image]

Some functionality is not usable at this moment and will be fixed in the future.

@KohakuBlueleaf (Contributor Author)

Requirements for LoRA/LyCORIS training on HunYuan:

  • lycoris-lora>=3.0.0.dev10
  • 9GB VRAM to train the U-Net only (i.e., the DiT only)
  • 12GB VRAM to train the U-Net (DiT) plus the text encoders
    • The two figures above assume cached latents, gradient checkpointing, and batch size 1
    • TE embedding caching is not enabled
    • full bf16/fp16

@KohakuBlueleaf (Contributor Author)

Currently, fp16 (mixed precision) causes NaN loss.
I will check which part goes wrong.

@KohakuBlueleaf (Contributor Author)

FP16 fixed

@KohakuBlueleaf (Contributor Author)

I used the umamusume dataset (with Danbooru tag prompt format) to train HunYuan DiT V1.1 with batch size 8 for 600 steps (DiT only).
Looks like my implementation is OK.

Original vs. with LoKr (12MB, bs8, 600 steps):
[images: test-no-lokr, test-lokr]

@KohakuBlueleaf (Contributor Author) commented Jun 24, 2024

For those who want to try HunYuan training, here is an example script:
[image of example training script]

@sdbds (Contributor) commented Jun 26, 2024

Left: original dataset. Right: test result.
[images]

xljh0520 and others added 5 commits June 28, 2024 22:58

  • support hunyuan lora in CLIP text encoder and DiT blocks
  • add hunyuan lora test script
  • append lora blocks in target module
  • Support HunYuanDiT v1.1 and v1.2 lora
  • add use_extra_cond for hy_train_network
  • change model version
  • Update hunyuan_train_network.py
  • Update hunyuan_train.py

Co-authored-by: leoriojhli <[email protected]>
@tristanwqy

If HunyuanDiT also uses the VAE from SDXL, does that mean the prepare-bucket-latents step can reuse the data from SDXL?

@KohakuBlueleaf (Contributor Author)

If HunyuanDiT also uses the VAE from SDXL, does that mean the prepare-bucket-latents step can reuse the data from SDXL?

Yes, and kohya's latent caching only checks the size, so you can use a dataset folder that already has cached latents.

@tristanwqy

If HunyuanDiT also uses the VAE from SDXL, does that mean the prepare-bucket-latents step can reuse the data from SDXL?

Yes, and kohya's latent caching only checks the size, so you can use a dataset folder that already has cached latents.

OK, thanks.

@tristanwqy

I was trying to run the code below in library/hunyuan_models

    root = "/workspace/models/hunyuan/HunYuanDiT-V1.2-fp16-pruned/"
    denoiser, patch_size, head_dim = DiT_g_2(input_size=(128, 128))
    sd = torch.load(os.path.join(root, "denoiser/pytorch_model_module.pt"))
    denoiser.load_state_dict(sd)
    denoiser.half().cuda()
    denoiser.enable_gradient_checkpointing()

    clip_tokenizer = AutoTokenizer.from_pretrained(os.path.join(root, "clip"))
    clip_encoder = BertModel.from_pretrained(os.path.join(root, "clip")).half().cuda()

    mt5_embedder = MT5Embedder(os.path.join(root, "mt5"), torch_dtype=torch.float16, max_length=256)

    vae = AutoencoderKL.from_pretrained(os.path.join(root, "vae")).half().cuda()

    print(sum(p.numel() for p in denoiser.parameters()) / 1e6)
    print(sum(p.numel() for p in mt5_embedder.parameters()) / 1e6)
    print(sum(p.numel() for p in clip_encoder.parameters()) / 1e6)
    print(sum(p.numel() for p in vae.parameters()) / 1e6)

but failed with

    Use xformers attention implementation.
    Number of tokens: 4096
Traceback (most recent call last):
  File "/home/ubuntu/sd-scripts/library/hunyuan_models.py", line 1287, in <module>
    denoiser.load_state_dict(sd)
  File "/home/ubuntu/miniconda3/envs/training/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for HunYuanDiT:
	Unexpected key(s) in state_dict: "style_embedder.weight". 
	size mismatch for extra_embedder.0.weight: copying a param with shape torch.Size([5632, 3968]) from checkpoint, the shape in current model is torch.Size([5632, 1024]).

@sdbds (Contributor) commented Jul 4, 2024

V1.2 deleted style_embedder.weight; wait for them to fix it.

@KohakuBlueleaf (Contributor Author)

I think the problem is that we need the --extra_cond arg for v1.0/v1.1 to enable the extra conditions.

I'm not sure whether you have implemented this in the train network script yet, but full training should work with that arg.
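
As a rough illustration of the version difference (a hypothetical heuristic, not code from this PR): v1.2 checkpoints dropped style_embedder.weight, so its presence in the state dict is one way to tell whether the extra conditions need to be enabled.

    import torch

    # Hypothetical heuristic: v1.0/v1.1 checkpoints still contain the style embedder
    # weights that v1.2 removed, so inspect the state dict before deciding whether
    # to build the DiT with extra conditions enabled.
    sd = torch.load("model/denoiser/pytorch_model_module.pt", map_location="cpu")
    use_extra_cond = "style_embedder.weight" in sd
    print("extra cond needed (v1.0/v1.1 checkpoint)" if use_extra_cond
          else "extra cond should be disabled (v1.2 checkpoint)")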

@tristanwqy commented Jul 5, 2024

v1.2 with extra cond enabled works, but the generated images look a little weird, which doesn't make sense.
On the other hand, v1.1 works perfectly fine.

@KohakuBlueleaf (Contributor Author)

You should disable extra cond for v1.2
