Dreambooth Flux training does not save a model for around 10-15 minutes #9501

Open
kopyl opened this issue Sep 23, 2024 · 16 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

kopyl commented Sep 23, 2024

Describe the bug

This time I set the number of training steps to 2 to make sure the model would be saved correctly after an hour of training. It is not saved.

Reproduction

Run accelerate config

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
!git clone https://github.com/huggingface/diffusers
%cd diffusers

!pip install -e .
!pip install -r examples/dreambooth/requirements_flux.txt
!pip install prodigyopt

import huggingface_hub
huggingface_hub.notebook_login()

MODEL_NAME="black-forest-labs/FLUX.1-dev"
INSTANCE_DIR="/dreambooth-datasets/yaremovaa"
OUTPUT_DIR="/flux-dreambooth-outputs/dreamboot-yaremovaa"

!accelerate launch examples/dreambooth/train_dreambooth_flux.py \
  --pretrained_model_name_or_path={MODEL_NAME}  \
  --instance_data_dir={INSTANCE_DIR} \
  --output_dir={OUTPUT_DIR} \
  --mixed_precision="bf16" \
  --instance_prompt="a photo of sks girl" \
  --resolution=512 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="prodigy" \
  --learning_rate=1. \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=2 \
  --seed="0" \
  --checkpointing_steps=9999999999999999

Logs

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `40` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
09/23/2024 15:02:12 - INFO - __main__ - Distributed environment: FSDP  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
09/23/2024 15:02:12 - INFO - __main__ - Distributed environment: FSDP  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16

You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 16225.55it/s]
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 17623.13it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  8.39it/s]
Fetching 3 files: 100%|█████████████████████████| 3/3 [00:00<00:00, 7033.49it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  2.42it/s]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 63230.71it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Using decoupled weight decay
Using decoupled weight decay
09/23/2024 15:02:30 - INFO - __main__ - ***** Running training *****
09/23/2024 15:02:30 - INFO - __main__ -   Num examples = 10
09/23/2024 15:02:30 - INFO - __main__ -   Num batches each epoch = 5
09/23/2024 15:02:30 - INFO - __main__ -   Num Epochs = 1
09/23/2024 15:02:30 - INFO - __main__ -   Instantaneous batch size per device = 1
09/23/2024 15:02:30 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 8
09/23/2024 15:02:30 - INFO - __main__ -   Gradient Accumulation steps = 4
09/23/2024 15:02:30 - INFO - __main__ -   Total optimization steps = 2
Steps:   0%|                                              | 0/2 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Steps:   0%|                              | 0/2 [00:23<?, ?it/s, loss=0.4, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|                            | 0/2 [00:24<?, ?it/s, loss=0.416, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|                            | 0/2 [00:25<?, ?it/s, loss=0.327, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:  50%|██████████          | 1/2 [00:28<00:28, 28.25s/it, loss=0.592, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 100%|████████████████████| 2/2 [00:30<00:00, 13.05s/it, loss=0.456, lr=1]
Loading pipeline components...:   0%|                     | 0/7 [00:00<?, ?it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of black-forest-labs/FLUX.1-dev.
Loaded tokenizer_2 as T5TokenizerFast from `tokenizer_2` subfolder of black-forest-labs/FLUX.1-dev.

Loading pipeline components...:  43%|█████▌       | 3/7 [00:00<00:00, 19.11it/s]Loaded vae as AutoencoderKL from `vae` subfolder of black-forest-labs/FLUX.1-dev.


Loading checkpoint shards:   0%|                          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:  50%|█████████         | 1/2 [00:00<00:00,  2.69it/s]

Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  2.87it/s]
Loaded text_encoder_2 as T5EncoderModel from `text_encoder_2` subfolder of black-forest-labs/FLUX.1-dev.

Loading pipeline components...:  71%|█████████▎   | 5/7 [00:00<00:00,  4.63it/s]Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of black-forest-labs/FLUX.1-dev.

Loading pipeline components...:  86%|███████████▏ | 6/7 [00:01<00:00,  5.05it/s]Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 100%|█████████████| 7/7 [00:01<00:00,  6.26it/s]
Configuration saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/vae/config.json
Model weights saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/vae/diffusion_pytorch_model.safetensors
Configuration saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/transformer/config.json
[rank0]:[E923 15:13:12.047707465 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1080, OpType=_ALLGATHER_BASE, NumelIn=169915648, NumelOut=339831296, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
[rank0]:[E923 15:13:12.047923067 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1080, last enqueued NCCL work: 1080, last completed NCCL work: 1079.
[rank0]:[E923 15:13:13.598657835 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1080, last enqueued NCCL work: 1080, last completed NCCL work: 1079.
[rank0]:[E923 15:13:13.598687105 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E923 15:13:13.598692476 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E923 15:13:13.599794925 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1080, OpType=_ALLGATHER_BASE, NumelIn=169915648, NumelOut=339831296, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe9330c6f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fe8854e18f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe8854e8333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe8854ea71c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fe93330edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fe93704e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe937188353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1080, OpType=_ALLGATHER_BASE, NumelIn=169915648, NumelOut=339831296, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe9330c6f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fe8854e18f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe8854e8333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe8854ea71c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fe93330edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fe93704e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe937188353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe9330c6f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7fe885173a84 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7fe93330edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7fe93704e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fe937188353 in /lib/x86_64-linux-gnu/libc.so.6)

E0923 15:13:20.499235 139957517330240 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 127496) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-23_15:13:20
  host      : x2-h100.internal.cloudapp.net
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 127496)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 127496
=======================================================
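
The watchdog line above already carries most of the numbers needed to reason about the hang. A small sketch that pulls them out, assuming the line format shown in this log; parse_watchdog is a hypothetical helper, not part of PyTorch:

```python
import re
from datetime import datetime, timedelta

watchdog_line = (
    "[rank0]:[E923 15:13:12.047707465 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught "
    "collective operation timeout: WorkNCCL(SeqNum=1080, OpType=_ALLGATHER_BASE, "
    "NumelIn=169915648, NumelOut=339831296, Timeout(ms)=600000) ran for 600095 "
    "milliseconds before timing out."
)

def parse_watchdog(line):
    """Extract op type, element counts, timeout and elapsed time from a watchdog line."""
    fields = dict(re.findall(r"(\w+(?:\(ms\))?)=(\w+)", line))
    ran_ms = int(re.search(r"ran for (\d+) milliseconds", line).group(1))
    return {
        "op": fields["OpType"],
        "numel_in": int(fields["NumelIn"]),
        "numel_out": int(fields["NumelOut"]),
        "timeout_s": int(fields["Timeout(ms)"]) / 1000,
        "ran_ms": ran_ms,
    }

info = parse_watchdog(watchdog_line)

# An all-gather's output is the input repeated once per participating rank.
ranks_in_gather = info["numel_out"] // info["numel_in"]

# Work backwards from the time the watchdog fired to when the collective began.
fired_at = datetime(2024, 9, 23, 15, 13, 12)
started_at = fired_at - timedelta(milliseconds=info["ran_ms"])

print(info["op"], info["timeout_s"], ranks_in_gather, started_at.time())
```

Read this way, the timeout in effect was 600 s (10 minutes), the all-gather spans 2 ranks, and it began at about 15:03:12, shortly after the last training step and just after the transformer config was written.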

System Info

Ubuntu 20.04
x2 NVIDIA H100
CUDA 12.2
torch==2.4.1
torchvision==0.19.1
Diffusers commit: ba5af5aebbac0cc18168076a18836f175753d1c7

Who can help?

No response

kopyl added the bug label on Sep 23, 2024

kopyl commented Sep 23, 2024

Is it because it takes a long time to gather the parameters from 2 GPUs plus the CPU?

If so, do I need to set the NCCL_TIMEOUT environment variable to some large number? 1200 does not seem to be enough right now, if that is the real cause.

In case you are wondering how to set it: add a new line to your bash profile file (~/.bash_profile on Ubuntu 20.04):
export NCCL_TIMEOUT=1200
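
A longer timeout only helps if the collective is slow rather than stuck. A collective such as an all-gather finishes only when every rank enters it, so if one rank skips the call, the other waits until the timeout no matter how large it is. The toy sketch below illustrates that behaviour with threading.Barrier standing in for the all-gather; it is an analogy under that assumption, not the training script's code and not a confirmed diagnosis of this issue. run_save and its parameters are hypothetical names.

```python
import threading

def run_save(ranks_entering_collective, world_size=2, timeout_s=0.5):
    """Return each rank's outcome when only some ranks join a collective."""
    barrier = threading.Barrier(world_size)  # stand-in for an all-gather
    outcome = {}

    def rank_fn(rank):
        if rank not in ranks_entering_collective:
            outcome[rank] = "skipped"        # e.g. a rank that fails a main-process check
            return
        try:
            barrier.wait(timeout=timeout_s)  # blocks until every rank arrives
            outcome[rank] = "completed"
        except threading.BrokenBarrierError:
            outcome[rank] = "timed out"      # nobody else ever showed up

    threads = [threading.Thread(target=rank_fn, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome

only_rank0 = run_save({0})      # only rank 0 enters the collective
all_ranks = run_save({0, 1})    # every rank enters the collective
print(only_rank0, all_ranks)
```

In the first call rank 0 times out regardless of how long it is allowed to wait; in the second both ranks complete almost immediately. Checking whether every process reaches the save path would distinguish the two cases.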

kopyl commented Sep 23, 2024

I just tried removing the if accelerator.is_main_process: check so that every process saves (which is not an optimal solution). It saved something, but not everything that is required, and I got this error:

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `40` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
09/23/2024 15:43:02 - INFO - __main__ - Distributed environment: FSDP  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
09/23/2024 15:43:03 - INFO - __main__ - Distributed environment: FSDP  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 16384.00it/s]
Downloading shards: 100%|███████████████████████| 2/2 [00:00<00:00, 4173.44it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  5.79it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  2.75it/s]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 29746.84it/s]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 45100.04it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Using decoupled weight decay
Using decoupled weight decay
09/23/2024 15:43:20 - INFO - __main__ - ***** Running training *****
09/23/2024 15:43:20 - INFO - __main__ -   Num examples = 10
09/23/2024 15:43:20 - INFO - __main__ -   Num batches each epoch = 5
09/23/2024 15:43:20 - INFO - __main__ -   Num Epochs = 1
09/23/2024 15:43:20 - INFO - __main__ -   Instantaneous batch size per device = 1
09/23/2024 15:43:20 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 8
09/23/2024 15:43:20 - INFO - __main__ -   Gradient Accumulation steps = 4
09/23/2024 15:43:20 - INFO - __main__ -   Total optimization steps = 2
Steps:   0%|                                              | 0/2 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Steps:   0%|                              | 0/2 [00:28<?, ?it/s, loss=0.4, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|                            | 0/2 [00:29<?, ?it/s, loss=0.416, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|                            | 0/2 [00:31<?, ?it/s, loss=0.327, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:  50%|██████████          | 1/2 [00:33<00:33, 33.83s/it, loss=0.592, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Loading pipeline components...:   0%|                     | 0/7 [00:00<?, ?it/s]
Loading pipeline components...:   0%|                     | 0/7 [00:00<?, ?it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...:  14%|█▊           | 1/7 [00:00<00:00,  6.52it/s]Loaded tokenizer_2 as T5TokenizerFast from `tokenizer_2` subfolder of black-forest-labs/FLUX.1-dev.

Loading pipeline components...:  29%|███▋         | 2/7 [00:00<00:00, 12.76it/s]Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of black-forest-labs/FLUX.1-dev.
Loaded vae as AutoencoderKL from `vae` subfolder of black-forest-labs/FLUX.1-dev.

Loading checkpoint shards:   0%|                          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████████         | 1/2 [00:00<00:00,  9.24it/s]Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of black-forest-labs/FLUX.1-dev.

Loading pipeline components...:  86%|███████████▏ | 6/7 [00:00<00:00, 14.98it/s]

Loading checkpoint shards:   0%|                          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  8.86it/s]
Loading pipeline components...: 100%|█████████████| 7/7 [00:00<00:00, 10.15it/s]
before save


Loading checkpoint shards:  50%|█████████         | 1/2 [00:00<00:00,  2.85it/s]

Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  3.11it/s]
Loaded text_encoder_2 as T5EncoderModel from `text_encoder_2` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 100%|█████████████| 7/7 [00:01<00:00,  6.44it/s]
before save
Configuration saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/vae/config.json
Model weights saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/vae/diffusion_pytorch_model.safetensors
Configuration saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/transformer/config.json
The model is bigger than the maximum size per checkpoint (10GB) and is going to be split in 5 checkpoint shards. You can find where each parameters has been saved in the index located at /flux-dreambooth-outputs/dreamboot-yaremovaa/transformer/diffusion_pytorch_model.safetensors.index.json.
after save
Configuration saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/scheduler/scheduler_config.json
Configuration saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/model_index.json
after save
Loading pipeline components...:   0%|                     | 0/7 [00:00<?, ?it/s]
Loading pipeline components...:   0%|                     | 0/7 [00:00<?, ?it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of /flux-dreambooth-outputs/dreamboot-yaremovaa.
Loading pipeline components...:  14%|█▊           | 1/7 [00:00<00:00,  6.41it/s]Loaded tokenizer_2 as T5TokenizerFast from `tokenizer_2` subfolder of /flux-dreambooth-outputs/dreamboot-yaremovaa.

Loading pipeline components...:  29%|███▋         | 2/7 [00:00<00:00, 12.71it/s]Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of /flux-dreambooth-outputs/dreamboot-yaremovaa.
Loaded vae as AutoencoderKL from `vae` subfolder of /flux-dreambooth-outputs/dreamboot-yaremovaa.
Loading pipeline components...:  29%|███▋         | 2/7 [00:00<00:00,  6.73it/s]
Loading pipeline components...:  57%|███████▍     | 4/7 [00:00<00:00, 11.60it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1793, in <module>
[rank1]:     main(args)
[rank1]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1750, in main
[rank1]:     pipeline = FluxPipeline.from_pretrained(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank1]:     return fn(*args, **kwargs)
[rank1]:   File "/diffusers/src/diffusers/pipelines/pipeline_utils.py", line 871, in from_pretrained
[rank1]:     loaded_sub_model = load_sub_model(
[rank1]:   File "/diffusers/src/diffusers/pipelines/pipeline_loading_utils.py", line 698, in load_sub_model
[rank1]:     loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank1]:     return fn(*args, **kwargs)
[rank1]:   File "/diffusers/src/diffusers/models/modeling_utils.py", line 774, in from_pretrained
[rank1]:     accelerate.load_checkpoint_and_dispatch(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/big_modeling.py", line 613, in load_checkpoint_and_dispatch
[rank1]:     load_checkpoint_in_model(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/modeling.py", line 1878, in load_checkpoint_in_model
[rank1]:     set_module_tensor_to_device(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/modeling.py", line 341, in set_module_tensor_to_device
[rank1]:     raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
[rank1]: ValueError: FluxTransformer2DModel(
[rank1]:   (pos_embed): FluxPosEmbed()
[rank1]:   (time_text_embed): CombinedTimestepGuidanceTextProjEmbeddings(
[rank1]:     (time_proj): Timesteps()
[rank1]:     (timestep_embedder): TimestepEmbedding(
[rank1]:       (linear_1): Linear(in_features=256, out_features=3072, bias=True)
[rank1]:       (act): SiLU()
[rank1]:       (linear_2): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:     )
[rank1]:     (guidance_embedder): TimestepEmbedding(
[rank1]:       (linear_1): Linear(in_features=256, out_features=3072, bias=True)
[rank1]:       (act): SiLU()
[rank1]:       (linear_2): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:     )
[rank1]:     (text_embedder): PixArtAlphaTextProjection(
[rank1]:       (linear_1): Linear(in_features=768, out_features=3072, bias=True)
[rank1]:       (act_1): SiLU()
[rank1]:       (linear_2): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:     )
[rank1]:   )
[rank1]:   (context_embedder): Linear(in_features=4096, out_features=3072, bias=True)
[rank1]:   (x_embedder): Linear(in_features=64, out_features=3072, bias=True)
[rank1]:   (transformer_blocks): ModuleList(
[rank1]:     (0-18): 19 x FluxTransformerBlock(
[rank1]:       (norm1): AdaLayerNormZero(
[rank1]:         (silu): SiLU()
[rank1]:         (linear): Linear(in_features=3072, out_features=18432, bias=True)
[rank1]:         (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank1]:       )
[rank1]:       (norm1_context): AdaLayerNormZero(
[rank1]:         (silu): SiLU()
[rank1]:         (linear): Linear(in_features=3072, out_features=18432, bias=True)
[rank1]:         (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank1]:       )
[rank1]:       (attn): Attention(
[rank1]:         (norm_q): RMSNorm()
[rank1]:         (norm_k): RMSNorm()
[rank1]:         (to_q): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:         (to_k): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:         (to_v): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:         (add_k_proj): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:         (add_v_proj): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:         (add_q_proj): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:         (to_out): ModuleList(
[rank1]:           (0): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:           (1): Dropout(p=0.0, inplace=False)
[rank1]:         )
[rank1]:         (to_add_out): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:         (norm_added_q): RMSNorm()
[rank1]:         (norm_added_k): RMSNorm()
[rank1]:       )
[rank1]:       (norm2): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank1]:       (ff): FeedForward(
[rank1]:         (net): ModuleList(
[rank1]:           (0): GELU(
[rank1]:             (proj): Linear(in_features=3072, out_features=12288, bias=True)
[rank1]:           )
[rank1]:           (1): Dropout(p=0.0, inplace=False)
[rank1]:           (2): Linear(in_features=12288, out_features=3072, bias=True)
[rank1]:         )
[rank1]:       )
[rank1]:       (norm2_context): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank1]:       (ff_context): FeedForward(
[rank1]:         (net): ModuleList(
[rank1]:           (0): GELU(
[rank1]:             (proj): Linear(in_features=3072, out_features=12288, bias=True)
[rank1]:           )
[rank1]:           (1): Dropout(p=0.0, inplace=False)
[rank1]:           (2): Linear(in_features=12288, out_features=3072, bias=True)
[rank1]:         )
[rank1]:       )
[rank1]:     )
[rank1]:   )
[rank1]:   (single_transformer_blocks): ModuleList(
[rank1]:     (0-37): 38 x FluxSingleTransformerBlock(
[rank1]:       (norm): AdaLayerNormZeroSingle(
[rank1]:         (silu): SiLU()
[rank1]:         (linear): Linear(in_features=3072, out_features=9216, bias=True)
[rank1]:         (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank1]:       )
[rank1]:       (proj_mlp): Linear(in_features=3072, out_features=12288, bias=True)
[rank1]:       (act_mlp): GELU(approximate='tanh')
[rank1]:       (proj_out): Linear(in_features=15360, out_features=3072, bias=True)
[rank1]:       (attn): Attention(
[rank1]:         (norm_q): RMSNorm()
[rank1]:         (norm_k): RMSNorm()
[rank1]:         (to_q): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:         (to_k): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:         (to_v): Linear(in_features=3072, out_features=3072, bias=True)
[rank1]:       )
[rank1]:     )
[rank1]:   )
[rank1]:   (norm_out): AdaLayerNormContinuous(
[rank1]:     (silu): SiLU()
[rank1]:     (linear): Linear(in_features=3072, out_features=6144, bias=True)
[rank1]:     (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank1]:   )
[rank1]:   (proj_out): Linear(in_features=3072, out_features=64, bias=True)
[rank1]: ) does not have a parameter or a buffer named _flat_param.
[rank0]: Traceback (most recent call last):
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1793, in <module>
[rank0]:     main(args)
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1750, in main
[rank0]:     pipeline = FluxPipeline.from_pretrained(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/diffusers/src/diffusers/pipelines/pipeline_utils.py", line 871, in from_pretrained
[rank0]:     loaded_sub_model = load_sub_model(
[rank0]:   File "/diffusers/src/diffusers/pipelines/pipeline_loading_utils.py", line 698, in load_sub_model
[rank0]:     loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/diffusers/src/diffusers/models/modeling_utils.py", line 774, in from_pretrained
[rank0]:     accelerate.load_checkpoint_and_dispatch(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/big_modeling.py", line 613, in load_checkpoint_and_dispatch
[rank0]:     load_checkpoint_in_model(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/modeling.py", line 1878, in load_checkpoint_in_model
[rank0]:     set_module_tensor_to_device(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/modeling.py", line 341, in set_module_tensor_to_device
[rank0]:     raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
[rank0]: ValueError: FluxTransformer2DModel(
[rank0]:   (pos_embed): FluxPosEmbed()
[rank0]:   (time_text_embed): CombinedTimestepGuidanceTextProjEmbeddings(
[rank0]:     (time_proj): Timesteps()
[rank0]:     (timestep_embedder): TimestepEmbedding(
[rank0]:       (linear_1): Linear(in_features=256, out_features=3072, bias=True)
[rank0]:       (act): SiLU()
[rank0]:       (linear_2): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:     )
[rank0]:     (guidance_embedder): TimestepEmbedding(
[rank0]:       (linear_1): Linear(in_features=256, out_features=3072, bias=True)
[rank0]:       (act): SiLU()
[rank0]:       (linear_2): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:     )
[rank0]:     (text_embedder): PixArtAlphaTextProjection(
[rank0]:       (linear_1): Linear(in_features=768, out_features=3072, bias=True)
[rank0]:       (act_1): SiLU()
[rank0]:       (linear_2): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:     )
[rank0]:   )
[rank0]:   (context_embedder): Linear(in_features=4096, out_features=3072, bias=True)
[rank0]:   (x_embedder): Linear(in_features=64, out_features=3072, bias=True)
[rank0]:   (transformer_blocks): ModuleList(
[rank0]:     (0-18): 19 x FluxTransformerBlock(
[rank0]:       (norm1): AdaLayerNormZero(
[rank0]:         (silu): SiLU()
[rank0]:         (linear): Linear(in_features=3072, out_features=18432, bias=True)
[rank0]:         (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank0]:       )
[rank0]:       (norm1_context): AdaLayerNormZero(
[rank0]:         (silu): SiLU()
[rank0]:         (linear): Linear(in_features=3072, out_features=18432, bias=True)
[rank0]:         (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank0]:       )
[rank0]:       (attn): Attention(
[rank0]:         (norm_q): RMSNorm()
[rank0]:         (norm_k): RMSNorm()
[rank0]:         (to_q): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:         (to_k): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:         (to_v): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:         (add_k_proj): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:         (add_v_proj): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:         (add_q_proj): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:         (to_out): ModuleList(
[rank0]:           (0): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:           (1): Dropout(p=0.0, inplace=False)
[rank0]:         )
[rank0]:         (to_add_out): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:         (norm_added_q): RMSNorm()
[rank0]:         (norm_added_k): RMSNorm()
[rank0]:       )
[rank0]:       (norm2): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank0]:       (ff): FeedForward(
[rank0]:         (net): ModuleList(
[rank0]:           (0): GELU(
[rank0]:             (proj): Linear(in_features=3072, out_features=12288, bias=True)
[rank0]:           )
[rank0]:           (1): Dropout(p=0.0, inplace=False)
[rank0]:           (2): Linear(in_features=12288, out_features=3072, bias=True)
[rank0]:         )
[rank0]:       )
[rank0]:       (norm2_context): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank0]:       (ff_context): FeedForward(
[rank0]:         (net): ModuleList(
[rank0]:           (0): GELU(
[rank0]:             (proj): Linear(in_features=3072, out_features=12288, bias=True)
[rank0]:           )
[rank0]:           (1): Dropout(p=0.0, inplace=False)
[rank0]:           (2): Linear(in_features=12288, out_features=3072, bias=True)
[rank0]:         )
[rank0]:       )
[rank0]:     )
[rank0]:   )
[rank0]:   (single_transformer_blocks): ModuleList(
[rank0]:     (0-37): 38 x FluxSingleTransformerBlock(
[rank0]:       (norm): AdaLayerNormZeroSingle(
[rank0]:         (silu): SiLU()
[rank0]:         (linear): Linear(in_features=3072, out_features=9216, bias=True)
[rank0]:         (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank0]:       )
[rank0]:       (proj_mlp): Linear(in_features=3072, out_features=12288, bias=True)
[rank0]:       (act_mlp): GELU(approximate='tanh')
[rank0]:       (proj_out): Linear(in_features=15360, out_features=3072, bias=True)
[rank0]:       (attn): Attention(
[rank0]:         (norm_q): RMSNorm()
[rank0]:         (norm_k): RMSNorm()
[rank0]:         (to_q): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:         (to_k): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:         (to_v): Linear(in_features=3072, out_features=3072, bias=True)
[rank0]:       )
[rank0]:     )
[rank0]:   )
[rank0]:   (norm_out): AdaLayerNormContinuous(
[rank0]:     (silu): SiLU()
[rank0]:     (linear): Linear(in_features=3072, out_features=6144, bias=True)
[rank0]:     (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
[rank0]:   )
[rank0]:   (proj_out): Linear(in_features=3072, out_features=64, bias=True)
[rank0]: ) does not have a parameter or a buffer named _flat_param.
Steps: 100%|███████████████████| 2/2 [05:44<00:00, 172.47s/it, loss=0.456, lr=1]
W0923 15:49:12.822612 140457362716480 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 164009 closing signal SIGTERM
E0923 15:49:15.844907 140457362716480 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 164010) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-23_15:49:12
  host      : x2-h100.internal.cloudapp.net
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 164010)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@kopyl
Contributor Author

kopyl commented Sep 23, 2024

The transformer it saved does not load properly:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 1
----> 1 transformer = FluxTransformer2DModel.from_pretrained("flux-dreambooth-outputs/dreamboot-yaremovaa/transformer")

File [/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py:114](http://48.217.82.250/lab/tree/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py#line=113), in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
    111 if check_use_auth_token:
    112     kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
--> 114 return fn(*args, **kwargs)

File [/diffusers/src/diffusers/models/modeling_utils.py:774](http://48.217.82.250/lab/tree/diffusers/src/diffusers/models/modeling_utils.py#line=773), in ModelMixin.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    772     force_hook = False
    773 try:
--> 774     accelerate.load_checkpoint_and_dispatch(
    775         model,
    776         model_file if not is_sharded else index_file,
    777         device_map,
    778         max_memory=max_memory,
    779         offload_folder=offload_folder,
    780         offload_state_dict=offload_state_dict,
    781         dtype=torch_dtype,
    782         force_hooks=force_hook,
    783         strict=True,
    784     )
    785 except AttributeError as e:
    786     # When using accelerate loading, we do not have the ability to load the state
    787     # dict and rename the weight names manually. Additionally, accelerate skips
   (...)
    792     # names to the new non-deprecated names. Then we _greatly encourage_ the user to convert
    793     # the weights so we don't have to do this again.
    795     if "'Attention' object has no attribute" in str(e):

File [/usr/local/lib/python3.8/dist-packages/accelerate/big_modeling.py:613](http://48.217.82.250/lab/tree/usr/local/lib/python3.8/dist-packages/accelerate/big_modeling.py#line=612), in load_checkpoint_and_dispatch(model, checkpoint, device_map, max_memory, no_split_module_classes, offload_folder, offload_buffers, dtype, offload_state_dict, skip_keys, preload_module_classes, force_hooks, strict)
    611 if offload_state_dict is None and device_map is not None and "disk" in device_map.values():
    612     offload_state_dict = True
--> 613 load_checkpoint_in_model(
    614     model,
    615     checkpoint,
    616     device_map=device_map,
    617     offload_folder=offload_folder,
    618     dtype=dtype,
    619     offload_state_dict=offload_state_dict,
    620     offload_buffers=offload_buffers,
    621     strict=strict,
    622 )
    623 if device_map is None:
    624     return model

File [/usr/local/lib/python3.8/dist-packages/accelerate/utils/modeling.py:1878](http://48.217.82.250/lab/tree/usr/local/lib/python3.8/dist-packages/accelerate/utils/modeling.py#line=1877), in load_checkpoint_in_model(model, checkpoint, device_map, offload_folder, dtype, offload_state_dict, offload_buffers, keep_in_fp32_modules, offload_8bit_bnb, strict)
   1876                 offload_weight(param, param_name, state_dict_folder, index=state_dict_index)
   1877         else:
-> 1878             set_module_tensor_to_device(
   1879                 model,
   1880                 param_name,
   1881                 param_device,
   1882                 value=param,
   1883                 dtype=new_dtype,
   1884                 fp16_statistics=fp16_statistics,
   1885             )
   1887 # Force Python to clean up.
   1888 del loaded_checkpoint

File [/usr/local/lib/python3.8/dist-packages/accelerate/utils/modeling.py:341](http://48.217.82.250/lab/tree/usr/local/lib/python3.8/dist-packages/accelerate/utils/modeling.py#line=340), in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
    338     tensor_name = splits[-1]
    340 if tensor_name not in module._parameters and tensor_name not in module._buffers:
--> 341     raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
    342 is_buffer = tensor_name in module._buffers
    343 old_value = getattr(module, tensor_name)

ValueError: FluxTransformer2DModel(
  (pos_embed): FluxPosEmbed()
  (time_text_embed): CombinedTimestepGuidanceTextProjEmbeddings(
    (time_proj): Timesteps()
    (timestep_embedder): TimestepEmbedding(
      (linear_1): Linear(in_features=256, out_features=3072, bias=True)
      (act): SiLU()
      (linear_2): Linear(in_features=3072, out_features=3072, bias=True)
    )
    (guidance_embedder): TimestepEmbedding(
      (linear_1): Linear(in_features=256, out_features=3072, bias=True)
      (act): SiLU()
      (linear_2): Linear(in_features=3072, out_features=3072, bias=True)
    )
    (text_embedder): PixArtAlphaTextProjection(
      (linear_1): Linear(in_features=768, out_features=3072, bias=True)
      (act_1): SiLU()
      (linear_2): Linear(in_features=3072, out_features=3072, bias=True)
    )
  )
  (context_embedder): Linear(in_features=4096, out_features=3072, bias=True)
  (x_embedder): Linear(in_features=64, out_features=3072, bias=True)
  (transformer_blocks): ModuleList(
    (0-18): 19 x FluxTransformerBlock(
      (norm1): AdaLayerNormZero(
        (silu): SiLU()
        (linear): Linear(in_features=3072, out_features=18432, bias=True)
        (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
      )
      (norm1_context): AdaLayerNormZero(
        (silu): SiLU()
        (linear): Linear(in_features=3072, out_features=18432, bias=True)
        (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
      )
      (attn): Attention(
        (norm_q): RMSNorm()
        (norm_k): RMSNorm()
        (to_q): Linear(in_features=3072, out_features=3072, bias=True)
        (to_k): Linear(in_features=3072, out_features=3072, bias=True)
        (to_v): Linear(in_features=3072, out_features=3072, bias=True)
        (add_k_proj): Linear(in_features=3072, out_features=3072, bias=True)
        (add_v_proj): Linear(in_features=3072, out_features=3072, bias=True)
        (add_q_proj): Linear(in_features=3072, out_features=3072, bias=True)
        (to_out): ModuleList(
          (0): Linear(in_features=3072, out_features=3072, bias=True)
          (1): Dropout(p=0.0, inplace=False)
        )
        (to_add_out): Linear(in_features=3072, out_features=3072, bias=True)
        (norm_added_q): RMSNorm()
        (norm_added_k): RMSNorm()
      )
      (norm2): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
      (ff): FeedForward(
        (net): ModuleList(
          (0): GELU(
            (proj): Linear(in_features=3072, out_features=12288, bias=True)
          )
          (1): Dropout(p=0.0, inplace=False)
          (2): Linear(in_features=12288, out_features=3072, bias=True)
        )
      )
      (norm2_context): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
      (ff_context): FeedForward(
        (net): ModuleList(
          (0): GELU(
            (proj): Linear(in_features=3072, out_features=12288, bias=True)
          )
          (1): Dropout(p=0.0, inplace=False)
          (2): Linear(in_features=12288, out_features=3072, bias=True)
        )
      )
    )
  )
  (single_transformer_blocks): ModuleList(
    (0-37): 38 x FluxSingleTransformerBlock(
      (norm): AdaLayerNormZeroSingle(
        (silu): SiLU()
        (linear): Linear(in_features=3072, out_features=9216, bias=True)
        (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
      )
      (proj_mlp): Linear(in_features=3072, out_features=12288, bias=True)
      (act_mlp): GELU(approximate='tanh')
      (proj_out): Linear(in_features=15360, out_features=3072, bias=True)
      (attn): Attention(
        (norm_q): RMSNorm()
        (norm_k): RMSNorm()
        (to_q): Linear(in_features=3072, out_features=3072, bias=True)
        (to_k): Linear(in_features=3072, out_features=3072, bias=True)
        (to_v): Linear(in_features=3072, out_features=3072, bias=True)
      )
    )
  )
  (norm_out): AdaLayerNormContinuous(
    (silu): SiLU()
    (linear): Linear(in_features=3072, out_features=6144, bias=True)
    (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
  )
  (proj_out): Linear(in_features=3072, out_features=64, bias=True)
) does not have a parameter or a buffer named _flat_param.
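
The error above suggests that the saved checkpoint contains a key whose last component is _flat_param, which looks like an FSDP flattened parameter rather than a regular FluxTransformer2DModel weight. A minimal sketch for spotting such keys is shown below; the helper name and the example keys are hypothetical, and real keys would have to be read from the saved checkpoint.

```python
def find_flat_param_keys(keys):
    """Return checkpoint keys whose last dotted component is `_flat_param`."""
    # Mirrors the `splits[-1]` lookup visible in the traceback above.
    return [k for k in keys if k.split(".")[-1] == "_flat_param"]


# Hypothetical key names, for illustration only; real keys would come from
# the saved checkpoint (for example, the keys of its loaded state dict).
example_keys = [
    "x_embedder.weight",
    "transformer_blocks.0._flat_param",
    "proj_out.bias",
]
suspicious = find_flat_param_keys(example_keys)
print(suspicious)
```

If the list is non-empty, the checkpoint was probably written from the FSDP-wrapped module and would need to be gathered into a full state dict before from_pretrained can load it.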

@a-r-r-o-w
Member

So do I need to set NCCL_TIMEOUT env variable to some large number? 1200 does not seem to be enough right now if it's the real cause.

[rank0]:[E923 15:13:12.047707465 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1080, OpType=_ALLGATHER_BASE, NumelIn=169915648, NumelOut=339831296, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.

I don't think that environment variable is respected by accelerate. From a quick look at the accelerate codebase, I did not find anything that hinted at it being used. You will have to set the timeout by passing an InitProcessGroupKwargs object, as done here. In my experience, FSDP does take a long time to save the model, so something like 1800 should be a safe number.
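
As a rough sketch of that suggestion: accelerate exposes InitProcessGroupKwargs, which accepts a timedelta timeout and is passed to Accelerator through kwargs_handlers. The helper name build_accelerator and the constant SAVE_TIMEOUT are hypothetical; the training script builds its Accelerator with additional arguments that would need to be kept.

```python
from datetime import timedelta

# 1800 seconds is the value suggested above; tune it to how long saving takes.
SAVE_TIMEOUT = timedelta(seconds=1800)


def build_accelerator(timeout=SAVE_TIMEOUT):
    """Create an Accelerator whose process group uses a longer collective timeout."""
    # Imported lazily so this module can be imported without accelerate installed.
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    return Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timeout)])
```

The watchdog message quoted above reports Timeout(ms)=600000, so a 1800-second timeout would triple the time allowed for the all-gather during saving.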

@kopyl
Contributor Author

kopyl commented Sep 25, 2024

@a-r-r-o-w this did not help :(
I shared the details here.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Oct 24, 2024
@leisuzz
Contributor

leisuzz commented Nov 1, 2024

Can you try #9829? I saved memory by implementing this :)

@github-actions github-actions bot removed the stale Issues that haven't received updates label Nov 1, 2024
@kopyl
Contributor Author

kopyl commented Nov 11, 2024

@leisuzz Thanks :)
So how much VRAM do you need with your changes?)

@leisuzz
Contributor

leisuzz commented Nov 11, 2024

@kopyl Before the modification, it just got stuck at almost the same VRAM usage when training with one GPU. After the modification, it saves quickly. But I didn't check the exact number.

@github-actions github-actions bot added the stale Issues that haven't received updates label Dec 6, 2024
@kopyl
Contributor Author

kopyl commented Dec 9, 2024

@leisuzz still getting GPU OOM errors with 95 GB of VRAM...

@github-actions github-actions bot removed the stale Issues that haven't received updates label Dec 10, 2024
@leisuzz
Contributor

leisuzz commented Dec 17, 2024

@kopyl I think this issue is related to accelerator. My changes save memory during training, but your issue is in the final saving step. Can you change "if accelerator.is_main_process:" to "if accelerator.is_main_process or accelerator.distributed_type == DistributedType.DEEPSPEED:" and see how it goes?

@leisuzz
Contributor

leisuzz commented Dec 17, 2024

@kopyl I mean in the saving section. To "if accelerator.is_main_process:" add "or accelerator.distributed_type == DistributedType.DEEPSPEED", and don't forget to import DistributedType from accelerate.
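
A small sketch of the suggested condition follows. To keep it runnable without accelerate, it uses a stand-in DistributedType enum and a SimpleNamespace in place of a real Accelerator; both are assumptions for illustration only, and the function name should_enter_save_block is hypothetical. In the training script the real DistributedType would be imported from accelerate.

```python
from enum import Enum
from types import SimpleNamespace


class DistributedType(str, Enum):
    # Stand-in for accelerate's DistributedType so the sketch runs on its own.
    NO = "NO"
    FSDP = "FSDP"
    DEEPSPEED = "DEEPSPEED"


def should_enter_save_block(accelerator):
    """Condition proposed above for guarding the final saving section."""
    return (
        accelerator.is_main_process
        or accelerator.distributed_type == DistributedType.DEEPSPEED
    )


# Fake accelerator objects standing in for individual ranks.
deepspeed_rank1 = SimpleNamespace(
    is_main_process=False, distributed_type=DistributedType.DEEPSPEED
)
fsdp_rank1 = SimpleNamespace(
    is_main_process=False, distributed_type=DistributedType.FSDP
)
print(should_enter_save_block(deepspeed_rank1), should_enter_save_block(fsdp_rank1))
```

With this condition every DeepSpeed rank enters the saving section, while a non-main FSDP rank still skips it, so on its own it may not change behavior for the FSDP config posted at the top of the issue.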

@github-actions github-actions bot added the stale Issues that haven't received updates label Jan 10, 2025
@kopyl
Contributor Author

kopyl commented Jan 21, 2025

@leisuzz I no longer have access to the server, so I can't check it now, sorry.

But I decided to switch to Kohya-ss sd-scripts for training. It turned out to be much more stable.

@github-actions github-actions bot removed the stale Issues that haven't received updates label Jan 21, 2025
@github-actions github-actions bot added the stale Issues that haven't received updates label Feb 15, 2025