Dreambooth Flux training failed on saving a checkpoint #9500
Comments
cc: @sayakpaul |
Cc: @linoytsaban |
@linoytsaban I can get on a call to debug it together on my hardware if needed :) |
The default NCCL timeout duration is 600 seconds: here. Sometimes validation on multiple prompts, or saving an FSDP model, can take longer than this. I would suggest increasing this timeout to 1800 seconds, which usually fixes any timeout problems for me. You can do this by:

```diff
+ from accelerate.utils import InitProcessGroupKwargs
+ from datetime import timedelta
  ...
  accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
  ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
+ init_kwargs = InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=1800))
  accelerator = Accelerator(
      gradient_accumulation_steps=args.gradient_accumulation_steps,
      mixed_precision=args.mixed_precision,
      log_with=args.report_to,
      project_config=accelerator_project_config,
-     kwargs_handlers=[ddp_kwargs],
+     kwargs_handlers=[ddp_kwargs, init_kwargs],
  )
```

You might, sometimes, also have communication timeouts when using multi-GPU training. I've found using `NCCL_P2P_DISABLE=1` helpful for that. |
So I should have NCCL_P2P_DISABLE=1 as an env variable, correct? Like NCCL_P2P_DISABLE=1 accelerate ... |
That should be used when/if you experience any communication timeouts (but should be safe to use anyway). Currently, you're experiencing stale timeouts (because allgather did not happen for 600 seconds) which should be fixable, hopefully, by passing InitProcessGroupKwargs with a timeout of 1800 seconds. |
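A minimal sketch (not from the thread) of how both suggestions could be combined at the top of the training script; this is illustrative rather than the exact Diffusers example code, and it assumes NCCL is only initialized once the `Accelerator` is constructed:

```python
import os
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# NCCL reads this env var when the process group is initialized, so setting it
# here (before the Accelerator is created) has the same effect as prefixing the
# launch command with NCCL_P2P_DISABLE=1.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")

# Raise the collective-communication timeout from the 600 s default to 1800 s
# so long validation or checkpoint-saving phases do not trigger a stale timeout.
init_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=1800))

accelerator = Accelerator(kwargs_handlers=[init_kwargs])
```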
@a-r-r-o-w setting a timeout to 3600 sec did not help :( I launched the training with this command:
It created an empty directory at the specified path, then it timed out. Logs after the timeout: https://pastebin.com/QU1mJP8U. Then I tried running the command with `NCCL_P2P_DISABLE=1` set.
The same error. It did not save anything for an hour. Please share a config for training Flux on 1x H100 NVL (95 GB VRAM) that makes the most of the single GPU before offloading to the CPU. |
Just tried training on a single H100 with this accelerate config:
And I got this error:
|
@a-r-r-o-w I just tried a good old SD 1.5 DreamBooth run on a single H100 with this command:
The training went smoothly and saved the checkpoint. I'm wondering what the cause might be. Is it my accelerate config, a faulty H100, or something else? Do you have any ideas on how I can debug it? |
@a-r-r-o-w I just tried running SD 1.5 DreamBooth training with a basic config using this command:
The config looks like this:
So yeah, seems like something is going on with my config. |
huggingface/accelerate#2787 might be relevant. |
@a-r-r-o-w with the default accelerate config the DreamBooth Flux training does not even start. I'm getting this error:
|
@sayakpaul with your config I still can't run the training. I'm getting this error: |
@a-r-r-o-w I commented out `if transformer.config.guidance_embeds:` and now I run out of memory. I have 2x H100 NVL and it's still not enough. Please share an accelerate config I can use to run the training without running out of memory or having to rent more powerful GPU servers. I'd really appreciate it. |
@sayakpaul same thing with your config. I'm running out of memory :( |
I'm sorry for the inconvenience this causes. Our training scripts serve as minimal examples of training and are not the end solution for training with different configurations. They are usually tested only on basic uncompiled/compiled single-GPU training scenarios. The expectation is that people who want to train seriously will adapt them to their use cases and make the best of them. So things like DeepSpeed/FSDP may not work out of the box and might require extra effort on your end to make them compatible. I think tailoring the script to your needs is the best way to go about FSDP or any other training configuration.
This happens because DeepSpeed/FSDP wrap the underlying object in a new class. You can see here how it's done for DeepSpeed. You might have to do something similar to access the underlying config object when using FSDP. |
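A hedged sketch of that unwrapping, in case it helps; `transformer` follows the training script, but the exact wrapper layout depends on the accelerate/FSDP versions in use, so treat this as an illustrative example rather than the official fix:

```python
def get_transformer_config(accelerator, transformer):
    """Access the original model's config through DDP/DeepSpeed/FSDP wrappers."""
    # Accelerate can peel off the wrappers it applied itself.
    model = accelerator.unwrap_model(transformer)
    # Some wrappers still expose the original model via `.module`.
    model = getattr(model, "module", model)
    return model.config

# Instead of `transformer.config.guidance_embeds`, which fails on the wrapper:
# if get_transformer_config(accelerator, transformer).guidance_embeds:
#     ...
```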
@a-r-r-o-w Thank you. Do you have any idea why with FSDP it doesn't save a model with |
@a-r-r-o-w also it would be nice to see the exact config the training was tested on, so I can reproduce it. Both hardware-wise and software-wise. |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Can you try #9829? I saved memory by implementing this :) |
Is this still relevant? From what I understand, this issue happens when training Flux with LoRA and FSDP? |
@sayakpaul I switched to training with kohya, but yeah, there are still issues, both when saving checkpoints and with CUDA OOM errors. |
Describe the bug
I run the training but get this error:
Reproduction
Run accelerate config
Logs
System Info
Ubuntu 20.04
x2 NVIDIA H100
CUDA 12.2
torch==2.4.1
torchvision==0.19.1
Diffusers commit: ba5af5a
Who can help?
No response