Training stuck at epoch 1 steps 0 #3081

Open
Red-Scarff opened this issue Feb 12, 2025 · 1 comment

Comments

@Red-Scarff

My LoRA training is stuck at epoch 1, step 0. What confuses me even more is that GPU utilization sits at 100%, yet the training progress bar does not advance at all, even after waiting for an hour.
I have already trained LoRAs through kohya_ss several times, and I haven't changed anything since the last successful run.
Here is my command:

20:01:31-202820 INFO Start training LoRA Standard ...
20:01:31-203676 INFO Validating lr scheduler arguments...
20:01:31-204184 INFO Validating optimizer arguments...
20:01:31-204674 INFO Validating /home/tione/notebook/kohya_ss/logs existence and writability... SUCCESS
20:01:31-205191 INFO Validating /home/tione/notebook/kohya_ss/outputs/Batik existence and writability...
SUCCESS
20:01:31-205741 INFO Validating
/home/tione/notebook/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0.safeten
sors existence... SUCCESS
20:01:31-206287 INFO Validating /home/tione/notebook/lora/Batik/images existence... SUCCESS
20:01:31-206806 INFO Folder 4_Bat Batik: 4 repeats found
20:01:31-207413 INFO Folder 4_Bat Batik: 195 images found
20:01:31-207884 INFO Folder 4_Bat Batik: 195 * 4 = 780 steps
20:01:31-208350 INFO Regulatization factor: 1
20:01:31-208790 INFO Total steps: 780
20:01:31-209197 INFO Train batch size: 16
20:01:31-209599 INFO Gradient accumulation steps: 1
20:01:31-210021 INFO Epoch: 10
20:01:31-210420 INFO Max train steps: 1600
20:01:31-210834 INFO stop_text_encoder_training = 0
20:01:31-211250 INFO lr_warmup_steps = 160
20:01:31-212065 INFO Saving training config to
/home/tione/notebook/kohya_ss/outputs/Batik/Batik_v1_20250212-200131.json...
20:01:31-213038 INFO Executing command: /root/miniforge3/envs/kohya_ss/bin/accelerate launch --dynamo_backend
no --dynamo_mode default --gpu_ids 0,1,2,3,4,5,6,7 --mixed_precision bf16 --multi_gpu
--num_processes 8 --num_machines 1 --num_cpu_threads_per_process 2
/home/tione/notebook/kohya_ss/sd-scripts/sdxl_train_network.py --config_file
/home/tione/notebook/kohya_ss/outputs/Batik/config_lora-20250212-200131.toml
--network_train_unet_only
20:01:31-214494 INFO Command executed.

And here is where it gets stuck:

running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 780
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 7
num epochs / epoch数: 10
batch size per device / バッチサイズ: 16
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 70
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
steps: 0%| | 0/70 [00:00<?, ?it/s]2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703

epoch 1/10
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703

Can anyone help me? Or do you need more info and logs to diagnose this?
Thanks a lot.
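
If more NCCL detail would help, I can re-run the exact command from the log above with NCCL debug logging turned on. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, not kohya_ss options; only the two variables are added, the rest of the command is unchanged:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL \
  /root/miniforge3/envs/kohya_ss/bin/accelerate launch --dynamo_backend no --dynamo_mode default \
  --gpu_ids 0,1,2,3,4,5,6,7 --mixed_precision bf16 --multi_gpu --num_processes 8 --num_machines 1 \
  --num_cpu_threads_per_process 2 /home/tione/notebook/kohya_ss/sd-scripts/sdxl_train_network.py \
  --config_file /home/tione/notebook/kohya_ss/outputs/Batik/config_lora-20250212-200131.toml \
  --network_train_unet_only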

@Red-Scarff
Author

After waiting for a long time, I finally got this log:

terminate called after throwing an instance of 'c10::DistBackendError'
[rank0]:[E212 20:45:01.825869991 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=25, OpType=ALLREDUCE, NumelIn=40960, NumelOut=40960, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5e9e96c1b6 in /root/miniforge3/envs/kohya_ss/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f5e4d029c74 in /root/miniforge3/envs/kohya_ss/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f5e4d02b7d0 in /root/miniforge3/envs/kohya_ss/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f5e4d02c6ed in /root/miniforge3/envs/kohya_ss/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f5e9f13d5c0 in /root/miniforge3/envs/kohya_ss/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7f5e9ff7dac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f5ea000f850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

The good news is that I can still train a LoRA normally on a single GPU, so the problem may be related to multi-GPU communication?
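
To check whether the eight GPUs can complete a plain NCCL all-reduce outside of kohya_ss, I could also run a minimal torch.distributed test. This is only a sketch (the file name nccl_check.py and the torchrun launch line are my own, not part of the repo); if it also hangs and hits the same 600000 ms timeout, the problem is likely in the NCCL / driver / interconnect layer rather than in the training scripts:

# nccl_check.py -- minimal NCCL all-reduce sanity check (a sketch, not part of kohya_ss)
# Launch with: torchrun --nproc_per_node=8 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun exports LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Every rank contributes a tensor filled with its own rank; after the
    # all-reduce each rank should hold the sum 0 + 1 + ... + (world_size - 1).
    t = torch.full((1024,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    expected = sum(range(dist.get_world_size()))
    print(f"rank {dist.get_rank()}: all_reduce ok = {bool((t == expected).all())}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()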
