Training stuck at epoch 1 steps 0 #3081

Open
Red-Scarff opened this issue Feb 12, 2025 · 1 comment

Comments

@Red-Scarff

My LoRA training is stuck at epoch 1, step 0. What confuses me even more is that GPU utilization sits at 100%, yet the training progress bar does not advance at all, even after waiting for an hour.
I have already trained LoRAs through kohya_ss several times, and I haven't changed anything since the last successful run.
Here is my command:

20:01:31-202820 INFO Start training LoRA Standard ...
20:01:31-203676 INFO Validating lr scheduler arguments...
20:01:31-204184 INFO Validating optimizer arguments...
20:01:31-204674 INFO Validating /home/tione/notebook/kohya_ss/logs existence and writability... SUCCESS
20:01:31-205191 INFO Validating /home/tione/notebook/kohya_ss/outputs/Batik existence and writability...
SUCCESS
20:01:31-205741 INFO Validating
/home/tione/notebook/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0.safeten
sors existence... SUCCESS
20:01:31-206287 INFO Validating /home/tione/notebook/lora/Batik/images existence... SUCCESS
20:01:31-206806 INFO Folder 4_Bat Batik: 4 repeats found
20:01:31-207413 INFO Folder 4_Bat Batik: 195 images found
20:01:31-207884 INFO Folder 4_Bat Batik: 195 * 4 = 780 steps
20:01:31-208350 INFO Regulatization factor: 1
20:01:31-208790 INFO Total steps: 780
20:01:31-209197 INFO Train batch size: 16
20:01:31-209599 INFO Gradient accumulation steps: 1
20:01:31-210021 INFO Epoch: 10
20:01:31-210420 INFO Max train steps: 1600
20:01:31-210834 INFO stop_text_encoder_training = 0
20:01:31-211250 INFO lr_warmup_steps = 160
20:01:31-212065 INFO Saving training config to
/home/tione/notebook/kohya_ss/outputs/Batik/Batik_v1_20250212-200131.json...
20:01:31-213038 INFO Executing command: /root/miniforge3/envs/kohya_ss/bin/accelerate launch --dynamo_backend
no --dynamo_mode default --gpu_ids 0,1,2,3,4,5,6,7 --mixed_precision bf16 --multi_gpu
--num_processes 8 --num_machines 1 --num_cpu_threads_per_process 2
/home/tione/notebook/kohya_ss/sd-scripts/sdxl_train_network.py --config_file
/home/tione/notebook/kohya_ss/outputs/Batik/config_lora-20250212-200131.toml
--network_train_unet_only
20:01:31-214494 INFO Command executed.

And here is where it gets stuck:

running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 780
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 7
num epochs / epoch数: 10
batch size per device / バッチサイズ: 16
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 70
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
steps: 0%| | 0/70 [00:00<?, ?it/s]2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703

epoch 1/10
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703

Can anyone help me? Or do you need more info and logs to diagnose this?
Thanks a lot.
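
If more NCCL detail would help, I can re-run the exact command from the log above with NCCL debug logging turned on. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, not kohya_ss options; only the two variables are added, the rest of the command is unchanged:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL \
  /root/miniforge3/envs/kohya_ss/bin/accelerate launch --dynamo_backend no --dynamo_mode default \
  --gpu_ids 0,1,2,3,4,5,6,7 --mixed_precision bf16 --multi_gpu --num_processes 8 --num_machines 1 \
  --num_cpu_threads_per_process 2 /home/tione/notebook/kohya_ss/sd-scripts/sdxl_train_network.py \
  --config_file /home/tione/notebook/kohya_ss/outputs/Batik/config_lora-20250212-200131.toml \
  --network_train_unet_only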

@Red-Scarff
Author

After waiting for a long time, I finally got this log:

terminate called after throwing an instance of 'c10::DistBackendError'
[rank0]:[E212 20:45:01.825869991 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=25, OpType=ALLREDUCE, NumelIn=40960, NumelOut=40960, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5e9e96c1b6 in /root/miniforge3/envs/kohya_ss/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f5e4d029c74 in /root/miniforge3/envs/kohya_ss/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f5e4d02b7d0 in /root/miniforge3/envs/kohya_ss/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f5e4d02c6ed in /root/miniforge3/envs/kohya_ss/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f5e9f13d5c0 in /root/miniforge3/envs/kohya_ss/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7f5e9ff7dac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f5ea000f850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

The good news is that I can still train a LoRA normally on a single GPU, so the problem may be related to multi-GPU communication?
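
To check whether the eight GPUs can complete a plain NCCL all-reduce outside of kohya_ss, I could also run a minimal torch.distributed test. This is only a sketch (the file name nccl_check.py and the torchrun launch line are my own, not part of the repo); if it also hangs and hits the same 600000 ms timeout, the problem is likely in the NCCL / driver / interconnect layer rather than in the training scripts:

# nccl_check.py -- minimal NCCL all-reduce sanity check (a sketch, not part of kohya_ss)
# Launch with: torchrun --nproc_per_node=8 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun exports LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Every rank contributes a tensor filled with its own rank; after the
    # all-reduce each rank should hold the sum 0 + 1 + ... + (world_size - 1).
    t = torch.full((1024,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    expected = sum(range(dist.get_world_size()))
    print(f"rank {dist.get_rank()}: all_reduce ok = {bool((t == expected).all())}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()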
