My LoRA training got stuck at epoch 1, step 0. What confuses me even more is that GPU utilization sits at 100%, yet the training progress bar did not advance at all, even after waiting for an hour.
By the way, I have already trained LoRAs through kohya_ss several times, and I have not changed anything since the last successful run.
Here is my command:
20:01:31-202820 INFO Start training LoRA Standard ...
20:01:31-203676 INFO Validating lr scheduler arguments...
20:01:31-204184 INFO Validating optimizer arguments...
20:01:31-204674 INFO Validating /home/tione/notebook/kohya_ss/logs existence and writability... SUCCESS
20:01:31-205191 INFO Validating /home/tione/notebook/kohya_ss/outputs/Batik existence and writability... SUCCESS
20:01:31-205741 INFO Validating /home/tione/notebook/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0.safetensors existence... SUCCESS
20:01:31-206287 INFO Validating /home/tione/notebook/lora/Batik/images existence... SUCCESS
20:01:31-206806 INFO Folder 4_Bat Batik: 4 repeats found
20:01:31-207413 INFO Folder 4_Bat Batik: 195 images found
20:01:31-207884 INFO Folder 4_Bat Batik: 195 * 4 = 780 steps
20:01:31-208350 INFO Regulatization factor: 1
20:01:31-208790 INFO Total steps: 780
20:01:31-209197 INFO Train batch size: 16
20:01:31-209599 INFO Gradient accumulation steps: 1
20:01:31-210021 INFO Epoch: 10
20:01:31-210420 INFO Max train steps: 1600
20:01:31-210834 INFO stop_text_encoder_training = 0
20:01:31-211250 INFO lr_warmup_steps = 160
20:01:31-212065 INFO Saving training config to /home/tione/notebook/kohya_ss/outputs/Batik/Batik_v1_20250212-200131.json...
20:01:31-213038 INFO Executing command: /root/miniforge3/envs/kohya_ss/bin/accelerate launch --dynamo_backend no --dynamo_mode default --gpu_ids 0,1,2,3,4,5,6,7 --mixed_precision bf16 --multi_gpu --num_processes 8 --num_machines 1 --num_cpu_threads_per_process 2 /home/tione/notebook/kohya_ss/sd-scripts/sdxl_train_network.py --config_file /home/tione/notebook/kohya_ss/outputs/Batik/config_lora-20250212-200131.toml --network_train_unet_only
20:01:31-214494 INFO Command executed.
And here is where I got stuck:
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 780
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 7
num epochs / epoch数: 10
batch size per device / バッチサイズ: 16
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 70
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
steps: 0%| | 0/70 [00:00<?, ?it/s]2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
epoch 1/10
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
2025-02-12 20:05:44 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:703
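For what it's worth, the step counts in that log do line up with the 8-process launch. Here is my own arithmetic as a quick sketch (the rounding is my assumption, not something I checked in the sd-scripts source):

```python
# Rough check of where "total optimization steps: 70" comes from,
# using only the numbers printed in the logs above.
import math

images = 195            # "195 images found"
repeats = 4             # "4 repeats found"
batch_per_device = 16   # "batch size per device: 16"
num_processes = 8       # --num_processes 8 in the accelerate command
epochs = 10             # "num epochs: 10"

samples_per_epoch = images * repeats                                # 780
effective_batch = batch_per_device * num_processes                  # 16 * 8 = 128
batches_per_epoch = math.ceil(samples_per_epoch / effective_batch)  # 7
total_steps = batches_per_epoch * epochs                            # 70

print(batches_per_epoch, total_steps)  # -> 7 70
```

So the run itself looks configured the way I expect; it just never takes the first step.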
Can anyone help me? Or do you need more info or logs to check?
Thanks a lot.
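P.S. I'm happy to run more diagnostics. Since the hang happens right where all 8 processes would first synchronize, one thing I can try (my own idea, not something from the kohya_ss docs) is a minimal all_reduce test with the same launcher, to rule out an NCCL deadlock independent of the training script:

```python
# minimal_dist_check.py -- hypothetical standalone script, not part of kohya_ss.
# Launch it the same way the GUI launches training, e.g.:
#   accelerate launch --multi_gpu --num_processes 8 minimal_dist_check.py
# If this also hangs, the problem is in NCCL / multi-GPU communication
# rather than in sdxl_train_network.py itself.
import os

import torch
import torch.distributed as dist


def main():
    # accelerate launch sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT,
    # so the default env:// init method can pick them up.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # A single all_reduce across all processes -- the kind of collective
    # that deadlocks when one rank stalls.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Rerunning it (or the real training) with NCCL_DEBUG=INFO in the environment should also show what each rank is waiting on; I can attach that output if it helps.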