FLUX.1-dev dreambooth training problem on multigpu #9790

Closed
jyy-1998 opened this issue Oct 29, 2024 · 6 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

@jyy-1998

Describe the bug

I tried to use Accelerate + DeepSpeed to train FLUX, but every time, after about a dozen steps, an error occurs and the program crashes. Can anyone provide some help?

Reproduction

accelerate launch --config_file config.yaml train_flux.py \
  --pretrained_model_name_or_path="./FLUX.1-dev" \
  --resolution=1024 \
  --train_batch_size=1 \
  --output_dir="output0" \
  --num_train_epochs=10 \
  --checkpointing_steps=5000 \
  --validation_steps=100 \
  --max_train_steps=40001 \
  --learning_rate=4e-05 \
  --seed=12345 \
  --mixed_precision="fp16" \
  --revision="fp16" \
  --use_8bit_adam \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: 0,1
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
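
As a side note on the config (this is not claimed to be the cause of the crash): as far as I understand, zero3_init_flag only takes effect with ZeRO stage 3 and should be ignored at zero_stage: 2, and parameter offloading is likewise a stage-3 feature, since stage 2 only shards and offloads optimizer state. A stage-2 DeepSpeed block would therefore reduce to roughly this sketch:

deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu   # stage 2 offloads optimizer state only
  zero3_init_flag: false          # only meaningful for ZeRO stage 3
  zero_stage: 2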

Logs

the logs:
Steps: 0%| | 0/40001 [00:00<?, ?it/s]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:19:05,281] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
Steps: 0%| | 1/40001 [00:14<165:03:11, 14.85s/it, loss=0.707, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:19:18,776] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
Steps: 0%| | 2/40001 [00:28<156:09:31, 14.05s/it, loss=0.428, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:19:31,289] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
Steps: 0%| | 3/40001 [00:40<148:20:21, 13.35s/it, loss=0.652, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:19:43,569] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
Steps: 0%| | 4/40001 [00:53<143:38:06, 12.93s/it, loss=0.919, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:19:55,578] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
Steps: 0%| | 5/40001 [01:05<139:57:04, 12.60s/it, loss=0.439, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:20:07,647] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
Steps: 0%| | 6/40001 [01:17<137:57:02, 12.42s/it, loss=0.56, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:20:20,092] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
Steps: 0%| | 7/40001 [01:29<138:03:03, 12.43s/it, loss=0.694, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:20:31,513] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432, reducing to 16777216
Steps: 0%| | 8/40001 [01:41<134:29:31, 12.11s/it, loss=0.577, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:20:42,977] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216, reducing to 8388608
Steps: 0%| | 9/40001 [01:52<132:15:29, 11.91s/it, loss=0.363, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:20:54,396] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608, reducing to 4194304
Steps: 0%| | 10/40001 [02:03<130:35:01, 11.76s/it, loss=0.64, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:21:06,229] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304, reducing to 2097152
Steps: 0%| | 11/40001 [02:15<130:50:48, 11.78s/it, loss=0.574, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:21:17,911] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152, reducing to 1048576
Steps: 0%| | 12/40001 [02:27<130:30:42, 11.75s/it, loss=0.44, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-29 06:21:29,945] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576, reducing to 524288
Steps: 0%| | 13/40001 [02:39<131:28:04, 11.84s/it, loss=0.387, lr=0]Passing txt_ids 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 11739 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 11738) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
    deepspeed_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 786, in deepspeed_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_flux.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2024-10-29_06:22:05
  host      : task-20241029103910-30817
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 11738)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 11738
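
Two things stand out in the log: the fp16 dynamic loss scaler overflows on every step, so DeepSpeed skips each optimizer step, and the final exit code -9 means the process was killed with SIGKILL rather than failing on a Python exception. For the overflow part, a possible workaround, assuming the GPUs and the script support bfloat16, is to launch in bf16, which as far as I know does not use a loss scaler at all (mixed_precision in config.yaml would need to change to bf16 as well):

accelerate launch --config_file config.yaml train_flux.py \
  --pretrained_model_name_or_path="./FLUX.1-dev" \
  --mixed_precision="bf16" \
  ...   # remaining flags as in the Reproduction section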

System Info

deepspeed==0.14.4
accelerate==0.33.0
transformers==4.41.2

Who can help?

No response

@jyy-1998 added the bug (Something isn't working) label on Oct 29, 2024
@a-r-r-o-w
Member

Could you enable verbose logging with Accelerate (ref) and paste the logs? The output posted above does not contain any information that would help identify what the issue might be.
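
For example, something along these lines (assuming the script uses accelerate.logging.get_logger, as the diffusers training examples do; the remaining flags are the ones from the Reproduction section):

ACCELERATE_LOG_LEVEL=DEBUG NCCL_DEBUG=INFO \
  accelerate launch --config_file config.yaml train_flux.py \
  --pretrained_model_name_or_path="./FLUX.1-dev" \
  ...   # remaining flags unchanged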

@lyb369

lyb369 commented Oct 30, 2024

Maybe your CPU memory (host RAM) is not enough.
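
That would be consistent with the exit code: -9 is SIGKILL, which on Linux usually comes from the kernel OOM killer when host RAM runs out, and offloading DeepSpeed optimizer state to CPU for a roughly 12B-parameter model can easily need more than 100 GB of host RAM for full fine-tuning. One way to check on the training host (a suggestion only):

dmesg -T | grep -i -E "out of memory|killed process"
# and watch host RAM while training:
watch -n 5 free -h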

@sayakpaul
Member

sayakpaul commented Nov 1, 2024

I have investigated this before and I can confirm it works. See: #9278 (comment)

@leisuzz
Contributor

leisuzz commented Nov 1, 2024

Can you try #9829? I have saved memory by implementing this :)

@github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions bot added the stale (Issues that haven't received updates) label on Nov 28, 2024
@sayakpaul
Member

Going to close this because of lack of activity.
