Dreambooth Flux training failed on saving a checkpoint #9500
Comments
cc: @sayakpaul |
Cc: @linoytsaban |
@linoytsaban I can get on a call to debug it together on my hardware if needed :) |
The default NCCL timeout duration is 600 seconds: here. Sometimes validation on multiple prompts, or saving an FSDP model, can take longer than this. I would suggest increasing this timeout to 1800 seconds, which usually fixes any timeout problems for me. You can do this by:

```diff
+ from accelerate.utils import InitProcessGroupKwargs
+ from datetime import timedelta
  ...
  accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
  ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
+ init_kwargs = InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=1800))
  accelerator = Accelerator(
      gradient_accumulation_steps=args.gradient_accumulation_steps,
      mixed_precision=args.mixed_precision,
      log_with=args.report_to,
      project_config=accelerator_project_config,
-     kwargs_handlers=[ddp_kwargs],
+     kwargs_handlers=[ddp_kwargs, init_kwargs],
  )
```

You might, sometimes, also have communication timeouts when using multi-GPU training. I've found using `NCCL_P2P_DISABLE=1` helpful for that. |
So I should have NCCL_P2P_DISABLE=1 as an env variable, correct? Like NCCL_P2P_DISABLE=1 accelerate ... |
That should be used when/if you experience any communication timeouts (but should be safe to use anyway). Currently, you're experiencing stale timeouts (because allgather did not happen for 600 seconds) which should be fixable, hopefully, by passing InitProcessGroupKwargs with a timeout of 1800 seconds. |
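A minimal sketch (not from the thread) of how both suggestions could be combined at the top of the training script; this is illustrative rather than the exact Diffusers example code, and it assumes NCCL is only initialized once the `Accelerator` is constructed:

```python
import os
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# NCCL reads this env var when the process group is initialized, so setting it
# here (before the Accelerator is created) has the same effect as prefixing the
# launch command with NCCL_P2P_DISABLE=1.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")

# Raise the collective-communication timeout from the 600 s default to 1800 s
# so long validation or checkpoint-saving phases do not trigger a stale timeout.
init_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=1800))

accelerator = Accelerator(kwargs_handlers=[init_kwargs])
```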
@a-r-r-o-w setting a timeout to 3600 sec did not help :( I launched the training with this command:
It created an empty directory at the specified path, then it timed out. Logs after the timeout: https://pastebin.com/QU1mJP8U. Then I tried running the command with `NCCL_P2P_DISABLE=1` set.
The same error. It did not save anything for an hour. Please share a config for training Flux on 1x H100 NVL (95 GB VRAM) that makes the most of the single GPU before offloading to the CPU. |
Just tried training on a single H100 with this accelerate config:
And I got this error:
|
@a-r-r-o-w I just tried a good old SD 1.5 DreamBooth run on a single H100 with this command:
The training went smoothly and saved the checkpoint. I'm wondering what the cause might be. Is it my accelerate config, a faulty H100, or something else? Do you have any ideas on how I can debug it? |
@a-r-r-o-w I just tried running SD 1.5 DreamBooth training with a basic config using this command:
The config looks like this:
So yeah, seems like something is going on with my config. |
huggingface/accelerate#2787 might be relevant. |
@a-r-r-o-w with the default accelerate config the DreamBooth Flux training does not even start. I'm getting this error:
|
@sayakpaul with your config I still can't run the training. I'm getting this error: |
@a-r-r-o-w I commented out `if transformer.config.guidance_embeds:` and now I run out of memory. I have 2x H100 NVL and it's still not enough. Please share an accelerate config I can use to run the training without running out of memory or having to rent more powerful GPU servers. I'd really appreciate it. |
@sayakpaul same thing with your config. I'm running out of memory :( |
I'm sorry for the inconvenience this causes. Our training scripts serve as minimal examples of training and are not the end solution for training with different configurations. They are usually tested only on basic uncompiled/compiled single-GPU training scenarios. The expectation is that people who want to train seriously will adapt them to their use cases and make the best of them. So things like DeepSpeed/FSDP may not work out of the box and might require extra effort on your end to make them compatible. I think tailoring the script to your needs is the best way to go about FSDP or any other training configuration.
This happens because DeepSpeed/FSDP wrap the underlying object in a new class. You can see here how it's done for DeepSpeed. You might have to do something similar to access the underlying config object when using FSDP. |
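A hedged sketch of that unwrapping, in case it helps; `transformer` follows the training script, but the exact wrapper layout depends on the accelerate/FSDP versions in use, so treat this as an illustrative example rather than the official fix:

```python
def get_transformer_config(accelerator, transformer):
    """Access the original model's config through DDP/DeepSpeed/FSDP wrappers."""
    # Accelerate can peel off the wrappers it applied itself.
    model = accelerator.unwrap_model(transformer)
    # Some wrappers still expose the original model via `.module`.
    model = getattr(model, "module", model)
    return model.config

# Instead of `transformer.config.guidance_embeds`, which fails on the wrapper:
# if get_transformer_config(accelerator, transformer).guidance_embeds:
#     ...
```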
@a-r-r-o-w Thank you. Do you have any idea why with FSDP it doesn't save a model with |
@a-r-r-o-w also it would be nice to see the exact config the training was tested on, so I can reproduce it. Both hardware-wise and software-wise. |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Can you try #9829? I saved memory by implementing this :) |
Is this still relevant? From what I understand, this issue happens when training Flux with LoRA and FSDP? |
@sayakpaul I switched to training with kohya, but yeah, there are still issues, both when saving checkpoints and with CUDA OOM errors. |
Describe the bug
I run the training but get this error:
Reproduction
Run accelerate config
Logs
System Info
Ubuntu 20.04
x2 NVIDIA H100
CUDA 12.2
torch==2.4.1
torchvision==0.19.1
Diffusers commit: ba5af5a
Who can help?
No response