Dreambooth Flux training does not save a model for around 10-15 minutes #9501
Comments
Is it because it takes a lot of time to get the params from 2 GPUs + CPU? So do I need to set If you ask me how to set it, you just add a new line in your bash profile file ( |
I just tried removing
|
I don't think accelerate respects or implements that environment variable. From a quick look at the accelerate codebase, I found nothing hinting that it is used. You will have to set the timeout by passing an InitProcessGroupKwargs object, as done here. In my experience, FSDP does take a long time to save the model, so something like 1800 should be a safe number. |
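For reference, a minimal sketch of what passing that object could look like. It assumes the 1800 above is meant in seconds and that the training script builds its own Accelerator; the helper names `save_timeout` and `build_accelerator` are made up for illustration and are not part of diffusers or accelerate. This is not a confirmed fix for the hang.

```python
from datetime import timedelta


def save_timeout(seconds: int = 1800) -> timedelta:
    """Collective timeout; 1800 is the number suggested above, assumed to be seconds."""
    return timedelta(seconds=seconds)


def build_accelerator(seconds: int = 1800):
    """Hypothetical helper: create an Accelerator with a longer process-group timeout."""
    # Imported here so the helper above can be used without accelerate installed.
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    # The timeout is handed to the distributed process group at initialization.
    pg_kwargs = InitProcessGroupKwargs(timeout=save_timeout(seconds))
    return Accelerator(kwargs_handlers=[pg_kwargs])
```

In a training script, `accelerator = build_accelerator(1800)` would replace the plain `Accelerator(...)` call; any other arguments the script already passes there would need to be forwarded as well.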
@a-r-r-o-w this did not help :( |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Can you try #9829? I have saved memory by implementing this :) |
@leisuzz Thanks :) |
@kopyl Before the modification, it just got stuck at almost the same VRAM usage as when training with one GPU. After it, saving is fast. But I didn't check. |
@kopyl I think this issue is related to accelerator. My algo saves memory during training, but your issue is in the final saving step. Can you change |
@kopyl I mean in the saving section: add "if accelerator.is_main_process:" add |
@leisuzz I no longer have access to the server, so I can't check it now, sorry. I decided to switch to Kohya-ss sd-scripts for the training, which turned out to be much more stable. |
Describe the bug
This time I set the number of steps to 2 to make sure it correctly saves the model after an hour of training, but it does not.
Reproduction
Run
accelerate config
Logs
System Info
Ubuntu 20.04
x2 NVIDIA H100
CUDA 12.2
torch==2.4.1
torchvision==0.19.1
Diffusers commit: ba5af5aebbac0cc18168076a18836f175753d1c7
Who can help?
No response