DeepSpeed Accelerator can NOT save optimizer state #1247
Issue is updated!
While working on the above problem, I ran into another error. Same environment, but a different failure:
Related issue: deepspeed strategy can't save checkpoint, TypeError: cannot pickle torch._C._distributed_c10d.ProcessGroup object
I've updated the code the same way as in huggingface/diffusers#2606. However, I have no idea how to fix this.
Same as title.
I'm still investigating. With the following modification, the scripts can save the optimizer state, but only for a small dataset; I don't know why it does not work with a big dataset. Error messages will be attached here.
71e2c91330a9d866ec05cdd10584bbb962896a99
train_network.py
Saving the model itself works normally. Here is the corresponding code in sd-scripts. This structure is the same as save_every_n_epoch.
Analysis: only rank 0 (GPU 0, or cuda:0) ever attempts to save optimizer states. With ZeRO at stage above 0, the optimizer state is partitioned across all GPUs, so inside the is_main_process block the accelerator waits forever for the remaining GPUs (ranks 1, 2, 3), which never try to save the optimizer state. The NCCL group therefore raises a timeout error. Saving the model itself is not a problem.
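To make the deadlock concrete, here is a minimal sketch of the failing pattern, assuming a standard accelerate training loop (the variable names and path are illustrative, not the exact sd-scripts code):

```python
from accelerate import Accelerator

accelerator = Accelerator()  # assume DeepSpeed ZeRO stage >= 1 is configured
output_dir = "checkpoints/step-1000"  # illustrative path

# BROKEN: under ZeRO, save_state() is a collective operation that gathers
# each rank's optimizer shard. Gating it on is_main_process means ranks
# 1..N-1 never join the collective, so rank 0 blocks inside save_state()
# until the NCCL group times out.
if accelerator.is_main_process:
    accelerator.save_state(output_dir)
```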
Related issue: get stuck when save_state using DeepSpeed backend under training train_text_to_image_lora
Fix: move the save_state call out of the is_main_process block, as sketched below.
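A minimal sketch of the change, under the same illustrative assumptions as above: every rank calls save_state() so the ZeRO-partitioned optimizer state can be gathered collectively, and only rank-0-only work stays behind the guard.

```python
from accelerate import Accelerator

accelerator = Accelerator()
output_dir = "checkpoints/step-1000"  # illustrative path

# FIXED: all ranks participate in the collective save.
accelerator.wait_for_everyone()
accelerator.save_state(output_dir)

# Only rank-0-specific work (logging, sampling, ...) stays behind the guard.
if accelerator.is_main_process:
    print(f"state saved to {output_dir}")
```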
This is an ad-hoc modification to work around the problem. Logs of the run are attached here.
After modifying both the save_state block and the save_model_hook function, sd-scripts is now able to save the optimizer state when deepspeed=true, at least with a small dataset. A sketch of the save_model_hook side of the change follows.
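For reference, a hedged sketch of the save_model_hook change, assuming the hook signature used by accelerate's register_save_state_pre_hook (the same pattern as the diffusers scripts referenced above); the model names and save paths are illustrative. The DeepSpeed-relevant detail is that the weights list can be empty because the engine manages the state dict itself, so it must be popped defensively.

```python
import os
from accelerate import Accelerator

accelerator = Accelerator()

def save_model_hook(models, weights, output_dir):
    # Only the main process writes model files; the optimizer state is
    # handled separately by accelerator.save_state() on all ranks.
    if accelerator.is_main_process:
        for i, model in enumerate(models):
            # Illustrative: assumes HF-style models with save_pretrained().
            model.save_pretrained(os.path.join(output_dir, f"model_{i}"))
    # Under DeepSpeed, `weights` may be empty because the engine owns the
    # state dict, so drain it defensively instead of popping unconditionally.
    while weights:
        weights.pop()

accelerator.register_save_state_pre_hook(save_model_hook)
```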