on_save_checkpoint callbacks runs in rank zero only #3545
Comments
With this thread I wanted to raise awareness of a limitation in the callback API and a proposed workaround. If the code complexity isn't palatable, then we should document the limitations of the API. Removing the decorator from It's also unclear what we need to do for the model checkpoint's own definitions of Note, this is really tricky because someone could define a different model checkpoint callback, and that callback might not have all these cc @jeremyjordan wdyt about this?
@williamFalcon PTAL
#3688 finished this
🐛 Bug
If any callback implements `on_save_checkpoint`, then that function runs only in the rank zero worker. I think this is suboptimal, as you might want to do some communication across workers before saving state.
The lineage of calls here is:
I think this could be avoided with more judicious usage of `rank_zero_only`. The main benefit of `rank_zero_only` in the model checkpoint callback is to avoid redundant file I/O. For saving the checkpoint, that is taken care of by this check.
Other file I/O in the model checkpoint callback could be similarly guarded, and we should remove the decorator from `on_validation_end`.
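
To make the difference concrete, here is a minimal, framework-free sketch of the two guarding strategies. Everything in it is hypothetical and for illustration only: `make_rank_zero_only` is a stand-in for the real `rank_zero_only` decorator, and `ToyCallback` and `run_validation_end` are not part of the library. It only assumes that a decorated function becomes a no-op on ranks other than zero.

```python
import functools


def make_rank_zero_only(rank):
    """Hypothetical stand-in for rank_zero_only: wrapped functions are no-ops unless rank == 0."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            if rank == 0:
                return fn(*args, **kwargs)
            return None
        return wrapped
    return decorator


class ToyCallback:
    """Hypothetical user callback that wants to run on every worker."""

    def __init__(self):
        self.calls = 0

    def on_save_checkpoint(self):
        # A real callback might gather state across workers here.
        self.calls += 1
        return {"calls": self.calls}


def run_validation_end(rank, decorate_whole_hook):
    """Simulate one worker reaching the end of validation.

    Returns (callback_hook_ran, file_written) for that worker.
    """
    callback = ToyCallback()
    written = []
    rank_zero_only = make_rank_zero_only(rank)

    def raw_write(state):
        written.append(state)

    # Strategy B guards only the file I/O; strategy A leaves it unguarded
    # because the whole hook is guarded instead.
    write = raw_write if decorate_whole_hook else rank_zero_only(raw_write)

    def on_validation_end():
        state = callback.on_save_checkpoint()
        write(state)

    if decorate_whole_hook:
        # Strategy A: the decorator wraps the whole hook, so everything
        # downstream of it is skipped on non-zero ranks.
        on_validation_end = rank_zero_only(on_validation_end)

    on_validation_end()
    return callback.calls == 1, len(written) == 1
```

With the whole hook decorated, a worker at rank 1 never reaches `on_save_checkpoint`. With only the write guarded, the callback runs on every rank while the file is still written once, by rank 0.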