
on_save_checkpoint callbacks run in rank zero only #3545

Closed
ananthsub opened this issue Sep 18, 2020 · 3 comments

ananthsub (Contributor) commented Sep 18, 2020

🐛 Bug

If any callback implements on_save_checkpoint, that function runs only on the rank-zero worker. I think this is suboptimal, as you might want to do some communication across workers before saving state.

The lineage of calls here is:

I think this could be avoided with more judicious usage of rank_zero_only. The main benefit of rank_zero_only in the model checkpoint callback is to avoid redundant file I/O. For saving the checkpoint itself, that is already taken care of by this check.

Other file I/O in the model checkpoint callback could be similarly guarded, and we should remove the decorator from on_validation_end.
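
For illustration, here is a minimal sketch of the guarded-I/O pattern being proposed. The class name, method bodies, and filename are hypothetical (not the actual ModelCheckpoint internals), and it assumes rank_zero_only is importable from pytorch_lightning.utilities as it was at the time:

```python
import os

from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities import rank_zero_only


class GuardedCheckpointCallback(Callback):
    """Hypothetical callback: hooks run on all ranks, only file I/O is guarded."""

    @rank_zero_only
    def _del_model(self, filepath):
        # Pure file I/O: safe to keep behind the catch-all decorator.
        if os.path.exists(filepath):
            os.remove(filepath)

    def on_validation_end(self, trainer, pl_module):
        # Runs on every rank, so cross-rank work (metric syncing, state
        # communication) can happen here before deciding to save.
        if trainer.is_global_zero:
            # Only the actual write is restricted to the global-zero rank.
            trainer.save_checkpoint("example.ckpt")
```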

@ananthsub added the bug and help wanted labels Sep 18, 2020
@Borda added the discussion label Sep 18, 2020
ananthsub (Contributor, Author) commented Sep 18, 2020

With this thread I wanted to raise awareness of a limitation in the callback API and propose a workaround. If the code complexity isn't palatable, then we should document the limitations of the API.
Making the implementation of on_validation_end include a few more explicit checks for trainer.is_global_zero, rather than relying on the catch-all decorator, isn't so bad. We can still apply the rank_zero_only decorator to functions that do only file I/O, like _del_model.

Removing the decorator from on_validation_end means we must also remove the decorator from on_pretrain_routine_start, because state set in on_pretrain_routine_start is later read in on_validation_end. If on_pretrain_routine_start runs on rank zero only, then there are uninitialized fields when on_validation_end runs on non-zero ranks.
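
To make the coupling concrete, a hypothetical callback with the same structure would fail like this (the field name and the bodies are illustrative, only the Lightning hook names are real):

```python
from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities import rank_zero_only


class CoupledStateCallback(Callback):
    """Hypothetical example of state set on rank zero but read on all ranks."""

    @rank_zero_only
    def on_pretrain_routine_start(self, trainer, pl_module):
        # Initialized on rank zero only.
        self.dirpath = trainer.default_root_dir

    def on_validation_end(self, trainer, pl_module):
        # If only this hook's decorator is removed, non-zero ranks never ran
        # on_pretrain_routine_start, so `self.dirpath` is missing here.
        print(self.dirpath)  # AttributeError on non-zero ranks
```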

It's also unclear what we need to do for the model checkpoint's own definition of on_save_checkpoint. In general, do all callbacks implementing on_save_checkpoint now need to check trainer.is_global_zero? Is that a dealbreaker?

Note, this is really tricky because someone could define a different model checkpoint callback, and that callback might not have all these rank_zero_only decorators. This means that other callbacks which define on_save_checkpoint would then have different behavior, because their execution depends on the model checkpoint callback: those functions could now run on all ranks. This could lead to some really tough bugs. IMO, the assumption when writing callbacks should be that the function will run on all ranks, and folks should add explicit checks in their code (e.g. trainer.is_global_zero) if they want to narrow it down further.
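
Under that convention, a user-defined callback implementing on_save_checkpoint might look like the sketch below. `some_running_stat` and the metadata file are hypothetical, and the hook signature reflects the API at the time of this issue:

```python
import torch
import torch.distributed as dist

from pytorch_lightning.callbacks import Callback


class AllRankSaveCallback(Callback):
    """Hypothetical callback written assuming on_save_checkpoint runs on all ranks."""

    def on_save_checkpoint(self, trainer, pl_module):
        # Cross-rank communication is only possible because every rank
        # reaches this hook, which is the point of dropping rank_zero_only.
        stat = torch.tensor([float(pl_module.some_running_stat)], device=pl_module.device)
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(stat, op=dist.ReduceOp.SUM)

        # Anything touching the filesystem stays explicitly rank-zero-only.
        if trainer.is_global_zero:
            with open("checkpoint_meta.txt", "w") as f:
                f.write(f"aggregated_stat={stat.item()}\n")

        # The returned dict may be stored as this callback's state in the
        # checkpoint, depending on the Lightning version.
        return {"aggregated_stat": stat.item()}
```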

cc @jeremyjordan wdyt about this?

@edenlightning added the design label Sep 18, 2020
edenlightning (Contributor) commented:

@williamFalcon PTAL

ananthsub added several commits to ananthsub/pytorch-lightning that referenced this issue, Sep 19–21, 2020
@edenlightning added this to the 0.9.x milestone Sep 23, 2020
ananthsub (Contributor, Author) commented:

#3688 finished this
