
on_save_checkpoint callbacks run in rank zero only #3545

Closed
ananthsub opened this issue Sep 18, 2020 · 3 comments

ananthsub (Contributor) commented Sep 18, 2020

🐛 Bug

If any callback implements on_save_checkpoint, that function runs only on the rank-zero worker. I think this is suboptimal, as you might want to do some communication across workers before saving state.

The lineage of calls here is:

I think this could be avoided with more judicious usage of rank_zero_only. The main benefit of rank_zero_only in the model checkpoint callback is to avoid redundant file I/O. For saving the checkpoint itself, that is already taken care of by this check.

Other file I/O in the model checkpoint callback could be similarly guarded, and we should remove the decorator from on_validation_end.
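
For illustration, here is a minimal sketch of the guarded-I/O pattern being proposed. The class name, method bodies, and filename are hypothetical (not the actual ModelCheckpoint internals), and it assumes rank_zero_only is importable from pytorch_lightning.utilities as it was at the time:

```python
import os

from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities import rank_zero_only


class GuardedCheckpointCallback(Callback):
    """Hypothetical callback: hooks run on all ranks, only file I/O is guarded."""

    @rank_zero_only
    def _del_model(self, filepath):
        # Pure file I/O: safe to keep behind the catch-all decorator.
        if os.path.exists(filepath):
            os.remove(filepath)

    def on_validation_end(self, trainer, pl_module):
        # Runs on every rank, so cross-rank work (metric syncing, state
        # communication) can happen here before deciding to save.
        if trainer.is_global_zero:
            # Only the actual write is restricted to the global-zero rank.
            trainer.save_checkpoint("example.ckpt")
```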

@ananthsub added the bug and help wanted labels Sep 18, 2020
@Borda added the discussion label Sep 18, 2020
ananthsub (Contributor, Author) commented Sep 18, 2020

With this thread I wanted to raise awareness of a limitation in the callback API and propose a workaround. If the code complexity isn't palatable, then we should document the limitations of the API.
Making the implementation of on_validation_end include a few more explicit checks for trainer.is_global_zero, rather than relying on the catch-all decorator, isn't so bad. We can still apply the rank_zero_only decorator to functions that do only file I/O, like _del_model.

Removing the decorator from on_validation_end means we must also remove the decorator from on_pretrain_routine_start, because state set in on_pretrain_routine_start is later read in on_validation_end. If on_pretrain_routine_start runs on rank zero only, then there are uninitialized fields when on_validation_end runs on non-zero ranks.
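
To make the coupling concrete, a hypothetical callback with the same structure would fail like this (the field name and the bodies are illustrative, only the Lightning hook names are real):

```python
from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities import rank_zero_only


class CoupledStateCallback(Callback):
    """Hypothetical example of state set on rank zero but read on all ranks."""

    @rank_zero_only
    def on_pretrain_routine_start(self, trainer, pl_module):
        # Initialized on rank zero only.
        self.dirpath = trainer.default_root_dir

    def on_validation_end(self, trainer, pl_module):
        # If only this hook's decorator is removed, non-zero ranks never ran
        # on_pretrain_routine_start, so `self.dirpath` is missing here.
        print(self.dirpath)  # AttributeError on non-zero ranks
```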

It's also unclear what we need to do for the model checkpoint's own definition of on_save_checkpoint. In general, do all callbacks implementing on_save_checkpoint now need to check trainer.is_global_zero? Is that a dealbreaker?

Note, this is really tricky because someone could define a different model checkpoint callback, and that callback might not have all these rank_zero_only decorators. This means that other callbacks which define on_save_checkpoint would then have different behavior, because their execution depends on the model checkpoint callback: those functions could now run on all ranks. This could lead to some really tough bugs. IMO, the assumption when writing callbacks should be that the function will run on all ranks, and folks should add explicit checks in their code (e.g. trainer.is_global_zero) if they want to narrow it down further.
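
Under that convention, a user-defined callback implementing on_save_checkpoint might look like the sketch below. `some_running_stat` and the metadata file are hypothetical, and the hook signature reflects the API at the time of this issue:

```python
import torch
import torch.distributed as dist

from pytorch_lightning.callbacks import Callback


class AllRankSaveCallback(Callback):
    """Hypothetical callback written assuming on_save_checkpoint runs on all ranks."""

    def on_save_checkpoint(self, trainer, pl_module):
        # Cross-rank communication is only possible because every rank
        # reaches this hook, which is the point of dropping rank_zero_only.
        stat = torch.tensor([float(pl_module.some_running_stat)], device=pl_module.device)
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(stat, op=dist.ReduceOp.SUM)

        # Anything touching the filesystem stays explicitly rank-zero-only.
        if trainer.is_global_zero:
            with open("checkpoint_meta.txt", "w") as f:
                f.write(f"aggregated_stat={stat.item()}\n")

        # The returned dict may be stored as this callback's state in the
        # checkpoint, depending on the Lightning version.
        return {"aggregated_stat": stat.item()}
```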

cc @jeremyjordan wdyt about this?

@edenlightning added the design label Sep 18, 2020
edenlightning (Contributor) commented:

@williamFalcon PTAL

ananthsub added several commits to ananthsub/pytorch-lightning that referenced this issue, Sep 19–21, 2020
@edenlightning added this to the 0.9.x milestone Sep 23, 2020
ananthsub (Contributor, Author) commented:

#3688 finished this
