
[WIP] fix checkpointing on TPU #2726

Closed
wants to merge 1 commit

Conversation

matt-peters

What does this PR do?

Draft PR: fixes checkpointing on TPU, see #2700

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

@mergify mergify bot requested a review from a team July 27, 2020 16:45
Member

@Borda Borda left a comment


I don't think that it fixes the issue; let's wait for the TPU tests fix #2632

def on_validation_end(self, trainer, pl_module):
    # only run on main process
    if trainer.global_rank != 0:
Member


This may break DDP
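For context, a minimal sketch of the distinction this comment is pointing at, assuming the `use_tpu` and `global_rank` attributes that appear in the diff (illustration only, not the PR's actual code):

```python
class CheckpointSketch:
    """Illustration only: how the rank guard differs between TPU and DDP."""

    def on_validation_end(self, trainer, pl_module):
        if getattr(trainer, "use_tpu", False):
            # XLA: every process must reach the save call, because xm.save
            # synchronizes internally and only the master rank writes.
            self._save_model(trainer)
        elif trainer.global_rank == 0:
            # DDP / single device: only global rank 0 should write, so the
            # early-return guard shown above is what keeps DDP safe.
            self._save_model(trainer)

    def _save_model(self, trainer):
        # Placeholder for the real checkpoint-writing logic.
        pass
```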

@@ -261,22 +261,36 @@ def _atomic_save(self, checkpoint, filepath: str):
            This points to the file that the checkpoint will be stored in.
        """
        tmp_path = str(filepath) + ".part"
        torch.save(checkpoint, tmp_path)
        os.replace(tmp_path, filepath)
        if self.use_tpu:
Member


There is on_tpu
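
Putting the pieces of this hunk together, the behaviour under discussion is roughly the following sketch (using the `on_tpu` name suggested above; `xm.save` and `xm.is_master_ordinal` are real `torch_xla` calls, the surrounding function is assumed, not the merged implementation):

```python
import os

import torch

try:
    import torch_xla.core.xla_model as xm
    XLA_AVAILABLE = True
except ImportError:
    XLA_AVAILABLE = False


def atomic_save_sketch(checkpoint, filepath, on_tpu=False):
    """Sketch of an XLA-aware atomic save; not the merged implementation."""
    tmp_path = str(filepath) + ".part"
    if on_tpu and XLA_AVAILABLE:
        # xm.save moves tensors to CPU, synchronizes all processes, and only
        # the global master actually writes, so every rank must call it.
        xm.save(checkpoint, tmp_path, master_only=True, global_master=True)
        if xm.is_master_ordinal(local=False):
            os.replace(tmp_path, filepath)
    else:
        torch.save(checkpoint, tmp_path)
        os.replace(tmp_path, filepath)
```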

# non-XLA. In XLA, it has a barrier and internal logic to only
# save for rank==0, so need to call for all ranks. For non-XLA,
# it doesn't have rank==0 logic so only call for rank==0
if self.use_tpu:
Member


Why not `or`?

@mergify mergify bot requested a review from a team July 27, 2020 17:29
@lezwon
Contributor

lezwon commented Jul 27, 2020

@matt-peters I think we will require an on_tpu check here too.

return

# To get checkpointing working on TPU, need to call _save_model
# for all ranks, to avoid deadlocks. Assuming save_function is mapped
Contributor

@ibeltagy ibeltagy Jul 27, 2020


nit: without `if trainer.global_rank != 0:` and `@rank_zero_only`, all processes write to the log, making it a little messy.
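
A minimal sketch of the pattern being asked for, assuming the `rank_zero_only` decorator from `pytorch_lightning.utilities` (the logging helper itself is hypothetical):

```python
import logging

from pytorch_lightning.utilities import rank_zero_only

log = logging.getLogger(__name__)


@rank_zero_only
def log_checkpoint_status(message: str) -> None:
    # Hypothetical helper: the decorator makes this a no-op on every process
    # except global rank 0, so the log stays readable even though all ranks
    # enter the save path on TPU.
    log.info(message)
```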

@Borda Borda changed the title fix checkpointing on TPU [blocked by #2432] fix checkpointing on TPU Jul 27, 2020
torch.save(checkpoint, tmp_path)
os.replace(tmp_path, filepath)
if self.use_tpu:
    xm.save(checkpoint, tmp_path, master_only=True, global_master=True)
Contributor


Maybe add a barrier before xm.save, to make sure all processes are in sync?

Contributor


the barrier is already inside xm.save here

Contributor


Excuse my naivety, but shouldn't there be one before the save, to make sure that the weights have been updated by every process?
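
For what it's worth, an explicit barrier would be a one-liner with `xm.rendezvous`, which is a real `torch_xla` call; whether it is actually needed on top of `xm.save`'s internal rendezvous is the open question in this sub-thread. A sketch:

```python
import torch_xla.core.xla_model as xm


def save_with_explicit_barrier(checkpoint, path):
    # Belt-and-braces: block until every process has reached this point
    # before any rank starts serializing the checkpoint.
    xm.rendezvous("pre_checkpoint_save")
    # xm.save then performs its own rendezvous and only the global master
    # writes the file.
    xm.save(checkpoint, path, master_only=True, global_master=True)
```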

@matt-peters
Author

Thanks for the reviews and comments. I don't have the bandwidth to shepherd this to completion; this PR was meant to outline a fix, not to be ready to merge. I also haven't actually run the code, as I'm working off a different release and this was patched while resolving conflicts, so caveat emptor...

@Borda Borda changed the title [blocked by #2432] fix checkpointing on TPU fix checkpointing on TPU Jul 28, 2020
@Borda Borda added the bug Something isn't working label Jul 28, 2020
@Borda
Member

Borda commented Jul 28, 2020

@matt-peters mind allowing maintainers to edit this PR?

@matt-peters
Author

Edits are now allowed.

@williamFalcon williamFalcon changed the title fix checkpointing on TPU [WIP] fix checkpointing on TPU Jul 28, 2020
@mergify
Contributor

mergify bot commented Jul 31, 2020

This pull request is now in conflict... :(

@Borda Borda added this to the 0.9.0 milestone Aug 6, 2020
@Borda
Member

Borda commented Aug 7, 2020

@matt-peters what is missing? Is it still WIP?

@matt-peters
Author

LGTM

@Borda
Member

Borda commented Aug 10, 2020

It seems that the error comes from Horovod, even though this PR shouldn't touch it... @tgaddair?

@tgaddair
Contributor

It seems that the error comes from Horovod, even though this PR shouldn't touch it... @tgaddair?

I'm not sure they are Horovod-specific; it looks like there are similar failures for DDP.

@mergify
Contributor

mergify bot commented Aug 12, 2020

This pull request is now in conflict... :(

@edenlightning
Contributor

We finished our refactoring. @matt-peters can you please rebase this change? We would love to get this supported!

@Borda
Member

Borda commented Sep 21, 2020

@matt-peters how is it going here? Mind finishing it?

@matt-peters
Author

It looks like pytorch_lightning/trainer/training_io.py was removed from PyTorchLightning:master at some point, and the functionality presumably moved elsewhere. As most of my changes were in that file, I don't have the bandwidth to track them down, port them over, and test it out. With write access you should be able to resolve the conflicts and push to this branch.

@williamFalcon
Contributor

I can look at it :)

@Borda
Member

Borda commented Sep 25, 2020

@matt-peters how about this one, is it still WIP or ready to review? 🐰
If so, mind rebasing on master and resolving the conflicts...

@Borda
Member

Borda commented Oct 2, 2020

@matt-peters maybe after the refactoring it would be easier to reset this to master and make the changes again...
You can just:

git reset upstream/master
# do your changes here
git commit ...
git push -f

@edenlightning edenlightning modified the milestones: 0.9.x, 1.0, 1.1 Oct 4, 2020
@Borda Borda modified the milestones: 1.1, 1.0.x Oct 20, 2020
@stale

stale bot commented Nov 3, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Nov 3, 2020
@stale

stale bot commented Nov 8, 2020

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

@stale stale bot closed this Nov 8, 2020
@Borda
Member

Borda commented Nov 8, 2020

@matt-peters @lezwon can we finish this one?

@lezwon
Contributor

lezwon commented Nov 8, 2020

@Borda I have raised a new PR for this issue here: #4309

Labels
accelerator: tpu (Tensor Processing Unit) · bug (Something isn't working) · checkpointing (Related to checkpointing) · won't fix (This will not be worked on)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

auto enable TPU checkpoint save and checkpoint load using the proper wrappers
7 participants