
Checkpointing is broken on TPUs #2700

Closed
ibeltagy opened this issue Jul 25, 2020 · 7 comments · Fixed by #4309
Labels: accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)
Comments

ibeltagy (Contributor) commented Jul 25, 2020

🐛 Bug

PyTorch/XLA saves checkpoints using the following API, which pytorch-lightning does not support (model and the checkpoint path below are placeholders):

import torch_xla.core.xla_model as xm
xm.save(model.state_dict(), checkpoint_path)  # must be called by all processes

It is a little tricky to support because xm.save() has a barrier inside it and checks for rank == 0 internally, while torch.save doesn't. This means torch.save should be called only on the process with rank 0 (which pytorch-lightning does), but xm.save() must be called by all processes (or it will wait forever at the barrier). Consequently, the pytorch-lightning code that checks for the rank (here) will need to be switched off on TPUs, as sketched below.
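For illustration, a minimal sketch of the two saving conventions, shown back to back (the rank check, model, and checkpoint path are placeholders, not Lightning's internals):

import torch
import torch_xla.core.xla_model as xm

# Plain torch.save has no barrier, so the caller guards it and only
# rank 0 writes the file:
if xm.get_ordinal() == 0:
    torch.save(model.state_dict(), "checkpoint.ckpt")

# xm.save must be called by every process; it rendezvous internally
# and only the master ordinal actually writes the file:
xm.save(model.state_dict(), "checkpoint.ckpt")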

To Reproduce

  1. Train any model on TPUs using PyTorch/XLA with ptl.Trainer(checkpoint_callback=[ModelCheckpoint(...)], num_tpu_cores=8), as sketched below.
  2. Wait until the model saves one checkpoint, then kill the process.
  3. Try to load the saved checkpoint with ptl.Trainer(resume_from_checkpoint='path_to_saved_checkpoint', num_tpu_cores=8).
  4. See the error.
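A rough sketch of these steps, assuming the 0.8.5-era API from the report; MyModel and the paths are placeholders:

import pytorch_lightning as ptl
from pytorch_lightning.callbacks import ModelCheckpoint

# Steps 1-2: train on 8 TPU cores with checkpointing enabled,
# then interrupt the run after a checkpoint has been written.
trainer = ptl.Trainer(
    checkpoint_callback=ModelCheckpoint(filepath='checkpoints/'),
    num_tpu_cores=8,
)
trainer.fit(MyModel())

# Step 3: resuming from the saved checkpoint is where loading fails.
trainer = ptl.Trainer(
    resume_from_checkpoint='path_to_saved_checkpoint',
    num_tpu_cores=8,
)
trainer.fit(MyModel())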

Expected behavior

The checkpoint loads successfully.

Environment

pytorch-lightning==v0.8.5

Additional Context

Thanks to @matt-peters for finding the bug and suggesting the solution mentioned below.

ibeltagy added the bug and help wanted labels Jul 25, 2020
ibeltagy added a commit to ibeltagy/pytorch-lightning that referenced this issue Jul 25, 2020
Borda added the accelerator: tpu and priority: 0 labels Jul 25, 2020
Borda (Member) commented Jul 25, 2020

@lezwon mind having a look?

lezwon (Contributor) commented Jul 25, 2020

sure :]

ibeltagy (Contributor, Author) commented
Thanks, @lezwon. You might want to check this fix here: ibeltagy@a5c8d18, which works but I don't like it. I also tried calling the functions inside xm.save here in the main process only, without the barrier, but everything hangs, maybe because the processes go out of sync.

lezwon (Contributor) commented Jul 25, 2020

@ibeltagy Nice work :] I'll check out your solution and try to get back to you with a fix.

Borda mentioned this issue Jul 25, 2020
lezwon (Contributor) commented Jul 26, 2020

@ibeltagy I am able to reload the checkpoint successfully; however, the training fails due to some XLA device issue. Is it the same error you face? Could you share a notebook reproducing this issue?

matt-peters commented
Loading the checkpoint only fails when a TPU device is not available, because torch.save writes out XLA tensors instead of plain PyTorch tensors. Training on TPU and then moving to a CPU or GPU for further processing is a common workflow, so I don't think it's possible to reproduce this in a notebook. xm.save moves everything to CPU before saving to avoid this problem. I opened PR #2726 that includes a fix.
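As a rough, single-process illustration of why the CPU move matters (this is not the code in #2726; model and the file names are placeholders):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
state_dict = model.to(device).state_dict()  # tensors live on the XLA device

# Saving the XLA tensors directly yields a file that can only be
# loaded where torch_xla is available:
torch.save(state_dict, 'tpu_only.ckpt')

# Moving tensors to CPU first (roughly what xm.save does internally)
# yields a checkpoint that also loads on CPU/GPU-only machines:
cpu_state = {k: v.cpu() for k, v in state_dict.items()}
torch.save(cpu_state, 'portable.ckpt')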

Borda modified the milestones: 1.0.7, 1.0.x Nov 11, 2020
edenlightning modified the milestones: 1.0.x, 1.0.7 Nov 13, 2020
Borda modified the milestones: 1.0.7, 1.0.x, 1.1 Nov 13, 2020
coldfir3 commented
Is this still an issue? I am experiencing the exact same problem.
