
Checkpointing is broken on TPUs #2700

Closed
ibeltagy opened this issue Jul 25, 2020 · 7 comments · Fixed by #4309
Labels: accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)
Comments

ibeltagy (Contributor) commented Jul 25, 2020

🐛 Bug

PyTorch/XLA saves checkpoints using the following API, which pytorch-lightning does not support (model and the checkpoint path below are placeholders):

import torch_xla.core.xla_model as xm
xm.save(model.state_dict(), checkpoint_path)  # must be called by all processes

It is a little tricky to support because xm.save() has a barrier inside it and checks for rank == 0 internally, while torch.save doesn't. This means torch.save should be called only on the process with rank 0 (which pytorch-lightning does), but xm.save() must be called by all processes (or it will wait forever at the barrier). Consequently, the pytorch-lightning code that checks for the rank (here) will need to be switched off on TPUs, as sketched below.
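For illustration, a minimal sketch of the two saving conventions, shown back to back (the rank check, model, and checkpoint path are placeholders, not Lightning's internals):

import torch
import torch_xla.core.xla_model as xm

# Plain torch.save has no barrier, so the caller guards it and only
# rank 0 writes the file:
if xm.get_ordinal() == 0:
    torch.save(model.state_dict(), "checkpoint.ckpt")

# xm.save must be called by every process; it rendezvous internally
# and only the master ordinal actually writes the file:
xm.save(model.state_dict(), "checkpoint.ckpt")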

To Reproduce

  1. Train any model on TPUs using PyTorch/XLA with ptl.Trainer(checkpoint_callback=[ModelCheckpoint(...)], num_tpu_cores=8), as sketched below.
  2. Wait until the model saves one checkpoint, then kill the process.
  3. Try to load the saved checkpoint with ptl.Trainer(resume_from_checkpoint='path_to_saved_checkpoint', num_tpu_cores=8).
  4. See the error.
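A rough sketch of these steps, assuming the 0.8.5-era API from the report; MyModel and the paths are placeholders:

import pytorch_lightning as ptl
from pytorch_lightning.callbacks import ModelCheckpoint

# Steps 1-2: train on 8 TPU cores with checkpointing enabled,
# then interrupt the run after a checkpoint has been written.
trainer = ptl.Trainer(
    checkpoint_callback=ModelCheckpoint(filepath='checkpoints/'),
    num_tpu_cores=8,
)
trainer.fit(MyModel())

# Step 3: resuming from the saved checkpoint is where loading fails.
trainer = ptl.Trainer(
    resume_from_checkpoint='path_to_saved_checkpoint',
    num_tpu_cores=8,
)
trainer.fit(MyModel())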

Expected behavior

The checkpoint loads successfully.

Environment

pytorch-lightning==v0.8.5

Additional Context

Thanks to @matt-peters for finding the bug and suggesting the solution mentioned below.

ibeltagy added the bug and help wanted labels Jul 25, 2020
ibeltagy added a commit to ibeltagy/pytorch-lightning that referenced this issue Jul 25, 2020
Borda added the accelerator: tpu and priority: 0 labels Jul 25, 2020
Borda (Member) commented Jul 25, 2020

@lezwon mind having a look?

lezwon (Contributor) commented Jul 25, 2020

sure :]

ibeltagy (Contributor, Author) commented
Thanks, @lezwon. You might want to check this fix here: ibeltagy@a5c8d18, which works but I don't like it. I also tried calling the functions inside xm.save here in the main process only, without the barrier, but everything hangs, maybe because the processes go out of sync.

lezwon (Contributor) commented Jul 25, 2020

@ibeltagy Nice work :] I'll check out your solution and try to get back to you with a fix.

Borda mentioned this issue Jul 25, 2020
lezwon (Contributor) commented Jul 26, 2020

@ibeltagy I am able to reload the checkpoint successfully; however, the training fails due to some XLA device issue. Is it the same error you face? Could you share a notebook reproducing this issue?

matt-peters commented
Loading the checkpoint only fails when a TPU device is not available, because torch.save writes out XLA tensors instead of plain PyTorch tensors. Training on TPU and then moving to a CPU or GPU for further processing is a common workflow, so I don't think it's possible to reproduce this in a notebook. xm.save moves everything to CPU before saving to avoid this problem. I opened PR #2726 that includes a fix.
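As a rough, single-process illustration of why the CPU move matters (this is not the code in #2726; model and the file names are placeholders):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
state_dict = model.to(device).state_dict()  # tensors live on the XLA device

# Saving the XLA tensors directly yields a file that can only be
# loaded where torch_xla is available:
torch.save(state_dict, 'tpu_only.ckpt')

# Moving tensors to CPU first (roughly what xm.save does internally)
# yields a checkpoint that also loads on CPU/GPU-only machines:
cpu_state = {k: v.cpu() for k, v in state_dict.items()}
torch.save(cpu_state, 'portable.ckpt')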

Borda modified the milestones: 1.0.7, 1.0.x Nov 11, 2020
edenlightning modified the milestones: 1.0.x, 1.0.7 Nov 13, 2020
Borda modified the milestones: 1.0.7, 1.0.x, 1.1 Nov 13, 2020
coldfir3 commented
Is this still an issue? I am experiencing the exact same problem.
