
[WIP] fix checkpointing on TPU #2726

Closed
wants to merge 1 commit

Conversation

matt-peters

What does this PR do?

Draft PR: fixes checkpointing on TPU, see #2700

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

@mergify mergify bot requested a review from a team July 27, 2020 16:45
Member

@Borda Borda left a comment


I don't think that it fixes the issue; let's wait for the TPU tests fix #2632

def on_validation_end(self, trainer, pl_module):
    # only run on main process
    if trainer.global_rank != 0:
Member


This may break DDP
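For context, a minimal sketch of the distinction this comment is pointing at, assuming the `use_tpu` and `global_rank` attributes that appear in the diff (illustration only, not the PR's actual code):

```python
class CheckpointSketch:
    """Illustration only: how the rank guard differs between TPU and DDP."""

    def on_validation_end(self, trainer, pl_module):
        if getattr(trainer, "use_tpu", False):
            # XLA: every process must reach the save call, because xm.save
            # synchronizes internally and only the master rank writes.
            self._save_model(trainer)
        elif trainer.global_rank == 0:
            # DDP / single device: only global rank 0 should write, so the
            # early-return guard shown above is what keeps DDP safe.
            self._save_model(trainer)

    def _save_model(self, trainer):
        # Placeholder for the real checkpoint-writing logic.
        pass
```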

@@ -261,22 +261,36 @@ def _atomic_save(self, checkpoint, filepath: str):
            This points to the file that the checkpoint will be stored in.
        """
        tmp_path = str(filepath) + ".part"
        torch.save(checkpoint, tmp_path)
        os.replace(tmp_path, filepath)
        if self.use_tpu:
Member


There is on_tpu
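
Putting the pieces of this hunk together, the behaviour under discussion is roughly the following sketch (using the `on_tpu` name suggested above; `xm.save` and `xm.is_master_ordinal` are real `torch_xla` calls, the surrounding function is assumed, not the merged implementation):

```python
import os

import torch

try:
    import torch_xla.core.xla_model as xm
    XLA_AVAILABLE = True
except ImportError:
    XLA_AVAILABLE = False


def atomic_save_sketch(checkpoint, filepath, on_tpu=False):
    """Sketch of an XLA-aware atomic save; not the merged implementation."""
    tmp_path = str(filepath) + ".part"
    if on_tpu and XLA_AVAILABLE:
        # xm.save moves tensors to CPU, synchronizes all processes, and only
        # the global master actually writes, so every rank must call it.
        xm.save(checkpoint, tmp_path, master_only=True, global_master=True)
        if xm.is_master_ordinal(local=False):
            os.replace(tmp_path, filepath)
    else:
        torch.save(checkpoint, tmp_path)
        os.replace(tmp_path, filepath)
```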

# non-XLA. In XLA, it has a barrier and internal logic to only
# save for rank==0, so need to call for all ranks. For non-XLA,
# it doesn't have rank==0 logic so only call for rank==0
if self.use_tpu:
Member


Why not `or`?

@mergify mergify bot requested a review from a team July 27, 2020 17:29
@lezwon
Contributor

lezwon commented Jul 27, 2020

@matt-peters I think we will require an on_tpu check here too.

return

# To get checkpointing working on TPU, need to call _save_model
# for all ranks, to avoid deadlocks. Assuming save_function is mapped
Contributor

@ibeltagy ibeltagy Jul 27, 2020


nit: without `if trainer.global_rank != 0:` and `@rank_zero_only`, all processes write to the log, making it a little messy.
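
A minimal sketch of the pattern being asked for, assuming the `rank_zero_only` decorator from `pytorch_lightning.utilities` (the logging helper itself is hypothetical):

```python
import logging

from pytorch_lightning.utilities import rank_zero_only

log = logging.getLogger(__name__)


@rank_zero_only
def log_checkpoint_status(message: str) -> None:
    # Hypothetical helper: the decorator makes this a no-op on every process
    # except global rank 0, so the log stays readable even though all ranks
    # enter the save path on TPU.
    log.info(message)
```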

@Borda Borda changed the title fix checkpointing on TPU [blocked by #2432] fix checkpointing on TPU Jul 27, 2020
torch.save(checkpoint, tmp_path)
os.replace(tmp_path, filepath)
if self.use_tpu:
    xm.save(checkpoint, tmp_path, master_only=True, global_master=True)
Contributor


Maybe add a barrier before xm.save, to make sure all processes are in sync?

Contributor


the barrier is already inside xm.save here

Contributor


Excuse my naivety, but shouldn't there be one before the save, to make sure that the weights have been updated by every process?
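
For what it's worth, an explicit barrier would be a one-liner with `xm.rendezvous`, which is a real `torch_xla` call; whether it is actually needed on top of `xm.save`'s internal rendezvous is the open question in this sub-thread. A sketch:

```python
import torch_xla.core.xla_model as xm


def save_with_explicit_barrier(checkpoint, path):
    # Belt-and-braces: block until every process has reached this point
    # before any rank starts serializing the checkpoint.
    xm.rendezvous("pre_checkpoint_save")
    # xm.save then performs its own rendezvous and only the global master
    # writes the file.
    xm.save(checkpoint, path, master_only=True, global_master=True)
```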

@matt-peters
Author

Thanks for the reviews and comments. I don't have the bandwidth to shepherd this to completion; this PR was meant to outline a fix, not to be ready to merge. I also haven't actually run the code, as I'm working off a different release and this was patched while resolving conflicts, so caveat emptor...

@Borda Borda changed the title [blocked by #2432] fix checkpointing on TPU fix checkpointing on TPU Jul 28, 2020
@Borda Borda added the bug Something isn't working label Jul 28, 2020
@Borda
Member

Borda commented Jul 28, 2020

@matt-peters mind allowing maintainers to edit this PR?

@matt-peters
Author

Edits are now allowed.

@williamFalcon williamFalcon changed the title fix checkpointing on TPU [WIP] fix checkpointing on TPU Jul 28, 2020
@mergify
Contributor

mergify bot commented Jul 31, 2020

This pull request is now in conflict... :(

@Borda Borda added this to the 0.9.0 milestone Aug 6, 2020
@Borda
Member

Borda commented Aug 7, 2020

@matt-peters what is missing? Is it still WIP?

@matt-peters
Author

LGTM

@Borda
Member

Borda commented Aug 10, 2020

It seems that the error comes from Horovod, even though this PR shouldn't touch it... @tgaddair?

@tgaddair
Contributor

It seems that the error comes from Horovod, even though this PR shouldn't touch it... @tgaddair?

I'm not sure they are Horovod-specific; it looks like there are similar failures for DDP.

@mergify
Contributor

mergify bot commented Aug 12, 2020

This pull request is now in conflict... :(

@edenlightning
Contributor

We finished our refactoring. @matt-peters can you please rebase this change? We would love to get this supported!

@Borda
Member

Borda commented Sep 21, 2020

@matt-peters how is it going here? Mind finishing it?

@matt-peters
Author

It looks like pytorch_lightning/trainer/training_io.py was removed from PyTorchLightning:master at some point, and the functionality presumably moved elsewhere. As most of my changes were in that file, I don't have the bandwidth to track them down, port them over, and test it out. With write access you should be able to resolve the conflicts and push to this branch.

@williamFalcon
Contributor

I can look at it :)

@Borda
Member

Borda commented Sep 25, 2020

@matt-peters how about this one, is it still WIP or ready to review? 🐰
If so, mind rebasing on master and resolving the conflicts...

@Borda
Member

Borda commented Oct 2, 2020

@matt-peters maybe after the refactoring it would be easier to reset this to master and make the changes again...
You can just:

git reset upstream/master
# do your changes here
git commit ...
git push -f

@edenlightning edenlightning modified the milestones: 0.9.x, 1.0, 1.1 Oct 4, 2020
@Borda Borda modified the milestones: 1.1, 1.0.x Oct 20, 2020
@stale

stale bot commented Nov 3, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Nov 3, 2020
@stale

stale bot commented Nov 8, 2020

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

@stale stale bot closed this Nov 8, 2020
@Borda
Member

Borda commented Nov 8, 2020

@matt-peters @lezwon can we finish this one?

@lezwon
Contributor

lezwon commented Nov 8, 2020

@Borda I have raised a new PR for this issue here: #4309

Labels
accelerator: tpu (Tensor Processing Unit) · bug (Something isn't working) · checkpointing (Related to checkpointing) · won't fix (This will not be worked on)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

auto enable TPU checkpoint save and checkpoint load using the proper wrappers
7 participants