[WIP] fix checkpointing on TPU #2726
Conversation
I don't think this fixes the issue; let's wait for the TPU tests fix in #2632.
def on_validation_end(self, trainer, pl_module):
    # only run on main process
    if trainer.global_rank != 0:
This may break DDP.
@@ -261,22 +261,36 @@ def _atomic_save(self, checkpoint, filepath: str):
        This points to the file that the checkpoint will be stored in.
        """
        tmp_path = str(filepath) + ".part"
        torch.save(checkpoint, tmp_path)
        os.replace(tmp_path, filepath)
        if self.use_tpu:
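For context, a minimal sketch of the kind of branch being discussed here, assuming a use_tpu flag and torch_xla being importable; the names are illustrative and not the PR's exact diff:

    import os
    import torch

    try:
        import torch_xla.core.xla_model as xm
    except ImportError:
        xm = None

    def _atomic_save(checkpoint, filepath, use_tpu=False):
        # On TPU, xm.save only writes from the master and contains its own
        # rendezvous, so every process must reach this call.
        if use_tpu and xm is not None:
            xm.save(checkpoint, filepath, master_only=True, global_master=True)
            return
        # On CPU/GPU, write to a temp file first so an interrupted save never
        # leaves a corrupt file at the final checkpoint path.
        tmp_path = str(filepath) + ".part"
        torch.save(checkpoint, tmp_path)
        os.replace(tmp_path, filepath)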
There is already on_tpu.
        # non-XLA. In XLA, it has a barrier and internal logic to only
        # save for rank==0, so need to call for all ranks. For non-XLA,
        # it doesn't have rank==0 logic so only call for rank==0
        if self.use_tpu:
Why not or?
@matt-peters I think we will require an ...
            return

        # To get checkpointing working on TPU, need to call _save_model
        # for all ranks, to avoid deadlocks. Assuming save_function is mapped
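As a rough sketch of that rank logic (the trainer attribute name below is an assumption, not necessarily what Lightning exposes):

    def should_call_save(trainer):
        # XLA: xm.save has an internal barrier, so every rank must call it;
        # otherwise the non-zero ranks never reach the barrier and the job hangs.
        if getattr(trainer, "use_tpu", False):
            return True
        # Non-XLA: torch.save has no rank handling, so only rank 0 writes.
        return trainer.global_rank == 0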
nit: without if trainer.global_rank != 0: and @rank_zero_only, all threads write to the log, making it a little messy.
        torch.save(checkpoint, tmp_path)
        os.replace(tmp_path, filepath)
        if self.use_tpu:
            xm.save(checkpoint, tmp_path, master_only=True, global_master=True)
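For reference, a hedged reading of those flags, assuming current torch_xla semantics:

    # master_only=True: only the master process serializes the data.
    # global_master=True: "master" means global ordinal 0 across all hosts,
    #   not each host's local master (relevant on multi-host TPU pods).
    # Every process must still enter the call, because it ends with a
    # rendezvous that keeps the replicas in sync.
    xm.save(checkpoint, tmp_path, master_only=True, global_master=True)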
Maybe add a barrier before xm.save, to make sure all processes are in sync?
The barrier is already inside xm.save (see here).
Excuse my naiveness, but shouldn't there be one before save, to make sure that the weights have been updated by every process?
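If an explicit sync point before saving is wanted anyway, a sketch using torch_xla's rendezvous (the tag name is arbitrary) could look like:

    import torch_xla.core.xla_model as xm

    def save_with_explicit_sync(checkpoint, filepath):
        # All processes wait here, so by the time the master writes, every
        # replica has finished its last step.
        xm.rendezvous("pre_checkpoint_save")
        # xm.save performs its own rendezvous as well, so this outer barrier
        # is extra insurance rather than a strict requirement.
        xm.save(checkpoint, filepath, master_only=True, global_master=True)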
Thanks for the reviews and comments. I don't have the bandwidth to shepherd this to completion; this PR was meant to outline a fix, not to be ready to merge. I also haven't actually run the code, as I'm working off a different release and this was patched while resolving conflicts, so caveat emptor...
@matt-peters mind allowing maintainers to edit this PR?
Edits are now allowed.
This pull request is now in conflict... :(
@matt-peters what is missing? Is it still WIP?
LGTM
It seems that the error comes from Horovod, even though this PR shouldn't touch it... @tgaddair?
I'm not sure they are Horovod-specific; it looks like there are similar failures for DDP.
This pull request is now in conflict... :(
We finished our refactoring. @matt-peters can you please rebase this change? We would love to get this supported!!
@matt-peters how is it going here? Mind finishing it?
It looks like ...
I can look at it :)
@matt-peters how about this one, is it still WIP or ready to review? 🐰
@matt-peters maybe after the refactoring it would be easier to reset this to master and make the changes again...
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.
This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.
@matt-peters @lezwon can we finish this one?
What does this PR do?
Draft PR: fixes checkpointing on TPU, see #2700