
auto enable TPU checkpoint save and checkpoint load using the proper wrappers #1363

Closed
patil-suraj opened this issue Apr 3, 2020 · 5 comments
Labels: accelerator: tpu, feature, priority: 0, won't fix

Comments

@patil-suraj

I trained a model on TPU with Google Colab. When trying to load the checkpoint on GPU, it gives the following error:

RuntimeError: Could not run 'aten::empty.memory_format' with arguments from the 'XLATensorId' backend. 'aten::empty.memory_format' is only available for these backends: [CUDATensorId, SparseCPUTensorId, VariableTensorId, CPUTensorId, MkldnnCPUTensorId, SparseCUDATensorId].

How do I load a checkpoint saved on TPU onto CPU/GPU?
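
For context, a minimal reconstruction of the pattern that triggers this error (not the exact Colab code): the state dict is saved with plain torch.save() while its tensors still live on the XLA device, so a later torch.load() on a machine without torch_xla tries to rebuild XLA-backend tensors.

```python
import torch
import torch_xla.core.xla_model as xm

# Save while the tensors are still XLA tensors (reconstruction, not the exact code).
device = xm.xla_device()
model = torch.nn.Linear(4, 2).to(device)
torch.save(model.state_dict(), "tpu_checkpoint.pt")

# Later, on a CPU/GPU machine without torch_xla, this raises the
# RuntimeError above: torch.load() cannot recreate XLATensorId tensors.
state_dict = torch.load("tpu_checkpoint.pt")
```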

@patil-suraj added the bug and help wanted labels Apr 3, 2020
@williamFalcon
Contributor

Ummm, great point. @dlibenzi, any docs on this?

@dlibenzi

dlibenzi commented Apr 4, 2020

The recommended way to save a checkpoint is to use the xm.save() API.
It is a thin wrapper around torch.save() that makes sure there is symmetric graph execution on all cores:

https://github.com/pytorch/xla/blob/3518650335dba07820df8d5f6a33d1a969f78265/torch_xla/core/xla_model.py#L521

Saved tensors are PyTorch CPU tensors.
The load happens using vanilla torch.load(), which reads the PyTorch CPU tensors back in; then create your normal model, set the state dictionary, and call model.to(xla_device).
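
A minimal sketch of that flow (the nn.Linear model and file name are stand-ins for the real ones):

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

# On the TPU process: the model lives on the XLA device; xm.save() wraps
# torch.save() and moves the tensors to CPU first, so the checkpoint file
# holds plain CPU tensors.
model = nn.Linear(4, 2).to(xm.xla_device())  # stand-in for the trained model
xm.save(model.state_dict(), "checkpoint.pt")

# On CPU/GPU: vanilla torch.load() reads the CPU tensors back in; then
# create the model normally, set the state dict, and move it to the device.
state_dict = torch.load("checkpoint.pt", map_location="cpu")
new_model = nn.Linear(4, 2)
new_model.load_state_dict(state_dict)
new_model.to("cuda" if torch.cuda.is_available() else "cpu")
```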

@williamFalcon
Contributor

I thought we had automated this.
This should be embedded in the auto checkpoint feature and in load-from-checkpoint.

Let's turn this issue into those feature requests.
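
A sketch of what that auto-dispatch could look like (the helper and its on_tpu flag are hypothetical, not Lightning's actual implementation):

```python
import torch

def save_checkpoint(checkpoint: dict, filepath: str, on_tpu: bool) -> None:
    """Hypothetical dispatch helper: route TPU saves through xm.save()."""
    if on_tpu:
        import torch_xla.core.xla_model as xm
        # xm.save() moves XLA tensors to CPU before writing, so the file
        # can later be loaded with plain torch.load() on any backend.
        xm.save(checkpoint, filepath)
    else:
        torch.save(checkpoint, filepath)
```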

@williamFalcon changed the title from "Can't load tpu checkpoint on GPU" to "auto enable TPU checkpoint save and checkpoint load using the proper wrappers" Apr 4, 2020
@williamFalcon added the feature label Apr 4, 2020
@williamFalcon added this to the 0.7.2 milestone Apr 4, 2020
@Borda modified the milestones: 0.7.2, 0.7.3 Apr 8, 2020
@Borda added the accelerator: tpu label Apr 9, 2020
@Borda modified the milestones: 0.7.4, 0.7.5 Apr 26, 2020
@Borda modified the milestones: 0.7.6, 0.8.0, 0.7.7 May 12, 2020
@Borda modified the milestones: 0.7.7, 0.8.0 May 26, 2020
@Borda modified the milestones: 0.8.0, 0.9.0 Jun 9, 2020
@williamFalcon added the priority: 0 label and removed the bug label Jun 26, 2020
@williamFalcon self-assigned this Jun 26, 2020
@Borda
Member

Borda commented Jul 27, 2020

Will be fixed in #2726.

@edenlightning modified the milestones: 0.9.0, 0.9.x Aug 18, 2020
@edenlightning removed the help wanted label Sep 17, 2020
@edenlightning linked a pull request Sep 17, 2020 that will close this issue
@edenlightning modified the milestones: 0.9.x, 1.1 Sep 23, 2020
@stale

stale bot commented Nov 3, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale bot added the won't fix label Nov 3, 2020
@stale bot closed this as completed Nov 10, 2020