
auto enable TPU checkpoint save and checkpoint load using the proper wrappers #1363

Closed
patil-suraj opened this issue Apr 3, 2020 · 5 comments
Labels: accelerator: tpu, feature, priority: 0, won't fix

Comments

@patil-suraj

I trained a model on TPU with Google Colab. When trying to load the checkpoint on GPU, it gives the following error:

RuntimeError: Could not run 'aten::empty.memory_format' with arguments from the 'XLATensorId' backend. 'aten::empty.memory_format' is only available for these backends: [CUDATensorId, SparseCPUTensorId, VariableTensorId, CPUTensorId, MkldnnCPUTensorId, SparseCUDATensorId].

How do I load a checkpoint saved on TPU onto CPU/GPU?
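
For context, a minimal reconstruction of the pattern that triggers this error (not the exact Colab code): the state dict is saved with plain torch.save() while its tensors still live on the XLA device, so a later torch.load() on a machine without torch_xla tries to rebuild XLA-backend tensors.

```python
import torch
import torch_xla.core.xla_model as xm

# Save while the tensors are still XLA tensors (reconstruction, not the exact code).
device = xm.xla_device()
model = torch.nn.Linear(4, 2).to(device)
torch.save(model.state_dict(), "tpu_checkpoint.pt")

# Later, on a CPU/GPU machine without torch_xla, this raises the
# RuntimeError above: torch.load() cannot recreate XLATensorId tensors.
state_dict = torch.load("tpu_checkpoint.pt")
```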

@patil-suraj added the bug and help wanted labels Apr 3, 2020
@williamFalcon
Contributor

Ummm, great point. @dlibenzi, any docs on this?

@dlibenzi

dlibenzi commented Apr 4, 2020

The recommended way to save a checkpoint is to use the xm.save() API.
It is a thin wrapper around torch.save() that makes sure there is symmetric graph execution on all cores:

https://github.com/pytorch/xla/blob/3518650335dba07820df8d5f6a33d1a969f78265/torch_xla/core/xla_model.py#L521

Saved tensors are PyTorch CPU tensors.
The load happens using vanilla torch.load(), which reads the PyTorch CPU tensors back in; then create your normal model, set the state dictionary, and call model.to(xla_device).
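
A minimal sketch of that flow (the nn.Linear model and file name are stand-ins for the real ones):

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

# On the TPU process: the model lives on the XLA device; xm.save() wraps
# torch.save() and moves the tensors to CPU first, so the checkpoint file
# holds plain CPU tensors.
model = nn.Linear(4, 2).to(xm.xla_device())  # stand-in for the trained model
xm.save(model.state_dict(), "checkpoint.pt")

# On CPU/GPU: vanilla torch.load() reads the CPU tensors back in; then
# create the model normally, set the state dict, and move it to the device.
state_dict = torch.load("checkpoint.pt", map_location="cpu")
new_model = nn.Linear(4, 2)
new_model.load_state_dict(state_dict)
new_model.to("cuda" if torch.cuda.is_available() else "cpu")
```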

@williamFalcon
Contributor

I thought we had automated this.
This should be embedded in the auto checkpoint feature and in load-from-checkpoint.

Let's turn this issue into those feature requests.
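
A sketch of what that auto-dispatch could look like (the helper and its on_tpu flag are hypothetical, not Lightning's actual implementation):

```python
import torch

def save_checkpoint(checkpoint: dict, filepath: str, on_tpu: bool) -> None:
    """Hypothetical dispatch helper: route TPU saves through xm.save()."""
    if on_tpu:
        import torch_xla.core.xla_model as xm
        # xm.save() moves XLA tensors to CPU before writing, so the file
        # can later be loaded with plain torch.load() on any backend.
        xm.save(checkpoint, filepath)
    else:
        torch.save(checkpoint, filepath)
```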

@williamFalcon changed the title from "Can't load tpu checkpoint on GPU" to "auto enable TPU checkpoint save and checkpoint load using the proper wrappers" Apr 4, 2020
@williamFalcon added the feature label Apr 4, 2020
@williamFalcon added this to the 0.7.2 milestone Apr 4, 2020
@Borda modified the milestones: 0.7.2, 0.7.3 Apr 8, 2020
@Borda added the accelerator: tpu label Apr 9, 2020
@Borda modified the milestones: 0.7.4, 0.7.5 Apr 26, 2020
@Borda modified the milestones: 0.7.6, 0.8.0, 0.7.7 May 12, 2020
@Borda modified the milestones: 0.7.7, 0.8.0 May 26, 2020
@Borda modified the milestones: 0.8.0, 0.9.0 Jun 9, 2020
@williamFalcon added the priority: 0 label and removed the bug label Jun 26, 2020
@williamFalcon self-assigned this Jun 26, 2020
@Borda
Member

Borda commented Jul 27, 2020

Will be fixed in #2726.

@edenlightning modified the milestones: 0.9.0, 0.9.x Aug 18, 2020
@edenlightning removed the help wanted label Sep 17, 2020
@edenlightning linked a pull request Sep 17, 2020 that will close this issue
@edenlightning modified the milestones: 0.9.x, 1.1 Sep 23, 2020
@stale

stale bot commented Nov 3, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale bot added the won't fix label Nov 3, 2020
@stale bot closed this as completed Nov 10, 2020