Colab TPU training stuck at end of epoch if checkpoint_callback=True #3660
Comments
Hi! Thanks for your contribution, great first issue!
Thanks. Seems related. Is there another mechanism to save checkpoints as a temporary fix?
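(A possible temporary workaround, sketched below and not verified here: disable the checkpoint callback during training and save the weights explicitly once `fit` returns. `model`, `train_loader`, and the checkpoint path are placeholders, not code from this thread, and whether the trained weights are available in the main process after multi-core TPU training may depend on the Lightning version.)

```python
import pytorch_lightning as pl

# Hypothetical workaround sketch: skip the checkpoint callback that appears
# to trigger the hang, then write one explicit checkpoint after training.
trainer = pl.Trainer(
    tpu_cores=8,
    max_epochs=2,
    checkpoint_callback=False,   # avoid the callback-related hang at end of epoch
)
trainer.fit(model, train_loader)         # `model` / `train_loader` defined elsewhere
trainer.save_checkpoint("manual.ckpt")   # explicit save once fit() returns
```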
@ktrapeznikov mind checking it now, as Google Colab had an internal issue with TPUs in the past few days?
Last time I tried... same issue.
This would be a nice feature to have, since I would rarely do inference on TPU, just training.
@lezwon maybe you can help revive the TPU + checkpoint fix?
Been working on it. Finding it a bit tricky. I'll raise a PR with the current work I've done.
Hi, I am experiencing training being stuck at 0% on the first epoch when using TPU with pytorch-lightning. Any possible reasons why?
@sarmientoj24 mind sharing the notebook?
I am seeing training stuck at the last step of the 1st epoch when `checkpoint_callback=True` is enabled. I am passing `tpu_cores=8` to the trainer. If it's `False`, training is still slow for the first few steps and then it speeds up (I guess this is expected, since it takes a few steps for XLA to compile stuff). Here is a link to the Colab notebook example.
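For reference, a minimal sketch of the setup described above, assuming a toy model and random data in place of the notebook's actual code (which is not reproduced here):

```python
# Minimal sketch of the reported configuration: 8 TPU cores with the
# checkpoint callback enabled. The model and data below are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(512, 32), torch.randn(512, 1))
    loader = DataLoader(dataset, batch_size=32)

    trainer = pl.Trainer(
        tpu_cores=8,               # train on all 8 TPU cores
        max_epochs=2,
        checkpoint_callback=True,  # the reported hang occurs with this enabled
    )
    trainer.fit(ToyModel(), loader)
```

With `checkpoint_callback=False`, the same setup reportedly runs to completion, slowly for the first few steps while XLA compiles.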