SIGSEGV when training on TPU #4464

Closed
FabianBell opened this issue Nov 1, 2020 · 7 comments
Labels
accelerator: tpu Tensor Processing Unit question Further information is requested

Comments

@FabianBell

❓ Questions and Help

Before asking:

  1. Try to find answers to your questions in the Lightning Forum!
  2. Search for similar issues.
  3. Search the docs.

What is your question?

I tried to apply the Reformer model to a sentiment analysis task and train it on a TPU. I get the following error:

ProcessExitedException: process X terminated with signal SIGSEGV

What did I do wrong?
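A "terminated with signal SIGSEGV" message means a worker process crashed in native code, so no Python traceback is shown by default. One generic way to get more detail is the standard-library `faulthandler` module, which dumps the Python stack when a fatal signal arrives. The sketch below is not taken from the notebook: it simulates the crash in a plain child process rather than on a TPU, and the names `CHILD`, `run_child`, and `describe_exit` are hypothetical helpers used only for illustration.

```python
import signal
import subprocess
import sys

# Child program (illustrative only): enable faulthandler, then deliver
# SIGSEGV to itself to simulate a crash in native code.
CHILD = (
    "import faulthandler, os, signal\n"
    "faulthandler.enable()\n"
    "os.kill(os.getpid(), signal.SIGSEGV)\n"
)


def run_child():
    """Run the crashing child and capture its stderr."""
    return subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )


def describe_exit(returncode):
    """On POSIX, a negative return code means the process died from a signal."""
    if returncode < 0:
        return "terminated with signal " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode


result = run_child()
print(describe_exit(result.returncode))
# With faulthandler enabled, stderr holds the Python stack at the time of the crash.
print(result.stderr)
```

In a multi-process training run, the equivalent step would be calling `faulthandler.enable()` at the start of the function that each worker executes. Whether the resulting stack is informative depends on where in the native code the crash happens.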

Code

You can find my code in a Colab notebook here.

What have you tried?

I tried to stick to notebook 1 for the general setup and notebook 2 for the TPU setup. I saw #1956 and #2124; however, it still does not work with the latest version (1.0.4).

What's your environment?

  • OS: Linux
  • Packaging: pip
  • Version: 1.0.4
@FabianBell FabianBell added the question Further information is requested label Nov 1, 2020
@github-actions
Contributor

github-actions bot commented Nov 1, 2020

Hi! Thanks for your contribution! Great first issue!

@rohitgr7 rohitgr7 added the accelerator: tpu Tensor Processing Unit label Nov 1, 2020
@rohitgr7
Contributor

rohitgr7 commented Nov 1, 2020

cc @lezwon

@lezwon
Contributor

lezwon commented Nov 4, 2020

This seems to be an XLA issue; it is tracked in pytorch/xla#1775.

@FabianBell Would you mind adding the following at the beginning of your notebook and trying again?

import os
os.environ['XLA_USE_32BIT_LONG'] = '1'
os.environ['TRIM_GRAPH_SIZE'] = '1000000'
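
A minimal sketch of the same workaround wrapped in a helper, for anyone who wants to check that the variables are set early enough. It assumes (this is not confirmed in the thread) that the safest moment to set them is before `torch_xla` is imported, since child processes inherit the parent's environment at spawn time. The names `XLA_WORKAROUND_ENV` and `apply_xla_env` are hypothetical.

```python
import os
import sys

# Values copied from the suggestion above.
XLA_WORKAROUND_ENV = {
    "XLA_USE_32BIT_LONG": "1",
    "TRIM_GRAPH_SIZE": "1000000",
}


def apply_xla_env(env=os.environ, modules=sys.modules):
    """Set the workaround variables in `env`.

    Returns True if torch_xla has not been imported yet, i.e. the
    variables were set before the library was loaded (assumed here to be
    the safe ordering).
    """
    env.update(XLA_WORKAROUND_ENV)
    return "torch_xla" not in modules


if not apply_xla_env():
    print("Warning: torch_xla was already imported; restart the runtime "
          "and set the variables in the first cell.")
```

In a notebook, this would go in the very first cell, before any import of `torch_xla` or `pytorch_lightning`.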

@FabianBell
Author

@lezwon thank you for your help.

I changed the notebook but I still get the same error.

@lezwon
Contributor

lezwon commented Nov 5, 2020

I get a different error: Notebook

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got XLAIntType instead (while checking arguments for embedding)

Will look into this.

@lezwon
Contributor

lezwon commented Nov 9, 2020

@FabianBell I'm not able to figure out the root cause of the error mentioned above. I think it might be similar to this one: huggingface/transformers#2952

@FabianBell
Author

@lezwon thank you for your help. I followed the notebook and ended up with the same error. I do not think this is a PyTorch Lightning problem, so I will close this issue.
