terminate called after throwing an instance of 'std::runtime_error' #2338
Davide suspected, just based on the error, that maybe the main process was doing some kind of TPU work before calling spawn. Then the initial TPU state gets interrupted and leads to the error you see.
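A hypothetical illustration of that suspected failure mode, for concreteness. The xm/xmp calls are the standard torch_xla API; _mp_fn and the parent-side call are made-up placeholders, not the Lightning code:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned child process is meant to acquire its own XLA device here.
    device = xm.xla_device()
    # ... per-core training work ...

_ = xm.xla_device()                   # TPU work in the parent process before spawning...
xmp.spawn(_mp_fn, args=(), nprocs=8)  # ...which can leave the children with a broken TPU session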
I tried running with an earlier version of pytorch/xla (the June 30 build) and got the same error. This is the commit where the tests started failing: Lightning-AI/pytorch-lightning@11069c8. It looks like the failing test was added in that commit, is that right?
It looks like the failing test added in that commit is very similar to your old test, which I think is still passing. New test:
Old test:
Maybe the problem is in
Another idea: do the tests run in parallel? I think if one test tried to kick off a TPU test while another is running, you would see the session-not-found error. It might explain why the job fails at different points on GCP vs. Colab; maybe the parallel test starts up at a different time on the different platforms. You could try disabling parallelism. Another good confirmation is just to disable the
I already mentioned this to William ... their use of tpu_cores=tpu_cores,
This test is now failing because it was never running before... the pytest.mark.spawn decorator was for some reason skipping the test. To see this, run the test with
@dlibenzi the tpu_cores argument can either be 1 or 8... I believe someone enabled indexing into a core to run on a specific core on Kaggle, but I haven't seen that work on Colab.
Use an index of a core
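For reference, a hedged sketch of the three tpu_cores forms described above, as I understand the Lightning Trainer argument (exact semantics may differ between versions):

from pytorch_lightning import Trainer

trainer_single  = Trainer(tpu_cores=1)    # train on a single TPU core
trainer_all     = Trainer(tpu_cores=8)    # train across all 8 cores (spawned under the hood)
trainer_indexed = Trainer(tpu_cores=[5])  # train on the specific core with index 5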
@zcain117 Can you point me to the test code?
This is the commit where the tests started failing: Lightning-AI/pytorch-lightning@11069c8
The wrapper code for
I'm wondering if the seed reset or logging setup in
If a single test process mixes single-device and multi-device tests, you need to wrap the test call in a multiprocessing Process.
I looked through some of the recent TPU runs and did not see any more TPU session errors ever since you switched the code away from parameterized tests which were running both 1-core and 8-core versions of the same test. Some more context on 1-core vs 8-core tests:
You might have already fixed the problem since I don't see it anymore. If you do see it again, the safest way to mix 1-core and 8-core tests in the same file is to run each test in its own process, as described above. Davide wrote up an idea for how this might look:
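A minimal sketch of that idea, assuming a pytest-style test file; run_in_subprocess and _train_model are hypothetical helpers, not the actual Lightning or XLA test utilities:

import multiprocessing as mp

def run_in_subprocess(test_fn, *args, **kwargs):
    # Run the test body in a fresh process so its TPU session cannot leak into
    # (or be clobbered by) the next test in the same pytest process.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=test_fn, args=args, kwargs=kwargs)
    proc.start()
    proc.join()
    assert proc.exitcode == 0, f"{test_fn.__name__} exited with code {proc.exitcode}"

def _train_model(tpu_cores):
    # Placeholder for the real training body (build a Trainer(tpu_cores=...) and fit).
    pass

def test_model_on_one_tpu_core():
    run_in_subprocess(_train_model, tpu_cores=1)

def test_model_on_eight_tpu_cores():
    run_in_subprocess(_train_model, tpu_cores=8)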
If you run into more
We are adding that to our test support code:
It seems that the test with a specific index works but fails if we use only
Are you wrapping all tests with the proper decorator?
Looks like a different error, right?
I don't think this is related to the session-not-found error
Yes, it seems that your proposal of wrapping the test helped, but another issue came up...
In fact, the test was not checking whether the training finished correctly.
I am using this one:
Yeah, this has nothing to do with our code.
The place that @zcain117 pointed out is one culprit.
Seems to work here. Doing this, we can parallelize 8 Lightning modules, each on a different core. Is this incorrect somehow? How does this work on Kaggle but not here?
The code is not valid. Please do not make me repeat this one more time 😄
Here is an example Kaggle notebook that is closer to our recommended usage pattern: https://www.kaggle.com/abhishek/i-like-clean-tpu-training-kernels-i-can-not-lie/notebook |
That’s what we do... What we are wondering is how @lezwon got this other version to work.
The implementation I made was based on this kernel: https://www.kaggle.com/abhishek/super-duper-fast-pytorch-tpu-kernel
I think Abhishek's super-duper-fast kernel is manually spawning processes using
You should instead use his Kaggle notebook here, which uses
The general flow for the recommended approach is roughly what the sketch below shows:
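A minimal sketch of the spawn-based multi-core flow, assuming the standard torch_xla xmp.spawn API (the model and training loop are placeholders, not code from either notebook):

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process gets its own ordinal and its own XLA device.
    device = xm.xla_device()
    model = torch.nn.Linear(10, 1).to(device)  # placeholder model
    # ... build the per-core data loader and optimizer, then run the training
    # loop, stepping with xm.optimizer_step(optimizer) so gradients are
    # all-reduced across cores ...

if __name__ == "__main__":
    # All TPU work lives inside _mp_fn; the parent process only calls spawn.
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method="fork")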
@zcain117 Both these approaches have been implemented in Lightning. If
Yes, exactly. The point is that we allow users to run one model on all 8 cores, or 8 models, each on its own core.
I see, thanks for the clarification. That sounds like the right usage. However, the error a few comments above shows this:
This implies to me that the code is giving a string argument to
That string arg is coming from here. Maybe
That's odd. I have never seen that error before. As far as I know, there's only one place where
@zcain117 @lezwon let's move the discussion to Lightning-AI/pytorch-lightning#2632, as the error here is not valid anymore...
🐛 Bug
We are running tests on Colab and on GCP with a TPU, and they started failing a few days ago (not sure if it is a pure XLA issue or our bad usage). The two environments yield the very same error:
The output error is:
To Reproduce
Here is the Colab notebook: https://colab.research.google.com/drive/1Gr1Wg4zVnu15WHE_-dU2YKr4Z5xsy-fL#scrollTo=Mx61q3X5bwoW
This is the output from GCP: https://github.com/PyTorchLightning/pytorch-lightning/runs/854754135?check_suite_focus=true
Steps to reproduce the behavior:
! git clone https://github.com/PyTorchLightning/pytorch-lightning.git
! pip install -r requirements/devel.txt -q -U
! curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
! python pytorch-xla-env-setup.py --version "20200708" #@param ["20200708","nightly", "xrt==1.15.0"]
! python -m pytest tests/models/test_tpu.py -v
Additional context