
terminate called after throwing an instance of 'std::runtime_error #2338

Closed · Borda opened this issue Jul 9, 2020 · 31 comments

@Borda commented Jul 9, 2020

🐛 Bug

We are running tests on Colab and GCP with TPU, and they started failing a few days ago (not sure if it is a pure XLA issue or our bad usage). The two environments yield the very same error:

  • Colab hangs on the 5th test, and if I stop the cell it gives the error
  • GCP fails straight away...

The output error is:

terminate called after throwing an instance of 'std::runtime_error'
  what():  tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1104 : Check failed: session->session()->Run( feed_inputs, {}, {cached_node.operations[0]}, &outputs) == ::tensorflow::Status::OK() (Not found: From /job:tpu_worker/replica:0/task:0:
XRT memory handle not found: 347330053282930
	 [[{{node XRTReleaseAllocationHandle}}]] vs. OK)
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
	xla::XrtComputationClient::HandleReleaser()
	xla::util::TriggeredTask::Runner()
	clone
*** End stack trace ***

To Reproduce

Here is the Colab notebook: https://colab.research.google.com/drive/1Gr1Wg4zVnu15WHE_-dU2YKr4Z5xsy-fL#scrollTo=Mx61q3X5bwoW
and this is the output from GCP: https://github.com/PyTorchLightning/pytorch-lightning/runs/854754135?check_suite_focus=true

Steps to reproduce the behavior:

  1. ! git clone https://github.com/PyTorchLightning/pytorch-lightning.git
  2. ! pip install -r requirements/devel.txt -q -U
  3. ! curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
  4. ! python pytorch-xla-env-setup.py --version "20200708" #@param ["20200708","nightly", "xrt==1.15.0"]
  5. ! python -m pytest tests/models/test_tpu.py -v

Additional context

@zcain117 (Collaborator) commented Jul 9, 2020

Davide suspected, just based on the error, that maybe the main process was doing some kind of TPU work before calling spawn. The initial TPU state then gets interrupted and leads to the error you see.
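
To make that hypothesis concrete, here is a minimal sketch of the kind of pattern being suspected (hypothetical code for illustration, not the actual Lightning test code):

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Suspected anti-pattern: the parent process touches the TPU before spawning,
# creating XRT allocations owned by the parent ...
device = xm.xla_device()
probe = torch.ones(1, device=device)

def _mp_fn(index):
    # ... the spawned workers then re-initialize the TPU, and the parent's
    # deferred handle releases can fail with "XRT memory handle not found".
    _ = xm.xla_device()

if __name__ == '__main__':
    xmp.spawn(_mp_fn, nprocs=8)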

@zcain117 (Collaborator) commented Jul 9, 2020

I tried running with an earlier version of pytorch/xla (the June 30 build) and got the same error.

This is the commit where the tests started failing: Lightning-AI/pytorch-lightning@11069c8

It looks like the failing test was added in that commit, is that right?

@zcain117 (Collaborator) commented Jul 9, 2020

It looks like the failing test added in that commit is very similar to your old test, which I think is still passing.

New test:

def test_base_tpu_model(tmpdir, tpu_cores):
    """Make sure model trains on TPU."""
    trainer_options = dict(
        default_root_dir=tmpdir,
        progress_bar_refresh_rate=0,
        max_epochs=1,
        tpu_cores=tpu_cores,
        limit_train_batches=0.4,
        limit_val_batches=0.4
    )

    model = EvalModelTemplate()
    tpipes.run_model_test(trainer_options, model, on_gpu=False, with_hpc=False)

Old test:

def test_multi_core_tpu_model(tmpdir, tpu_cores):
    """Test if distributed TPU core training works"""
    model = EvalModelTemplate()
    trainer = Trainer(
        default_root_dir=tmpdir,
        max_epochs=1,
        train_percent_check=0.4,
        val_percent_check=0.2,
        tpu_cores=tpu_cores,
    )
    trainer.fit(model)
    assert trainer.tpu_id is None

Maybe the problem is in tpipes.run_model_test?

@zcain117 (Collaborator)

Another idea: does tpipes introduce any kind of attempt to run tests in parallel?

I think if it tried to kick off a TPU test while another is running you would see the session not found error. It might explain why the job fails at different points on GCP vs Colab; maybe the parallel test starts up at a different time on the different platforms.

You could try disabling parallelism.

Another good confirmation would be just to disable the tpipes test altogether and see whether the tests pass or not.

@dlibenzi (Collaborator)

I already mentioned this to William ... their use of tpu_cores created issues in the past:

tpu_cores=tpu_cores,

@williamFalcon commented Jul 10, 2020

This test is now failing because it was never actually running before...

The pytest.mark.spawn decorator was, for some reason, skipping the test. To see this, run the test with assert False and it will still pass.

@dlibenzi the tpu_cores argument can either be 1 or 8... I believe someone enabled indexing into a core to run on a specific core on Kaggle, but I haven't seen that work on Colab.

# use 1 or 8 cores
tpu_cores=1
tpu_cores=8

Or use the index of a specific core:

# use 1 core (indexed by 2, so the third core)
tpu_cores=[2]

@dlibenzi (Collaborator)

@zcain117 Can you point me to the test code?

@zcain117 (Collaborator)

> @zcain117 Can you point me to the test code?

This is the commit where the tests started failing: Lightning-AI/pytorch-lightning@11069c8
Ctrl-F for test_tpu.py.

The wrapper code for run_model_test lives here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/base/develop_pipelines.py#L37

I'm wondering if the seed reset or logging setup in run_model_test is causing issues. If I understand correctly, @williamFalcon is saying that the code works for model training when run outside of the unit tests.

@dlibenzi (Collaborator)

If a single test process mixes single-device and multi-device tests, you need to wrap each test call in a multiprocessing Process, so that every test is isolated in its own environment.

@zcain117 (Collaborator)

I looked through some of the recent TPU runs and did not see any more TPU session errors since you switched the code away from parameterized tests, which were running both 1-core and 8-core versions of the same test.

Some more context on 1-core vs 8-core tests:

  • 1-core tests use an in-process computation client which cleans up tensor handles asynchronously
  • this means that if a 1-core test runs and finishes and then an 8-core test starts up, the 8-core test will reset the TPU
  • depending on timing, the async cleanup process will look for those handles to clean up but might not find them, since the TPU was reset already. This leads to the XRT memory handle not found error

You might have already fixed the problem since I don't see it anymore. If you do see it again, the safest way to mix 1-core and 8-core tests in the same file is to use multiprocessing.

Davide wrote up an idea for how this might look:

import multiprocessing
import unittest


def mp_decorator(func):
    # Run the decorated test in its own process so each test gets an
    # isolated TPU environment.
    def wrapper(*args, **kwargs):
        proc = multiprocessing.Process(target=func, args=args, kwargs=kwargs)
        proc.start()
        proc.join()
        # args[0] is the TestCase instance; fail if the child process failed.
        args[0].assertEqual(proc.exitcode, 0)
    return wrapper


class WrappedTest(unittest.TestCase):
    @mp_decorator
    def test_bogus_good(self):
        ONE = 1
        TWO = 2
        self.assertTrue(ONE + 1 == TWO)

    @mp_decorator
    def test_bogus_bad(self):
        ONE = 1
        TWO = 2
        self.assertTrue(ONE == TWO)


if __name__ == '__main__':
    unittest.main()

If you run into more XRT session not found errors even without the parameterized tests, you might consider trying this wrapping approach.

@Borda (Author) commented Jul 17, 2020

It seems that the test with a specific index works but fails if we use only xm.xla_device() as the device:
Lightning-AI/pytorch-lightning#2632 (comment)

@dlibenzi (Collaborator)

Are you wrapping all tests with the proper decorator?

@zcain117 (Collaborator)

Looks like a different error, right?

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/content/pytorch-lightning/pytorch_lightning/trainer/distrib_parts.py", line 195, in tpu_train
    self._device = xm.xla_device(tpu_core_idx) if tpu_core_idx is not None else xm.xla_device()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 239, in xla_device
    torch_xla._XLAC._xla_set_default_device(device)
RuntimeError: Invalid device string: 'xla:EvalModelTemplate(
  (c_d1): Linear(in_features=784, out_features=1000, bias=True)
  (c_d1_bn): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (c_d1_drop): Dropout(p=0.2, inplace=False)
  (c_d2): Linear(in_features=1000, out_features=10, bias=True)
)'

I don't think this is related to the session not found error

@Borda (Author) commented Jul 17, 2020

> I don't think this is related to the session not found error

Yes, it seems that your proposal of wrapping the test helped, but another issue was raised...

@Borda (Author) commented Jul 17, 2020

> Old test:
>
> def test_multi_core_tpu_model(tmpdir, tpu_cores):
>     """Test if distributed TPU core training works"""
>     model = EvalModelTemplate()
>     trainer = Trainer(
>         default_root_dir=tmpdir,
>         max_epochs=1,
>         train_percent_check=0.4,
>         val_percent_check=0.2,
>         tpu_cores=tpu_cores,
>     )
>     trainer.fit(model)
>     assert trainer.tpu_id is None
>
> Maybe the problem is in tpipes.run_model_test?

In fact, the test was not checking whether the training finished correctly.

@dlibenzi (Collaborator)

Yeah, this has nothing to do with our code.
Look at the string value which gets passed in here:

  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 239, in xla_device
    torch_xla._XLAC._xla_set_default_device(device)
RuntimeError: Invalid device string: 'xla:EvalModelTemplate(
  (c_d1): Linear(in_features=784, out_features=1000, bias=True)
  (c_d1_bn): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (c_d1_drop): Dropout(p=0.2, inplace=False)
  (c_d2): Linear(in_features=1000, out_features=10, bias=True)
)'

The place that @zcain117 pointed out is one culprit.
You cannot instantiate XLA devices like that in multi-processing. You should just call xm.xla_device() without arguments.
I have been saying this for a while about Lightning's handling of those explicit TPU IDs. That code is not valid.

@williamFalcon commented Jul 18, 2020

Seems to work here
https://www.kaggle.com/lezwon/parallel-kfold-training-on-tpu-using-pytorch-li...

Doing this, we can parallelize 8 Lightning modules, each on a different core. Is this incorrect somehow? How does this work on Kaggle but not here?

@dlibenzi (Collaborator)

The code is not valid. Please do not make me repeat this one more time 😄
All the playing with TPU IDs done by Lightning is not valid.
If you need some "ID" for process-identification reasons, use xm.get_ordinal().
But you cannot pass that to xm.xla_device(). xm.xla_device() returns the device that the MP process has been assigned, and you cannot change that.
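
As a minimal sketch of that usage (a hypothetical _mp_fn for illustration, not Lightning code):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # The device is fixed per spawned process; never pass a core index here.
    device = xm.xla_device()
    # For process-identification purposes (logging, data sharding), use the ordinal.
    ordinal = xm.get_ordinal()
    print('ordinal {} runs on {}'.format(ordinal, device))

if __name__ == '__main__':
    xmp.spawn(_mp_fn, nprocs=8)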

@Borda closed this as completed Jul 21, 2020
@zcain117 (Collaborator)

> Seems to work here
> https://www.kaggle.com/lezwon/parallel-kfold-training-on-tpu-using-pytorch-li...
>
> Doing this, we can parallelize 8 Lightning modules, each on a different core. Is this incorrect somehow? How does this work on Kaggle but not here?

Here is an example Kaggle notebook that is closer to our recommended usage pattern: https://www.kaggle.com/abhishek/i-like-clean-tpu-training-kernels-i-can-not-lie/notebook

@williamFalcon

That's what we do...
the equivalent is
tpu_cores=8

What we are wondering is how @lezwon got this other version to work.

@lezwon (Contributor) commented Jul 22, 2020

> That's what we do...
> the equivalent is
> tpu_cores=8
>
> What we are wondering is how @lezwon got this other version to work.

The implementation I made was based on this kernel: https://www.kaggle.com/abhishek/super-duper-fast-pytorch-tpu-kernel
Seemed to work. :]

@zcain117 (Collaborator)

I think Abhishek's super-duper-fast kernel is manually spawning processes using Parallel. That notebook has import torch_xla.distributed.xla_multiprocessing as xmp but never uses it. Another issue is that it imports the parallel loader but never uses it: import torch_xla.distributed.parallel_loader as pl. I'm not 100% sure, but it's possible that this notebook is just running 8 independent processes and each one is learning its own weights in isolation.

You should instead use his Kaggle notebook here which uses xmp.spawn https://www.kaggle.com/abhishek/i-like-clean-tpu-training-kernels-i-can-not-lie/notebook

The general flow for xmp.spawn is that you write an mp_fn, which is code that will be sent to each of the 8 TPU cores; each core runs that code independently, but the xla_multiprocessing library shares gradients so that the processes work together to make training progress. With that in mind, you can see it does not make sense to call xm.xla_device() with arguments to choose a device, because the code is running on one core. Instead, you just use xm.xla_device() to figure out which device your code is currently running on.
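
For reference, a rough sketch of that flow (illustrative only; the model, data, and hyperparameters are placeholders, not Lightning code, and a real setup would also shard the data with a DistributedSampler):

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each of the 8 processes runs this function independently on its own core.
    torch.manual_seed(0)              # same initial weights in every process
    device = xm.xla_device()          # no arguments: torch_xla assigns the core
    model = nn.Linear(784, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    dataset = torch.utils.data.TensorDataset(
        torch.randn(256, 784), torch.randint(0, 10, (256,)))
    loader = torch.utils.data.DataLoader(dataset, batch_size=32)

    # ParallelLoader feeds batches to the TPU core in the background.
    device_loader = pl.ParallelLoader(loader, [device]).per_device_loader(device)
    for data, target in device_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        # xm.optimizer_step all-reduces gradients across the 8 processes
        # before stepping, so they train one shared model together.
        xm.optimizer_step(optimizer)

if __name__ == '__main__':
    xmp.spawn(_mp_fn, nprocs=8)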

@lezwon (Contributor) commented Jul 22, 2020

@zcain117 Both these approaches have been implemented in Lightning. If tpu_cores is 1 or 8, xmp.spawn is used. If it's [1], [2], ..., [8], the training proceeds on an individual TPU core in the current process. xm.xla_device() is used when the training happens via xmp.spawn; when it is an individual process, xm.xla_device(tpu_id) is used.

Ref: https://github.com/PyTorchLightning/pytorch-lightning/blob/a3934ad04b62686c8161158f66b5fcf3e56a3889/pytorch_lightning/trainer/trainer.py#L1026
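
For illustration only, a rough sketch of that dispatch (a hypothetical helper, not the actual Trainer code; note dlibenzi's point above that passing an explicit core index to xm.xla_device() is discouraged):

import torch_xla.distributed.xla_multiprocessing as xmp

def run_tpu_training(tpu_cores, train_fn):
    # Hypothetical dispatch helper mirroring the behaviour described above.
    if isinstance(tpu_cores, int):
        # tpu_cores == 1 or 8: spawn one process per core; inside train_fn,
        # xm.xla_device() is called with no arguments.
        xmp.spawn(train_fn, nprocs=tpu_cores)
    else:
        # tpu_cores == [n]: run in the current process on a single indexed
        # core, which train_fn would select via xm.xla_device(n)
        # (the torch_xla maintainers discourage this; see above).
        train_fn(tpu_cores[0])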

@williamFalcon

Yes, exactly.

The point is that we allow users to run one model on all 8 cores, or 8 models, each on its own core.

@zcain117 (Collaborator)

I see, thanks for the clarification. That sounds like the right usage. However, the error a few comments above shows this:

TPU available: True, using: 8 TPU cores
training on 8 TPU cores
Exception in device=TPU:0: Invalid device string: 'xla:EvalModelTemplate(
  (c_d1): Linear(in_features=784, out_features=1000, bias=True)
  (c_d1_bn): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (c_d1_drop): Dropout(p=0.2, inplace=False)
  (c_d2): Linear(in_features=1000, out_features=10, bias=True)
)'

This implies to me that the code is giving a string argument to xla_device() even in the 8-core case?

That string arg is coming from here. Maybe self.tpu_id is meant to be None for the 8-core case, but it isn't for some reason?

@lezwon (Contributor) commented Jul 22, 2020

That's odd, I have never seen that error before. As far as I know, there's only one place where tpu_id is assigned, and it seems fine: https://github.com/PyTorchLightning/pytorch-lightning/blob/6d10ac2ac84bc01f78eed381fd6152c869040b64/pytorch_lightning/trainer/trainer.py#L476

@Borda (Author) commented Jul 22, 2020

@zcain117 @lezwon let's move the discussion to Lightning-AI/pytorch-lightning#2632, as the error here is not valid anymore...
