
terminate called after throwing an instance of 'std::runtime_error #2338

Closed · Borda opened this issue Jul 9, 2020 · 31 comments

@Borda commented Jul 9, 2020

🐛 Bug

We are running tests on Colab and GCP with TPU, and they started failing a few days ago (not sure if it is a pure XLA issue or our bad usage). The two environments yield the very same error:

  • Colab hangs on the 5th test, and if I stop the cell it gives the error
  • GCP fails straight away...

The output error is:

terminate called after throwing an instance of 'std::runtime_error'
  what():  tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1104 : Check failed: session->session()->Run( feed_inputs, {}, {cached_node.operations[0]}, &outputs) == ::tensorflow::Status::OK() (Not found: From /job:tpu_worker/replica:0/task:0:
XRT memory handle not found: 347330053282930
	 [[{{node XRTReleaseAllocationHandle}}]] vs. OK)
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
	xla::XrtComputationClient::HandleReleaser()
	xla::util::TriggeredTask::Runner()
	clone
*** End stack trace ***

To Reproduce

Here is the Colab notebook: https://colab.research.google.com/drive/1Gr1Wg4zVnu15WHE_-dU2YKr4Z5xsy-fL#scrollTo=Mx61q3X5bwoW
and this is the output from GCP: https://github.com/PyTorchLightning/pytorch-lightning/runs/854754135?check_suite_focus=true

Steps to reproduce the behavior:

  1. ! git clone https://github.com/PyTorchLightning/pytorch-lightning.git
  2. ! pip install -r requirements/devel.txt -q -U
  3. ! curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
  4. ! python pytorch-xla-env-setup.py --version "20200708" #@param ["20200708","nightly", "xrt==1.15.0"]
  5. ! python -m pytest tests/models/test_tpu.py -v

Additional context

@zcain117 (Collaborator) commented Jul 9, 2020

Davide suspected, just based on the error, that maybe the main process was doing some kind of TPU work before calling spawn. The initial TPU state then gets interrupted and leads to the error you see.
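
To make that hypothesis concrete, here is a minimal sketch of the kind of pattern being suspected (hypothetical code for illustration, not the actual Lightning test code):

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Suspected anti-pattern: the parent process touches the TPU before spawning,
# creating XRT allocations owned by the parent ...
device = xm.xla_device()
probe = torch.ones(1, device=device)

def _mp_fn(index):
    # ... the spawned workers then re-initialize the TPU, and the parent's
    # deferred handle releases can fail with "XRT memory handle not found".
    _ = xm.xla_device()

if __name__ == '__main__':
    xmp.spawn(_mp_fn, nprocs=8)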

@zcain117 (Collaborator) commented Jul 9, 2020

I tried running with an earlier version of pytorch/xla (the June 30 build) and got the same error.

This is the commit where the tests started failing: Lightning-AI/pytorch-lightning@11069c8

It looks like the failing test was added in that commit, is that right?

@zcain117 (Collaborator) commented Jul 9, 2020

It looks like the failing test added in that commit is very similar to your old test, which I think is still passing.

New test:

def test_base_tpu_model(tmpdir, tpu_cores):
    """Make sure model trains on TPU."""
    trainer_options = dict(
        default_root_dir=tmpdir,
        progress_bar_refresh_rate=0,
        max_epochs=1,
        tpu_cores=tpu_cores,
        limit_train_batches=0.4,
        limit_val_batches=0.4
    )

    model = EvalModelTemplate()
    tpipes.run_model_test(trainer_options, model, on_gpu=False, with_hpc=False)

Old test:

def test_multi_core_tpu_model(tmpdir, tpu_cores):
    """Test if distributed TPU core training works"""
    model = EvalModelTemplate()
    trainer = Trainer(
        default_root_dir=tmpdir,
        max_epochs=1,
        train_percent_check=0.4,
        val_percent_check=0.2,
        tpu_cores=tpu_cores,
    )
    trainer.fit(model)
    assert trainer.tpu_id is None

Maybe the problem is in tpipes.run_model_test?

@zcain117 (Collaborator)

Another idea: does tpipes introduce any kind of attempt to run tests in parallel?

I think if it tried to kick off a TPU test while another is running you would see the session not found error. It might explain why the job fails at different points on GCP vs Colab; maybe the parallel test starts up at a different time on the different platforms.

You could try disabling parallelism.

Another good confirmation would be just to disable the tpipes test altogether and see whether the tests pass or not.

@dlibenzi (Collaborator)

I already mentioned this to William ... their use of tpu_cores created issues in the past:

tpu_cores=tpu_cores,

@williamFalcon commented Jul 10, 2020

This test is now failing because it was never actually running before...

The pytest.mark.spawn decorator was, for some reason, skipping the test. To see this, run the test with assert False and it will still pass.

@dlibenzi the tpu_cores argument can either be 1 or 8... I believe someone enabled indexing into a core to run on a specific core on Kaggle, but I haven't seen that work on Colab.

# use 1 or 8 cores
tpu_cores=1
tpu_cores=8

Or use the index of a specific core:

# use 1 core (indexed by 2, so the third core)
tpu_cores=[2]

@dlibenzi (Collaborator)

@zcain117 Can you point me to the test code?

@zcain117 (Collaborator)

> @zcain117 Can you point me to the test code?

This is the commit where the tests started failing: Lightning-AI/pytorch-lightning@11069c8
Ctrl-F for test_tpu.py.

The wrapper code for run_model_test lives here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/base/develop_pipelines.py#L37

I'm wondering if the seed reset or logging setup in run_model_test is causing issues. If I understand correctly, @williamFalcon is saying that the code works for model training when run outside of the unit tests.

@dlibenzi (Collaborator)

If a single test process mixes single-device and multi-device tests, you need to wrap each test call in a multiprocessing Process, so that every test is isolated in its own environment.

@zcain117 (Collaborator)

I looked through some of the recent TPU runs and did not see any more TPU session errors since you switched the code away from parameterized tests, which were running both 1-core and 8-core versions of the same test.

Some more context on 1-core vs 8-core tests:

  • 1-core tests use an in-process computation client which cleans up tensor handles asynchronously
  • this means that if a 1-core test runs and finishes and then an 8-core test starts up, the 8-core test will reset the TPU
  • depending on timing, the async cleanup process will look for those handles to clean up but might not find them, since the TPU was reset already. This leads to the XRT memory handle not found error

You might have already fixed the problem since I don't see it anymore. If you do see it again, the safest way to mix 1-core and 8-core tests in the same file is to use multiprocessing.

Davide wrote up an idea for how this might look:

import multiprocessing
import unittest


def mp_decorator(func):
    # Run the decorated test in its own process so each test gets an
    # isolated TPU environment.
    def wrapper(*args, **kwargs):
        proc = multiprocessing.Process(target=func, args=args, kwargs=kwargs)
        proc.start()
        proc.join()
        # args[0] is the TestCase instance; fail if the child process failed.
        args[0].assertEqual(proc.exitcode, 0)
    return wrapper


class WrappedTest(unittest.TestCase):
    @mp_decorator
    def test_bogus_good(self):
        ONE = 1
        TWO = 2
        self.assertTrue(ONE + 1 == TWO)

    @mp_decorator
    def test_bogus_bad(self):
        ONE = 1
        TWO = 2
        self.assertTrue(ONE == TWO)


if __name__ == '__main__':
    unittest.main()

If you run into more XRT session not found errors even without the parameterized tests, you might consider trying this wrapping approach.

@Borda (Author) commented Jul 17, 2020

It seems that the test with a specific index works but fails if we use only xm.xla_device() as the device:
Lightning-AI/pytorch-lightning#2632 (comment)

@dlibenzi (Collaborator)

Are you wrapping all tests with the proper decorator?

@zcain117 (Collaborator)

Looks like a different error, right?

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/content/pytorch-lightning/pytorch_lightning/trainer/distrib_parts.py", line 195, in tpu_train
    self._device = xm.xla_device(tpu_core_idx) if tpu_core_idx is not None else xm.xla_device()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 239, in xla_device
    torch_xla._XLAC._xla_set_default_device(device)
RuntimeError: Invalid device string: 'xla:EvalModelTemplate(
  (c_d1): Linear(in_features=784, out_features=1000, bias=True)
  (c_d1_bn): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (c_d1_drop): Dropout(p=0.2, inplace=False)
  (c_d2): Linear(in_features=1000, out_features=10, bias=True)
)'

I don't think this is related to the session not found error

@Borda (Author) commented Jul 17, 2020

> I don't think this is related to the session not found error

Yes, it seems that your proposal of wrapping the test helped, but another issue was raised...

@Borda (Author) commented Jul 17, 2020

> Old test:
>
> def test_multi_core_tpu_model(tmpdir, tpu_cores):
>     """Test if distributed TPU core training works"""
>     model = EvalModelTemplate()
>     trainer = Trainer(
>         default_root_dir=tmpdir,
>         max_epochs=1,
>         train_percent_check=0.4,
>         val_percent_check=0.2,
>         tpu_cores=tpu_cores,
>     )
>     trainer.fit(model)
>     assert trainer.tpu_id is None
>
> Maybe the problem is in tpipes.run_model_test?

In fact, the test was not checking whether the training finished correctly.

@dlibenzi (Collaborator)

Yeah, this has nothing to do with our code.
Look at the string value which gets passed in here:

  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 239, in xla_device
    torch_xla._XLAC._xla_set_default_device(device)
RuntimeError: Invalid device string: 'xla:EvalModelTemplate(
  (c_d1): Linear(in_features=784, out_features=1000, bias=True)
  (c_d1_bn): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (c_d1_drop): Dropout(p=0.2, inplace=False)
  (c_d2): Linear(in_features=1000, out_features=10, bias=True)
)'

The place that @zcain117 pointed out is one culprit.
You cannot instantiate XLA devices like that in multi-processing. You should just call xm.xla_device() without arguments.
I have been saying this for a while about Lightning's handling of those explicit TPU IDs. That code is not valid.

@williamFalcon commented Jul 18, 2020

Seems to work here
https://www.kaggle.com/lezwon/parallel-kfold-training-on-tpu-using-pytorch-li...

Doing this, we can parallelize 8 Lightning modules, each on a different core. Is this incorrect somehow? How does this work on Kaggle but not here?

@dlibenzi (Collaborator)

The code is not valid. Please do not make me repeat this one more time 😄
All the playing with TPU IDs done by Lightning is not valid.
If you need some "ID" for process-identification reasons, use xm.get_ordinal().
But you cannot pass that to xm.xla_device(). xm.xla_device() returns the device that the MP process has been assigned, and you cannot change that.
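
As a minimal sketch of that usage (a hypothetical _mp_fn for illustration, not Lightning code):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # The device is fixed per spawned process; never pass a core index here.
    device = xm.xla_device()
    # For process-identification purposes (logging, data sharding), use the ordinal.
    ordinal = xm.get_ordinal()
    print('ordinal {} runs on {}'.format(ordinal, device))

if __name__ == '__main__':
    xmp.spawn(_mp_fn, nprocs=8)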

@Borda closed this as completed Jul 21, 2020
@zcain117 (Collaborator)

> Seems to work here
> https://www.kaggle.com/lezwon/parallel-kfold-training-on-tpu-using-pytorch-li...
>
> Doing this, we can parallelize 8 Lightning modules, each on a different core. Is this incorrect somehow? How does this work on Kaggle but not here?

Here is an example Kaggle notebook that is closer to our recommended usage pattern: https://www.kaggle.com/abhishek/i-like-clean-tpu-training-kernels-i-can-not-lie/notebook

@williamFalcon

That's what we do...
the equivalent is
tpu_cores=8

What we are wondering is how @lezwon got this other version to work.

@lezwon (Contributor) commented Jul 22, 2020

> That's what we do...
> the equivalent is
> tpu_cores=8
>
> What we are wondering is how @lezwon got this other version to work.

The implementation I made was based on this kernel: https://www.kaggle.com/abhishek/super-duper-fast-pytorch-tpu-kernel
Seemed to work. :]

@zcain117 (Collaborator)

I think Abhishek's super-duper-fast kernel is manually spawning processes using Parallel. That notebook has import torch_xla.distributed.xla_multiprocessing as xmp but never uses it. Another issue is that it imports the parallel loader but never uses it: import torch_xla.distributed.parallel_loader as pl. I'm not 100% sure, but it's possible that this notebook is just running 8 independent processes and each one is learning its own weights in isolation.

You should instead use his Kaggle notebook here which uses xmp.spawn https://www.kaggle.com/abhishek/i-like-clean-tpu-training-kernels-i-can-not-lie/notebook

The general flow for xmp.spawn is that you write an mp_fn, which is code that will be sent to each of the 8 TPU cores; each core runs that code independently, but the xla_multiprocessing library shares gradients so that the processes work together to make training progress. With that in mind, you can see it does not make sense to call xm.xla_device() with arguments to choose a device, because the code is running on one core. Instead, you just use xm.xla_device() to figure out which device your code is currently running on.
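
For reference, a rough sketch of that flow (illustrative only; the model, data, and hyperparameters are placeholders, not Lightning code, and a real setup would also shard the data with a DistributedSampler):

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each of the 8 processes runs this function independently on its own core.
    torch.manual_seed(0)              # same initial weights in every process
    device = xm.xla_device()          # no arguments: torch_xla assigns the core
    model = nn.Linear(784, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    dataset = torch.utils.data.TensorDataset(
        torch.randn(256, 784), torch.randint(0, 10, (256,)))
    loader = torch.utils.data.DataLoader(dataset, batch_size=32)

    # ParallelLoader feeds batches to the TPU core in the background.
    device_loader = pl.ParallelLoader(loader, [device]).per_device_loader(device)
    for data, target in device_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        # xm.optimizer_step all-reduces gradients across the 8 processes
        # before stepping, so they train one shared model together.
        xm.optimizer_step(optimizer)

if __name__ == '__main__':
    xmp.spawn(_mp_fn, nprocs=8)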

@lezwon (Contributor) commented Jul 22, 2020

@zcain117 Both these approaches have been implemented in Lightning. If tpu_cores is 1 or 8, xmp.spawn is used. If it's [1], [2], ..., [8], the training proceeds on an individual TPU core in the current process. xm.xla_device() is used when the training happens via xmp.spawn; when it is an individual process, xm.xla_device(tpu_id) is used.

Ref: https://github.com/PyTorchLightning/pytorch-lightning/blob/a3934ad04b62686c8161158f66b5fcf3e56a3889/pytorch_lightning/trainer/trainer.py#L1026
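
For illustration only, a rough sketch of that dispatch (a hypothetical helper, not the actual Trainer code; note dlibenzi's point above that passing an explicit core index to xm.xla_device() is discouraged):

import torch_xla.distributed.xla_multiprocessing as xmp

def run_tpu_training(tpu_cores, train_fn):
    # Hypothetical dispatch helper mirroring the behaviour described above.
    if isinstance(tpu_cores, int):
        # tpu_cores == 1 or 8: spawn one process per core; inside train_fn,
        # xm.xla_device() is called with no arguments.
        xmp.spawn(train_fn, nprocs=tpu_cores)
    else:
        # tpu_cores == [n]: run in the current process on a single indexed
        # core, which train_fn would select via xm.xla_device(n)
        # (the torch_xla maintainers discourage this; see above).
        train_fn(tpu_cores[0])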

@williamFalcon

Yes, exactly.

The point is that we allow users to run one model on all 8 cores, or 8 models, each on its own core.

@zcain117 (Collaborator)

I see, thanks for the clarification. That sounds like the right usage. However, the error a few comments above shows this:

TPU available: True, using: 8 TPU cores
training on 8 TPU cores
Exception in device=TPU:0: Invalid device string: 'xla:EvalModelTemplate(
  (c_d1): Linear(in_features=784, out_features=1000, bias=True)
  (c_d1_bn): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (c_d1_drop): Dropout(p=0.2, inplace=False)
  (c_d2): Linear(in_features=1000, out_features=10, bias=True)
)'

This implies to me that the code is giving a string argument to xla_device() even in the 8-core case?

That string arg is coming from here. Maybe self.tpu_id is meant to be None for the 8-core case, but it isn't for some reason?

@lezwon (Contributor) commented Jul 22, 2020

That's odd, I have never seen that error before. As far as I know, there's only one place where tpu_id is assigned, and it seems fine: https://github.com/PyTorchLightning/pytorch-lightning/blob/6d10ac2ac84bc01f78eed381fd6152c869040b64/pytorch_lightning/trainer/trainer.py#L476

@Borda (Author) commented Jul 22, 2020

@zcain117 @lezwon let's move the discussion to Lightning-AI/pytorch-lightning#2632, as the error here is not valid anymore...
