Exception: process 3 terminated with signal SIGSEGV #2124

Closed
sambaths opened this issue Jun 8, 2020 · 4 comments · Fixed by #2632
Labels
question (Further information is requested)

Comments

sambaths commented Jun 8, 2020

Hi,
I have been running into this issue on Colab while training a GAN.

Environment:
pytorch/xla: nightly
pytorch-lightning: master

Error:

File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 871, in fit
    xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 296, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 3 terminated with signal SIGSEGV

I tried generating the metrics report; this is what I got, and the same error was raised along with it (a sketch of the call used to produce such a report follows the dump).

Metric: XrtAllocateFromTensor
  TotalSamples: 4032
  Accumulator: 18s194ms721.992us
  Mean: 004ms953.532us
  StdDev: 020ms533.255us
  Rate: 26.1545 / second
  Percentiles: 25%=627.008us; 50%=915.623us; 80%=005ms490.373us; 90%=009ms091.132us; 95%=010ms757.279us; 99%=012ms360.721us
Metric: XrtCompile
  TotalSamples: 176
  Accumulator: 11m49s071ms079.670us
  Mean: 04s688ms903.862us
  StdDev: 05s594ms399.924us
  Rate: 0.270883 / second
  Percentiles: 25%=003ms325.412us; 50%=005ms311.267us; 80%=09s375ms765.141us; 90%=10s084ms749.219us; 95%=10s242ms795.199us; 99%=13s303ms006.740us
Metric: XrtExecute
  TotalSamples: 1522
  Accumulator: 02m59s496ms518.753us
  Mean: 078ms760.743us
  StdDev: 057ms404.221us
  Rate: 1.75372 / second
  Percentiles: 25%=002ms954.943us; 50%=087ms484.699us; 80%=104ms102.739us; 90%=184ms763.321us; 95%=184ms169.253us; 99%=185ms319.953us
Metric: XrtExecutorEvict
  TotalSamples: 0
  Accumulator: nanB
  Mean: nanB
  StdDev: nanB
  Percentiles: 
Metric: XrtReadLiteral
  TotalSamples: 1364
  Accumulator: 954ms286.840us
  Mean: 701.158us
  StdDev: 324.895us
  Rate: 1.75001 / second
  Percentiles: 25%=545.918us; 50%=647.467us; 80%=791.931us; 90%=905.633us; 95%=001ms149.656us; 99%=002ms402.074us
Metric: XrtReleaseAllocation
  TotalSamples: 3645
  Accumulator: 447ms619.909us
  Mean: 152.585us
  StdDev: 235.781us
  Rate: 207.996 / second
  Percentiles: 25%=023.275us; 50%=036.971us; 80%=236.681us; 90%=452.955us; 95%=695.044us; 99%=001ms069.589us
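
For reference, a report like the one above is typically produced with torch_xla's debug metrics helper; a minimal sketch (standard torch_xla API, not code taken from this issue):

import torch_xla.debug.metrics as met

# Print the accumulated XLA client metrics (XrtCompile, XrtExecute, ...)
# for the current process.
print(met.metrics_report())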

GPU available: False, used: False
No environment variable for node rank defined. Set as 0.
Using 16bit precision.
training on 8 TPU cores
Traceback (most recent call last):
  File "train.py", line 155, in <module>
    trainer.fit(gan_model) 
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 871, in fit
    xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 296, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 228, in _start_fn
    _setup_replication()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 221, in _setup_replication
    xm.set_replication(device, [device])
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 233, in set_replication
    replication_devices = xla_replication_devices(devices)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 206, in xla_replication_devices
    format(len(local_devices), len(kind_devices)))
RuntimeError: Cannot replicate if number of devices (1) is different from 8
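
The RuntimeError above comes from torch_xla's replication setup inside xmp.spawn. As a rough illustration of the kind of call pytorch-lightning makes internally (a sketch, not the library's actual code; _mp_fn is a hypothetical worker function):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process should see its own TPU core as an XLA device.
    device = xm.xla_device()
    print(index, device)

# nprocs must agree with the number of TPU cores the runtime exposes;
# when a process ends up seeing only 1 device while 8-way replication is
# requested, torch_xla raises "Cannot replicate if number of devices (1)
# is different from 8".
xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')

On Colab notebooks start_method='fork' is the commonly used setting; the exact value here is whatever the trainer passes as start_method in the traceback above.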

sambaths added the question (Further information is requested) label on Jun 8, 2020

github-actions bot commented Jun 8, 2020

Hi! Thanks for your contribution! Great first issue!

@williamFalcon

Update to master and try again?

sambaths commented Jun 9, 2020

I tried updating from GitHub, but now the code gets stuck at checkpoint_callback: the trainer tries to create a checkpoint but doesn't go beyond that.
I have run into this before as well (code stuck), but most of the time it throws the error above (Exception: process 3 terminated with signal SIGSEGV), and only sometimes does it get stuck at this stage.

After removing checkpoint_callback (see the sketch after this comment), the code runs through all the epochs, but once they are finished it hangs again and never stops executing.

In both cases the code gets stuck and the cell never finishes executing.
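
A minimal sketch of the setup being described, with checkpointing disabled as the workaround (the argument names tpu_cores, precision, and checkpoint_callback are assumptions about the 0.8-era pytorch-lightning Trainer API, and gan_model stands in for the user's GAN module):

from pytorch_lightning import Trainer

trainer = Trainer(
    tpu_cores=8,                # spawn one process per TPU core
    precision=16,               # matches "Using 16bit precision." in the log
    checkpoint_callback=False,  # disable checkpointing, as described above
)
trainer.fit(gan_model)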

Borda commented Jul 27, 2020

That is a spawn issue; it should be fixed by #2632.
