Exception: process 3 terminated with signal SIGSEGV #2124

Closed
sambaths opened this issue Jun 8, 2020 · 4 comments · Fixed by #2632
Labels
question (Further information is requested)

Comments

sambaths commented Jun 8, 2020

Hi,
I have been running into this issue on Colab while training a GAN.

Environment:
pytorch/xla: nightly
pytorch-lightning: master

Error:

File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 871, in fit
    xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 296, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 3 terminated with signal SIGSEGV

I tried generating the metrics report; this is what I got, and the same error was raised along with it (a sketch of the call used to produce such a report follows the dump).

Metric: XrtAllocateFromTensor
  TotalSamples: 4032
  Accumulator: 18s194ms721.992us
  Mean: 004ms953.532us
  StdDev: 020ms533.255us
  Rate: 26.1545 / second
  Percentiles: 25%=627.008us; 50%=915.623us; 80%=005ms490.373us; 90%=009ms091.132us; 95%=010ms757.279us; 99%=012ms360.721us
Metric: XrtCompile
  TotalSamples: 176
  Accumulator: 11m49s071ms079.670us
  Mean: 04s688ms903.862us
  StdDev: 05s594ms399.924us
  Rate: 0.270883 / second
  Percentiles: 25%=003ms325.412us; 50%=005ms311.267us; 80%=09s375ms765.141us; 90%=10s084ms749.219us; 95%=10s242ms795.199us; 99%=13s303ms006.740us
Metric: XrtExecute
  TotalSamples: 1522
  Accumulator: 02m59s496ms518.753us
  Mean: 078ms760.743us
  StdDev: 057ms404.221us
  Rate: 1.75372 / second
  Percentiles: 25%=002ms954.943us; 50%=087ms484.699us; 80%=104ms102.739us; 90%=184ms763.321us; 95%=184ms169.253us; 99%=185ms319.953us
Metric: XrtExecutorEvict
  TotalSamples: 0
  Accumulator: nanB
  Mean: nanB
  StdDev: nanB
  Percentiles: 
Metric: XrtReadLiteral
  TotalSamples: 1364
  Accumulator: 954ms286.840us
  Mean: 701.158us
  StdDev: 324.895us
  Rate: 1.75001 / second
  Percentiles: 25%=545.918us; 50%=647.467us; 80%=791.931us; 90%=905.633us; 95%=001ms149.656us; 99%=002ms402.074us
Metric: XrtReleaseAllocation
  TotalSamples: 3645
  Accumulator: 447ms619.909us
  Mean: 152.585us
  StdDev: 235.781us
  Rate: 207.996 / second
  Percentiles: 25%=023.275us; 50%=036.971us; 80%=236.681us; 90%=452.955us; 95%=695.044us; 99%=001ms069.589us
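
For reference, a report like the one above is typically produced with torch_xla's debug metrics helper; a minimal sketch (standard torch_xla API, not code taken from this issue):

import torch_xla.debug.metrics as met

# Print the accumulated XLA client metrics (XrtCompile, XrtExecute, ...)
# for the current process.
print(met.metrics_report())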

GPU available: False, used: False
No environment variable for node rank defined. Set as 0.
Using 16bit precision.
training on 8 TPU cores
Traceback (most recent call last):
  File "train.py", line 155, in <module>
    trainer.fit(gan_model) 
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 871, in fit
    xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 296, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 228, in _start_fn
    _setup_replication()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 221, in _setup_replication
    xm.set_replication(device, [device])
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 233, in set_replication
    replication_devices = xla_replication_devices(devices)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 206, in xla_replication_devices
    format(len(local_devices), len(kind_devices)))
RuntimeError: Cannot replicate if number of devices (1) is different from 8
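
The RuntimeError above comes from torch_xla's replication setup inside xmp.spawn. As a rough illustration of the kind of call pytorch-lightning makes internally (a sketch, not the library's actual code; _mp_fn is a hypothetical worker function):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process should see its own TPU core as an XLA device.
    device = xm.xla_device()
    print(index, device)

# nprocs must agree with the number of TPU cores the runtime exposes;
# when a process ends up seeing only 1 device while 8-way replication is
# requested, torch_xla raises "Cannot replicate if number of devices (1)
# is different from 8".
xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')

On Colab notebooks start_method='fork' is the commonly used setting; the exact value here is whatever the trainer passes as start_method in the traceback above.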

sambaths added the question (Further information is requested) label on Jun 8, 2020

github-actions bot commented Jun 8, 2020

Hi! Thanks for your contribution! Great first issue!

@williamFalcon

Update to master and try again?

sambaths commented Jun 9, 2020

I tried updating from GitHub, but now the code gets stuck at checkpoint_callback: the trainer tries to create a checkpoint but doesn't go beyond that.
I have run into this before as well (code stuck), but most of the time it throws the error above (Exception: process 3 terminated with signal SIGSEGV), and only sometimes does it get stuck at this stage.

After removing checkpoint_callback (see the sketch after this comment), the code runs through all the epochs, but once they are finished it hangs again and never stops executing.

In both cases the code gets stuck and the cell never finishes executing.
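
A minimal sketch of the setup being described, with checkpointing disabled as the workaround (the argument names tpu_cores, precision, and checkpoint_callback are assumptions about the 0.8-era pytorch-lightning Trainer API, and gan_model stands in for the user's GAN module):

from pytorch_lightning import Trainer

trainer = Trainer(
    tpu_cores=8,                # spawn one process per TPU core
    precision=16,               # matches "Using 16bit precision." in the log
    checkpoint_callback=False,  # disable checkpointing, as described above
)
trainer.fit(gan_model)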

Borda commented Jul 27, 2020

That is a spawn issue; it should be fixed by #2632.
