terminate called after throwing an instance of 'c10::Error' #17

@kaihe

I am training ldm.models.diffusion.ddpm.LatentDiffusion on 4 GPUs with DDP. After around 30 epochs, training stopped with the following error:

`terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f082820c8b2 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f082845ef20 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f08281f7b7d in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: + 0x5f65b2 (0x7f08725575b2 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x13c2bc (0x55b1c22232bc in /root/miniconda3/envs/ldm/bin/python)
frame #5: + 0x1efd35 (0x55b1c22d6d35 in /root/miniconda3/envs/ldm/bin/python)
frame #6: PyObject_GC_Malloc + 0x88 (0x55b1c2223998 in /root/miniconda3/envs/ldm/bin/python)
frame #7: PyType_GenericAlloc + 0x3b (0x55b1c2293a8b in /root/miniconda3/envs/ldm/bin/python)
frame #8: + 0xc385 (0x7f08a1bbf385 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/bit_generator.cpython-38-x86_64-linux-gnu.so)
frame #9: + 0x13d585 (0x55b1c2224585 in /root/miniconda3/envs/ldm/bin/python)
frame #10: + 0xf97f (0x7f08a1bc297f in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/bit_generator.cpython-38-x86_64-linux-gnu.so)
frame #11: + 0xfb7e (0x7f08a1bc2b7e in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/bit_generator.cpython-38-x86_64-linux-gnu.so)
frame #12: + 0x1e857 (0x7f08a1bd1857 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/bit_generator.cpython-38-x86_64-linux-gnu.so)
frame #13: + 0x5f92c (0x55b1c214692c in /root/miniconda3/envs/ldm/bin/python)
frame #14: + 0x16fb40 (0x55b1c2256b40 in /root/miniconda3/envs/ldm/bin/python)
frame #15: + 0xe4d6 (0x7f08a17a84d6 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/mt19937.cpython-38-x86_64-linux-gnu.so)
frame #16: + 0x13d60c (0x55b1c222460c in /root/miniconda3/envs/ldm/bin/python)
frame #17: + 0x14231 (0x7f08a1bf4231 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/mtrand.cpython-38-x86_64-linux-gnu.so)
frame #18: + 0x21d0e (0x7f08a1c01d0e in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/numpy/random/mtrand.cpython-38-x86_64-linux-gnu.so)
frame #19: PyObject_MakeTpCall + 0x1a4 (0x55b1c22247d4 in /root/miniconda3/envs/ldm/bin/python)
frame #20: PyEval_EvalFrameDefault + 0x4596 (0x55b1c22abf56 in /root/miniconda3/envs/ldm/bin/python)
frame #21: PyEval_EvalCodeWithName + 0x2d2 (0x55b1c2271a92 in /root/miniconda3/envs/ldm/bin/python)
frame #22: PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #23: + 0x18be79 (0x55b1c2272e79 in /root/miniconda3/envs/ldm/bin/python)
frame #24: PyVectorcall_Call + 0x71 (0x55b1c2224041 in /root/miniconda3/envs/ldm/bin/python)
frame #25: PyEval_EvalFrameDefault + 0x1fdb (0x55b1c22a999b in /root/miniconda3/envs/ldm/bin/python)
frame #26: PyEval_EvalCodeWithName + 0x7df (0x55b1c2271f9f in /root/miniconda3/envs/ldm/bin/python)
frame #27: PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #28: + 0x18be79 (0x55b1c2272e79 in /root/miniconda3/envs/ldm/bin/python)
frame #29: PyVectorcall_Call + 0x71 (0x55b1c2224041 in /root/miniconda3/envs/ldm/bin/python)
frame #30: PyEval_EvalFrameDefault + 0x1fdb (0x55b1c22a999b in /root/miniconda3/envs/ldm/bin/python)
frame #31: PyEval_EvalCodeWithName + 0x7df (0x55b1c2271f9f in /root/miniconda3/envs/ldm/bin/python)
frame #32: PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #33: PyObject_FastCallDict + 0x24b (0x55b1c22734cb in /root/miniconda3/envs/ldm/bin/python)
frame #34: PyObject_Call_Prepend + 0x63 (0x55b1c2273733 in /root/miniconda3/envs/ldm/bin/python)
frame #35: + 0x18c83a (0x55b1c227383a in /root/miniconda3/envs/ldm/bin/python)
frame #36: PyObject_Call + 0x70 (0x55b1c2224200 in /root/miniconda3/envs/ldm/bin/python)
frame #37: PyEval_EvalFrameDefault + 0x1fdb (0x55b1c22a999b in /root/miniconda3/envs/ldm/bin/python)
frame #38: PyEval_EvalCodeWithName + 0x2d2 (0x55b1c2271a92 in /root/miniconda3/envs/ldm/bin/python)
frame #39: PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #40: PyObject_FastCallDict + 0x24b (0x55b1c22734cb in /root/miniconda3/envs/ldm/bin/python)
frame #41: PyObject_Call_Prepend + 0x63 (0x55b1c2273733 in /root/miniconda3/envs/ldm/bin/python)
frame #42: + 0x18c83a (0x55b1c227383a in /root/miniconda3/envs/ldm/bin/python)
frame #43: PyObject_MakeTpCall + 0x22f (0x55b1c222485f in /root/miniconda3/envs/ldm/bin/python)
frame #44: PyEval_EvalFrameDefault + 0x11d0 (0x55b1c22a8b90 in /root/miniconda3/envs/ldm/bin/python)
frame #45: PyFunction_Vectorcall + 0x10b (0x55b1c227286b in /root/miniconda3/envs/ldm/bin/python)
frame #46: + 0xba0de (0x55b1c21a10de in /root/miniconda3/envs/ldm/bin/python)
frame #47: + 0x17eb32 (0x55b1c2265b32 in /root/miniconda3/envs/ldm/bin/python)
frame #48: PyObject_GetItem + 0x49 (0x55b1c22568c9 in /root/miniconda3/envs/ldm/bin/python)
frame #49: PyEval_EvalFrameDefault + 0xbdd (0x55b1c22a859d in /root/miniconda3/envs/ldm/bin/python)
frame #50: PyEval_EvalCodeWithName + 0x659 (0x55b1c2271e19 in /root/miniconda3/envs/ldm/bin/python)
frame #51: PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #52: + 0xfeb84 (0x55b1c21e5b84 in /root/miniconda3/envs/ldm/bin/python)
frame #53: PyEval_EvalCodeWithName + 0x7df (0x55b1c2271f9f in /root/miniconda3/envs/ldm/bin/python)
frame #54: PyFunction_Vectorcall + 0x1e3 (0x55b1c2272943 in /root/miniconda3/envs/ldm/bin/python)
frame #55: + 0x10075e (0x55b1c21e775e in /root/miniconda3/envs/ldm/bin/python)
frame #56: PyFunction_Vectorcall + 0x10b (0x55b1c227286b in /root/miniconda3/envs/ldm/bin/python)
frame #57: PyVectorcall_Call + 0x71 (0x55b1c2224041 in /root/miniconda3/envs/ldm/bin/python)
frame #58: PyEval_EvalFrameDefault + 0x1fdb (0x55b1c22a999b in /root/miniconda3/envs/ldm/bin/python)
frame #59: PyFunction_Vectorcall + 0x10b (0x55b1c227286b in /root/miniconda3/envs/ldm/bin/python)
frame #60: + 0x10075e (0x55b1c21e775e in /root/miniconda3/envs/ldm/bin/python)
frame #61: PyEval_EvalCodeWithName + 0x2d2 (0x55b1c2271a92 in /root/miniconda3/envs/ldm/bin/python)
frame #62: + 0x18bd20 (0x55b1c2272d20 in /root/miniconda3/envs/ldm/bin/python)
frame #63: + 0x10011a (0x55b1c21e711a in /root/miniconda3/envs/ldm/bin/python)

Epoch 37: 69%|▋| 227/328 [18:34<08:13, 4.89s/it, loss=0.794, v_num=2, train/loss_simple_step=0.792, train/loss_vlb_step=0.0081, tra
Traceback (most recent call last):
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
    batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 101, in run
    super().run(batch, batch_idx, dataloader_idx)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 148, in advance
    result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 202, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 396, in _optimizer_step
    model_ref.optimizer_step(
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1618, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 296, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 303, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 226, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/optim/adamw.py", line 65, in step
    loss = closure()
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 236, in _training_step_and_backward_closure
    result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 537, in training_step_and_backward
    result = self._training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 307, in _training_step
    training_step_output = self.trainer.accelerator.training_step(step_kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 193, in training_step
    return self.training_type_plugin.training_step(*step_kwargs.values())
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 383, in training_step
    return self.model(*args, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 82, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 343, in training_step
    loss, loss_dict = self.shared_step(batch)
  File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 887, in shared_step
    x, c = self.get_input(batch, self.first_stage_key)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 661, in get_input
    z = self.get_first_stage_encoding(encoder_posterior).detach()
  File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 544, in get_first_stage_encoding
    z = encoder_posterior.sample()
  File "/root/Desktop/ldm/ldm/modules/distributions/distributions.py", line 36, in sample
    x = self.mean + self.std * torch.randn(self.mean.shape).to(device=self.parameters.device)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 21388) is killed by signal: Aborted.
`

I am sure it is related to this issue, but I was unable to fix it by setting rank_zero_only=True.
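
One thing that might be worth checking, as a diagnostic rather than a confirmed fix: the native trace passes through `CUDACachingAllocator::raw_delete` and the Python error names a DataLoader worker, so the abort may come from a forked worker process releasing a CUDA tensor it inherited from the parent. Under that assumption, starting the workers with the `spawn` method, or temporarily setting `num_workers=0`, should show whether the crash goes away. The helper name `dataloader_worker_kwargs` below is hypothetical and not part of the ldm codebase; it only builds keyword arguments that `torch.utils.data.DataLoader` accepts (`num_workers`, `multiprocessing_context`).

```python
import multiprocessing as mp


def dataloader_worker_kwargs(num_workers, start_method="spawn"):
    """Build worker-related kwargs for torch.utils.data.DataLoader.

    Hypothetical helper for debugging. Assumption: the crash is caused by
    fork()-ed workers touching CUDA state inherited from the parent process.
    """
    if num_workers == 0:
        # Single-process loading: no worker processes exist, so nothing is forked.
        return {"num_workers": 0}
    if start_method not in mp.get_all_start_methods():
        raise ValueError(f"unsupported start method: {start_method!r}")
    # "spawn" starts every worker in a fresh interpreter instead of fork().
    return {"num_workers": num_workers, "multiprocessing_context": start_method}


# Intended use (requires PyTorch; `dataset` and `batch_size` are placeholders):
#   loader = DataLoader(dataset, batch_size=batch_size, **dataloader_worker_kwargs(4))
```

With `spawn`, the dataset object has to be picklable and worker startup is slower, so this is mainly useful for narrowing down the cause.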

Any help is appreciated.
