
StopIteration: Caught StopIteration in replica 0 on device 0. #123

Open · codybai opened this issue Oct 28, 2020 · 6 comments

@codybai commented Oct 28, 2020

No description provided.

@tianylin98

nn.ParameterList is used in the code, and it appears to be incompatible with nn.DataParallel: the replicas end up with no visible parameters, so next(self.parameters()) raises StopIteration.

I think this is the problem.
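
For illustration, here is a minimal standalone sketch of that failure mode (a hypothetical Toy module, not code from this repo). With newer PyTorch versions (1.5+), nn.DataParallel replicas no longer expose their parameters through self.parameters(), and nn.ParameterList entries in particular are not replicated, so a lookup like next(self.parameters()) inside forward raises StopIteration:

import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # all parameters live in an nn.ParameterList, as in mem_transformer.py
        self.plist = nn.ParameterList([nn.Parameter(torch.zeros(4))])

    def forward(self, x):
        # mem_transformer.py's init_mems() does the same lookup to pick a dtype/device;
        # on a DataParallel replica this iterator can be empty -> StopIteration
        param = next(self.parameters())
        return x + param.sum()

# With >= 2 GPUs and a recent PyTorch, this is where
# "StopIteration: Caught StopIteration in replica 0 on device 0" shows up.
model = nn.DataParallel(Toy().cuda())
out = model(torch.ones(2, 4).cuda())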

@RodenLuo

I ran into the same error. I wonder if this has been solved? Thanks.

@tianylin98 (replying to @RodenLuo)

You can downgrade torch to 1.4.0, which works fine for me (hint: you might have to switch to an older CUDA toolkit as well).

@RodenLuo commented Apr 7, 2021

I can confirm it works with the following env:

name: pt1.4
channels:
  - pytorch
  - salilab
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_llvm
  - ca-certificates=2020.12.5=ha878542_0
  - certifi=2020.12.5=py38h578d9bd_1
  - cudatoolkit=10.1.243=h036e899_8
  - freetype=2.10.4=h0708190_1
  - jpeg=9d=h36c2ea0_0
  - lcms2=2.12=hddcbb42_0
  - ld_impl_linux-64=2.35.1=hea4e1c9_2
  - libblas=3.9.0=8_openblas
  - libcblas=3.9.0=8_openblas
  - libffi=3.3=h58526e2_2
  - libgcc-ng=9.3.0=h2828fa1_18
  - libgfortran-ng=9.3.0=hff62375_18
  - libgfortran5=9.3.0=hff62375_18
  - liblapack=3.9.0=8_openblas
  - libopenblas=0.3.12=pthreads_h4812303_1
  - libpng=1.6.37=h21135ba_2
  - libstdcxx-ng=9.3.0=h6de172a_18
  - libtiff=4.2.0=hdc55705_0
  - libwebp-base=1.2.0=h7f98852_2
  - llvm-openmp=11.1.0=h4bd325d_0
  - lz4-c=1.9.3=h9c3ff4c_0
  - mkl=2020.4=h726a3e6_304
  - ncurses=6.2=h58526e2_4
  - ninja=1.10.2=h4bd325d_0
  - numpy=1.20.1=py38h18fd61f_0
  - olefile=0.46=pyh9f0ad1d_1
  - openssl=1.1.1j=h7f98852_0
  - pillow=8.1.2=py38ha0e1e83_0
  - pip=21.0.1=pyhd8ed1ab_0
  - python=3.8.8=hffdb5ce_0_cpython
  - python_abi=3.8=1_cp38
  - pytorch=1.4.0=py3.8_cuda10.1.243_cudnn7.6.3_0
  - readline=8.0=he28a2e2_2
  - setuptools=49.6.0=py38h578d9bd_3
  - six=1.15.0=pyh9f0ad1d_0
  - sqlite=3.35.2=h74cdb3f_0
  - tk=8.6.10=h21135ba_1
  - torchvision=0.5.0=py38_cu101
  - wheel=0.36.2=pyhd3deb0d_0
  - xz=5.2.5=h516909a_1
  - zlib=1.2.11=h516909a_1010
  - zstd=1.4.9=ha95c52a_0

with more than 1 GPU card (otherwise one gets a division-by-zero error)

with mem_transformer.py line 754 changed to

loss = self.crit(pred_hid.reshape(-1, pred_hid.size(-1)), target.reshape(-1))

(use reshape instead of view)

the bash run_wt103_base.sh train --work_dir TRAIN_wt103 command runs through.
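
For context, .view() requires a compatible contiguous memory layout, while .reshape() falls back to a copy when the layout does not allow a view; presumably pred_hid is non-contiguous at that point. A standalone illustration (not project code):

import torch

t = torch.arange(12).reshape(3, 4).t()   # transpose -> non-contiguous layout
print(t.is_contiguous())                 # False
print(t.reshape(-1).shape)               # fine: reshape copies when it has to
try:
    t.view(-1)                           # view cannot, so it raises
except RuntimeError as e:
    print("view failed:", e)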

@zueigung1419

When running bash run_wt103_base.sh train --work_dir TRAIN_wt103, the same problem happens to me as well.
The PyTorch version is 1.12, the GPU is a 3090, and the CUDA version is 11.3.
One solution that works for me is as follows:
(1) define a dummy parameter in the __init__ function of the MemTransformerLM class at line 495:

self.null = nn.Parameter(torch.tensor(0.0))

(2) replace the init_mems(self) function with

def init_mems(self):
    if self.mem_len > 0:
        mems = []
        # param = next(self.parameters())
        for i in range(self.n_layer + 1):
            # empty = torch.empty(0, dtype=param.dtype, device=param.device)
            empty = torch.empty(0, dtype=self.null.dtype, device=self.null.device)
            mems.append(empty)
        return mems
    else:
        return None

(3) change line 754 from loss = self.crit(pred_hid.view(-1, pred_hid.size(-1)), target.view(-1)) to loss = self.crit(pred_hid.reshape(-1, pred_hid.size(-1)), target.reshape(-1)).

Note that all the changes are made in mem_transformer.py.
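
Putting the idea together on a toy module (a hypothetical sketch, not the project code): the point of the dummy parameter is that a plain nn.Parameter registered directly on the module remains reachable as an attribute on each DataParallel replica, so its dtype/device can be read without calling next(self.parameters()).

import torch
import torch.nn as nn

class ToyMem(nn.Module):                       # hypothetical stand-in for MemTransformerLM
    def __init__(self, n_layer=2, mem_len=4):
        super().__init__()
        self.n_layer, self.mem_len = n_layer, mem_len
        self.null = nn.Parameter(torch.tensor(0.0))  # dummy parameter, as in step (1)

    def init_mems(self):
        if self.mem_len > 0:
            # read dtype/device from the dummy parameter instead of
            # next(self.parameters()), as in step (2)
            return [torch.empty(0, dtype=self.null.dtype, device=self.null.device)
                    for _ in range(self.n_layer + 1)]
        return None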

@TrueNobility303 (replying to @zueigung1419's solution above)

Thanks a lot for this solution. I hit the same bug with pytorch=2.0.0, and this solution works well for me.
