
TypeError: forward() missing 1 required positional argument: 'x' #1074

Closed
junglezhao opened this issue Apr 20, 2020 · 30 comments
Labels
bug Something isn't working Stale

Comments

@junglezhao

junglezhao commented Apr 20, 2020

🐛 Bug

Hi guys, when test.py reaches 99%, it raises the following error (I haven't changed any of the files):

Traceback (most recent call last):
  File "test.py", line 255, in <module>
    opt.augment
  File "test.py", line 94, in test
    inf_out, train_out = model(imgs, augment=augment)
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'

@junglezhao junglezhao added the bug Something isn't working label Apr 20, 2020
@github-actions

github-actions bot commented Apr 20, 2020

Hello @junglezhao, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

@qtw1998

qtw1998 commented Apr 20, 2020

use --augment

@glenn-jocher
Member

@junglezhao I would make sure your code is up to date using git pull, and if the issue persists please provide a minimal reproducible example.

@glenn-jocher
Member

glenn-jocher commented Apr 21, 2020

@qtw1998 @junglezhao yes, an augment boolean can be passed to the model() forward method to conduct augmented inference for higher recall and better mAP, but it is not a required argument, as a default value of False is supplied. Nevertheless, you can run augmented inference from the command line with the --augment argparser argument:

python3 test.py --augment
python3 detect.py --augment

def forward(self, x, augment=False, verbose=False):
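
For programmatic use, here is a minimal sketch (assuming the Darknet model from this repo's models.py; the cfg path and input shape are illustrative):

import torch
from models import Darknet  # this repo's model definition

model = Darknet('cfg/yolov3.cfg')   # illustrative cfg path
model.eval()                        # eval mode returns (inference, training) outputs

imgs = torch.zeros(1, 3, 416, 416)  # dummy batch: one 416x416 RGB image
with torch.no_grad():
    inf_out, train_out = model(imgs)                 # augment defaults to False
    inf_out, train_out = model(imgs, augment=True)   # augmented inference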

@junglezhao
Author

use --augment

OK, thanks. I chose to re-download the repo and reset the config to solve this problem.

@Rajat-Mehta

Rajat-Mehta commented Apr 26, 2020

I am also getting a similar error. I followed the instructions for training yolov3 on a custom dataset and prepared my dataset in the required format. When I start training, I get the following error:

Traceback (most recent call last):
  File "train.py", line 422, in <module>
    train()  # train normally
  File "train.py", line 317, in train
    dataloader=testloader)
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/test.py", line 94, in test
    inf_out, train_out = model(imgs, augment=augment)  # inference and training outputs
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 3 on device 3.
Original Traceback (most recent call last):
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'

I think the error occurs during testing, but I have no idea why. What could be the reason for this error?

@leoll2

leoll2 commented Apr 26, 2020

I encountered the same bug when testing (on 8 GPUs), in the last minibatch to be precise.
A workaround is to skip the last test iteration: not really the definitive solution, but it works.
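
Here is a minimal sketch of that workaround inside test.py's evaluation loop (the variable names mirror the repo's test.py; note that it simply drops the final batch, so those images are excluded from the metrics):

n_batches = len(dataloader)
for batch_i, (imgs, targets, paths, shapes) in enumerate(dataloader):
    if batch_i == n_batches - 1:
        continue  # skip the last, possibly undersized, batch
    with torch.no_grad():
        inf_out, train_out = model(imgs.to(device), augment=augment)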

@glenn-jocher
Member

glenn-jocher commented Apr 26, 2020

@leoll2 @Rajat-Mehta your code may be out of date, I would advise a git pull or to reclone the current repo.

@Rajat-Mehta

@glenn-jocher I already tried to pull the latest code. That did not solve my problem.

This error is encountered when training and testing on multiple GPUs; training on a single GPU resolved it for me.

@glenn-jocher
Member

@Rajat-Mehta ok, thank you. Are you able to reproduce the error on an open dataset like coco64.data? If so, please send us exact code to reproduce it and we can get started debugging.

@Rajat-Mehta

I updated PyTorch from 1.4 to 1.5, and now training does not work on multiple GPUs even on the COCO dataset. Training works fine on a single GPU.

@glenn-jocher
Member

glenn-jocher commented May 2, 2020

Reproduce Our Environment

To access an up-to-date working environment (with all dependencies, including CUDA/CUDNN, Python and PyTorch, preinstalled), consider one of the environments linked above: the Google Colab Notebook, Docker Image, or GCP Quickstart Guide.

@berkerlogoglu

@glenn-jocher, @junglezhao, @leoll2 I can confirm that this bug still exists. We are using an up-to-date repo and we get exactly the same error using 4 GPUs, at exactly the same point, when testing the last minibatch. The problem does not exist when using one or two GPUs.

Here is the full trace:

Class Images Targets P R mAP@0.5 F1: 100% 1548/1549 [22:49<00:01, 1.01s/it]
Traceback (most recent call last):
  File "/root/.trains/venvs-builds/3.6/task_repository/yolov3_training.git/test.py", line 98, in test
    inf_out, train_out = model(imgs, augment=augment)  # inference and training outputs
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 3 on device 3.
Original Traceback (most recent call last):
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'

Any suggestions other than @leoll2's workaround of skipping the last iteration?

@glenn-jocher
Member

glenn-jocher commented May 4, 2020

@berkerlogoglu thanks. Can you reproduce this error in a common environment (i.e. the Docker image or a GCP VM) on an open dataset like coco64.data?

Without this we cannot debug.

@glenn-jocher glenn-jocher reopened this May 4, 2020
@kaanakan

kaanakan commented May 5, 2020

Hi @glenn-jocher,

I am working with @berkerlogoglu. I have tried your Docker image. Here are my observations:

The earlier comment by @berkerlogoglu, #1074 (comment), used a custom validation set of approximately 99k images in a different Docker container. After your suggestion, I tried it with your Docker image and the error occurred again.

After that, I tried coco64.data and nothing happened. I suspected the error only occurs on very big datasets, so I tried a custom COCO validation set with approximately 2k images:

First, I used a batch size of 16, which makes 125 batches to process; no error occurred.
Then, I used a batch size of 2, which makes 1000 batches to process, and the same error occurred.

The error log is:

Traceback (most recent call last):
  File "train.py", line 475, in <module>
    train()  # train normally
  File "train.py", line 349, in train
    dataloader=testloader)
  File "/root/.trains/venvs-builds/3.6/task_repository/yolov3_training.git/test.py", line 101, in test
    inf_out, train_out = model(imgs, augment=augment)  # inference and training outputs
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 2 on device 2.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
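
One way to see why the batch size matters: DataParallel (and DistributedDataParallel in single-process multi-GPU mode) splits each batch across devices via torch.chunk, and a final batch smaller than the device count yields fewer input chunks than model replicas, so some replicas are invoked with keyword arguments only and no positional x. A CPU-only illustration of the splitting behaviour (the shapes and GPU count here are illustrative, not the repo's code):

import torch

inputs = torch.zeros(1, 3, 416, 416)    # a last batch holding a single image
chunks = torch.chunk(inputs, 4, dim=0)  # scatter asks for one chunk per GPU
print(len(chunks))                      # 1 -- three of the four replicas receive no input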

I hope we can find a way to solve this problem. Thanks.

@leoll2

leoll2 commented May 5, 2020

@kaanakan I don't think the dataset size plays a big role. I got the error on a relatively small dataset (~2500 images, 400 validation).

@glenn-jocher
Member

@kaanakan @leoll2 ok, thanks. I need to be able to reproduce the issue with a common dataset, otherwise we cannot debug it.

From what I gather above the error only appears on 4-GPU testing of specific datasets. It is not reproducible on coco64.data. Is it reproducible with coco2017.data or coco2014.data?

@glenn-jocher glenn-jocher added the TODO Tasks to be completed label May 7, 2020
@glenn-jocher glenn-jocher changed the title How to solve this problem TypeError: forward() missing 1 required positional argument: 'x' when testing 99% TypeError: forward() missing 1 required positional argument: 'x' May 7, 2020
@glenn-jocher
Member

@joel5638 as I've mentioned to the others, if you can reproduce this error in a reproducible environment with a reproducible dataset, then we can debug it, i.e. send us a Google Colab notebook producing the error on COCO if you can.

@Hidayat722

Hidayat722 commented May 10, 2020

So the fix is just to add a check on the batch size:

batch_sz = imgs.size(0)       # size of the current batch
if batch_size == batch_sz:    # only run inference on full-size batches
    with torch.no_grad():     # disable gradients
        ...                   # etc.

I tried to push but was not able to. For some reason the batch size in the last iteration is not equal to the original batch_size, which is why the error occurs.

@Hidayat722

@joel5638 can you please make the changes in the code? Thanks.

@glenn-jocher
Member

@Hidayat722 it's normal for batch sizes to vary; that should not cause a bug. We cannot implement your proposed fix, as it would omit mAP computations on the last batch. If you can reproduce this error, please reproduce it in a Colab notebook on COCO so that we may run it ourselves and debug.

@leiyuncong1202

Hello, I also encountered a similar problem. This error occurs when using multiple GPUs for training and testing. Is it caused by mixing different kinds of GPUs? Here are the details of my devices:

Using CUDA device0 _CudaDeviceProperties(name='TITAN V', total_memory=12058MB)
           device1 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11172MB)
           device2 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11172MB)
           device3 _CudaDeviceProperties(name='TITAN V', total_memory=12058MB)

@glenn-jocher
Member

@leiyuncong1202 it is not recommended to use different types of GPUs together. In your case you might want to use --device 0,3 or --device 1,2, for example.

@leiyuncong1202

@leiyuncong1202 it is not recommended to use different types of GPUs together. In your case you might want to use --device 0,3 or --device 1,2, for example.

Following your suggestion, my problem has been solved. Thank you~

@linzzzzzz
Contributor

linzzzzzz commented Jun 19, 2020

I also came across this error today when testing using 3 GPUs.
TypeError: forward() missing 1 required positional argument: 'x'

Edit:
I want to note that the issue seems to be related to the batch size: a batch size of 18 works, but a batch size of 21 does not. Here is a similar issue from another repo: Eromera/erfnet_pytorch#2

@glenn-jocher
Member

@linzzzzzz best practice is to use an even number of GPUs whenever you use more than one.

@linzzzzzz
Contributor

@glenn-jocher Thanks for the suggestion :)

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jul 20, 2020
@glenn-jocher glenn-jocher removed the TODO Tasks to be completed label Jan 9, 2021
@tommyma0402

tommyma0402 commented Jul 15, 2021

I don't know if this is still relevant. I am currently working on a project that needs to use the archive branch. I ran into this problem and found a workaround: just reconstruct your train.txt and test.txt files (with a different split ratio, or simply re-randomize them).

Hypothesis: I haven't done any controlled experiments with the COCO dataset yet, but from reading the comments and some experiments of my own, it might have to do with the number of images in the last batch of the test dataset. This comment might not be relevant if this has already been solved in the master branch.

If the hypothesis is true, the workaround could be as simple as deleting one or two image lines from test.txt or train.txt.

Edit: After some more experiments, I think the source of the bug is that the last test batch does not have enough inputs to fill all the GPUs. @glenn-jocher, this bug can be reproduced when (number of test samples % batch size) < number of GPUs. For example: number of test samples = 25, batch size = 24 (3x8), number of GPUs = 3. Since the last batch only has 1 image, forward() will be missing its positional argument on the other two GPUs. Fix: if the GPU count is low, simply add a few samples to fit the GPU count, or delete a few samples. If the GPU count is high, well...
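
A small helper expressing that condition (the function name and numbers are illustrative, not from the repo):

def last_batch_triggers_bug(n_samples: int, batch_size: int, n_gpus: int) -> bool:
    """True when the final batch holds fewer images than there are GPUs,
    leaving some DataParallel replicas with no positional input."""
    remainder = n_samples % batch_size
    return 0 < remainder < n_gpus

# The example above: 25 test samples, batch size 24 (3x8), 3 GPUs.
print(last_batch_triggers_bug(25, 24, 3))  # True: the last batch has only 1 image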

@glenn-jocher
Member

@tommyma0402 thanks for sharing your findings! This seems like a valid hypothesis and a practical workaround, and your experiments and proposed fix can help others who encounter the same problem. Keep up the great work!
