
TypeError: forward() missing 1 required positional argument: 'x' #1074

Closed
junglezhao opened this issue Apr 20, 2020 · 30 comments
Labels
bug Something isn't working Stale

Comments

@junglezhao

junglezhao commented Apr 20, 2020

🐛 Bug

Hi guys, when test.py reaches 99%, it raises the following error (I haven't changed any of the files):

Traceback (most recent call last):
  File "test.py", line 255, in <module>
    opt.augment
  File "test.py", line 94, in test
    inf_out, train_out = model(imgs, augment=augment)
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/root/anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'

@junglezhao junglezhao added the bug Something isn't working label Apr 20, 2020
@github-actions

github-actions bot commented Apr 20, 2020

Hello @junglezhao, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

@qtw1998

qtw1998 commented Apr 20, 2020

use --augment

@glenn-jocher
Member

@junglezhao I would make sure your code is up to date using git pull, and if the issue persists please provide a minimal reproducible example.

@glenn-jocher
Member

glenn-jocher commented Apr 21, 2020

@qtw1998 @junglezhao yes, an augment boolean can be passed to the model() forward method to conduct augmented inference for higher recall and better mAP, but it is not a required argument, as a default value of False is supplied. Nevertheless, you can run augmented inference from the command line with the --augment argparser argument:

python3 test.py --augment
python3 detect.py --augment

def forward(self, x, augment=False, verbose=False):
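
For programmatic use, here is a minimal sketch (assuming the Darknet model from this repo's models.py; the cfg path and input shape are illustrative):

import torch
from models import Darknet  # this repo's model definition

model = Darknet('cfg/yolov3.cfg')   # illustrative cfg path
model.eval()                        # eval mode returns (inference, training) outputs

imgs = torch.zeros(1, 3, 416, 416)  # dummy batch: one 416x416 RGB image
with torch.no_grad():
    inf_out, train_out = model(imgs)                 # augment defaults to False
    inf_out, train_out = model(imgs, augment=True)   # augmented inference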

@junglezhao
Author

use --augment

OK, thanks. I chose to re-download the repo and reset the config to solve this problem.

@Rajat-Mehta

Rajat-Mehta commented Apr 26, 2020

I am also getting a similar error. I followed the instructions for training yolov3 on a custom dataset and prepared my dataset in the required format. When I start training, I get the following error:

Traceback (most recent call last):
  File "train.py", line 422, in <module>
    train()  # train normally
  File "train.py", line 317, in train
    dataloader=testloader)
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/test.py", line 94, in test
    inf_out, train_out = model(imgs, augment=augment)  # inference and training outputs
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 3 on device 3.
Original Traceback (most recent call last):
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'

I think the error occurs during testing, but I have no idea why. What could be the reason for this error?

@leoll2

leoll2 commented Apr 26, 2020

I encountered the same bug when testing (on 8 GPUs), in the last minibatch to be precise.
A workaround is to skip the last test iteration: not really the definitive solution, but it works.
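
Here is a minimal sketch of that workaround inside test.py's evaluation loop (the variable names mirror the repo's test.py; note that it simply drops the final batch, so those images are excluded from the metrics):

n_batches = len(dataloader)
for batch_i, (imgs, targets, paths, shapes) in enumerate(dataloader):
    if batch_i == n_batches - 1:
        continue  # skip the last, possibly undersized, batch
    with torch.no_grad():
        inf_out, train_out = model(imgs.to(device), augment=augment)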

@glenn-jocher
Member

glenn-jocher commented Apr 26, 2020

@leoll2 @Rajat-Mehta your code may be out of date, I would advise a git pull or to reclone the current repo.

@Rajat-Mehta

@glenn-jocher I already tried to pull the latest code. That did not solve my problem.

This error is encountered when training and testing on multiple GPUs; training on a single GPU resolved it for me.

@glenn-jocher
Member

@Rajat-Mehta ok, thank you. Are you able to reproduce the error on an open dataset like coco64.data? If so, please send us exact code to reproduce it and we can get started debugging.

@Rajat-Mehta

I updated PyTorch from 1.4 to 1.5, and now training does not work on multiple GPUs even on the COCO dataset. Training works fine on a single GPU.

@glenn-jocher
Member

glenn-jocher commented May 2, 2020

Reproduce Our Environment

To access an up-to-date working environment (with all dependencies, including CUDA/CUDNN, Python and PyTorch, preinstalled), consider one of the environments linked above: the Google Colab Notebook, Docker Image, or GCP Quickstart Guide.

@berkerlogoglu

@glenn-jocher, @junglezhao, @leoll2 I can confirm that this bug still exists. We are using an up-to-date repo and we get exactly the same error using 4 GPUs, at exactly the same point, when testing the last minibatch. The problem does not exist when using one or two GPUs.

Here is the full trace:

Class Images Targets P R mAP@0.5 F1: 100% 1548/1549 [22:49<00:01, 1.01s/it]
Traceback (most recent call last):
  File "/root/.trains/venvs-builds/3.6/task_repository/yolov3_training.git/test.py", line 98, in test
    inf_out, train_out = model(imgs, augment=augment)  # inference and training outputs
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 3 on device 3.
Original Traceback (most recent call last):
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'

Any suggestions other than @leoll2's workaround of skipping the last iteration?

@glenn-jocher
Member

glenn-jocher commented May 4, 2020

@berkerlogoglu thanks. Can you reproduce this error in a common environment (i.e. the Docker image or a GCP VM) on an open dataset like coco64.data?

Without this we cannot debug.

@glenn-jocher glenn-jocher reopened this May 4, 2020
@kaanakan

kaanakan commented May 5, 2020

Hi @glenn-jocher,

I am working with @berkerlogoglu. I have tried your Docker image. Here are my observations:

The earlier comment by @berkerlogoglu, #1074 (comment), used a custom validation set of approximately 99k images in a different Docker container. After your suggestion, I tried it with your Docker image and the error occurred again.

After that, I tried coco64.data and nothing happened. I suspected the error only occurs on very big datasets, so I tried a custom COCO validation set with approximately 2k images:

First, I used a batch size of 16, which makes 125 batches to process; no error occurred.
Then, I used a batch size of 2, which makes 1000 batches to process, and the same error occurred.

The error log is:

Traceback (most recent call last):
  File "train.py", line 475, in <module>
    train()  # train normally
  File "train.py", line 349, in train
    dataloader=testloader)
  File "/root/.trains/venvs-builds/3.6/task_repository/yolov3_training.git/test.py", line 101, in test
    inf_out, train_out = model(imgs, augment=augment)  # inference and training outputs
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 2 on device 2.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
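
One way to see why the batch size matters: DataParallel (and DistributedDataParallel in single-process multi-GPU mode) splits each batch across devices via torch.chunk, and a final batch smaller than the device count yields fewer input chunks than model replicas, so some replicas are invoked with keyword arguments only and no positional x. A CPU-only illustration of the splitting behaviour (the shapes and GPU count here are illustrative, not the repo's code):

import torch

inputs = torch.zeros(1, 3, 416, 416)    # a last batch holding a single image
chunks = torch.chunk(inputs, 4, dim=0)  # scatter asks for one chunk per GPU
print(len(chunks))                      # 1 -- three of the four replicas receive no input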

I hope we can find a way to solve this problem. Thanks.

@leoll2

leoll2 commented May 5, 2020

@kaanakan I don't think the dataset size plays a big role. I got the error on a relatively small dataset (~2500 images, 400 validation).

@glenn-jocher
Member

@kaanakan @leoll2 ok, thanks. I need to be able to reproduce the issue with a common dataset, otherwise we cannot debug it.

From what I gather above the error only appears on 4-GPU testing of specific datasets. It is not reproducible on coco64.data. Is it reproducible with coco2017.data or coco2014.data?

@glenn-jocher glenn-jocher added the TODO Tasks to be completed label May 7, 2020
@glenn-jocher glenn-jocher changed the title How to solve this problem TypeError: forward() missing 1 required positional argument: 'x' when testing 99% TypeError: forward() missing 1 required positional argument: 'x' May 7, 2020
@glenn-jocher
Member

@joel5638 as I've mentioned to the others, if you can reproduce this error in a reproducible environment with a reproducible dataset, then we can debug it, i.e. send us a Google Colab notebook producing the error on COCO if you can.

@Hidayat722

Hidayat722 commented May 10, 2020

So the fix is just to add a check on the batch size:

batch_sz = imgs.size(0)       # size of the current batch
if batch_size == batch_sz:    # only run inference on full-size batches
    with torch.no_grad():     # disable gradients
        ...                   # etc.

I tried to push but was not able to. For some reason the batch size in the last iteration is not equal to the original batch_size, which is why the error occurs.

@Hidayat722

@joel5638 can you please make the changes in the code? Thanks.

@glenn-jocher
Member

@Hidayat722 it's normal for batch sizes to vary; that should not cause a bug. We cannot implement your proposed fix, as it would omit mAP computations on the last batch. If you can reproduce this error, please reproduce it in a Colab notebook on COCO so that we may run it ourselves and debug.

@leiyuncong1202

Hello, I also encountered a similar problem. This error occurs when using multiple GPUs for training and testing. Is it caused by mixing different kinds of GPUs? Here are the details of my devices:

Using CUDA device0 _CudaDeviceProperties(name='TITAN V', total_memory=12058MB)
           device1 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11172MB)
           device2 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11172MB)
           device3 _CudaDeviceProperties(name='TITAN V', total_memory=12058MB)

@glenn-jocher
Member

@leiyuncong1202 it is not recommended to use different types of GPUs together. In your case you might want to use --device 0,3 or --device 1,2, for example.

@leiyuncong1202

@leiyuncong1202 it is not recommended to use different types of GPUs together. In your case you might want to use --device 0,3 or --device 1,2, for example.

Following your suggestion, my problem has been solved. Thank you~

@linzzzzzz
Contributor

linzzzzzz commented Jun 19, 2020

I also came across this error today when testing using 3 GPUs.
TypeError: forward() missing 1 required positional argument: 'x'

Edit:
I want to note that the issue seems to be related to the batch size: a batch size of 18 works, but a batch size of 21 does not. Here is a similar issue from another repo: Eromera/erfnet_pytorch#2

@glenn-jocher
Member

@linzzzzzz best practice is to use an even number of GPUs whenever you use more than one.

@linzzzzzz
Contributor

@glenn-jocher Thanks for the suggestion :)

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jul 20, 2020
@glenn-jocher glenn-jocher removed the TODO Tasks to be completed label Jan 9, 2021
@tommyma0402

tommyma0402 commented Jul 15, 2021

I don't know if this is still relevant. I am currently working on a project that needs to use the archive branch. I ran into this problem and found a workaround: just reconstruct your train.txt and test.txt files (with a different split ratio, or simply re-randomize them).

Hypothesis: I haven't done any controlled experiments with the COCO dataset yet, but from reading the comments and some experiments of my own, it might have to do with the number of images in the last batch of the test dataset. This comment might not be relevant if this has already been solved in the master branch.

If the hypothesis is true, the workaround could be as simple as deleting one or two image lines from test.txt or train.txt.

Edit: After some more experiments, I think the source of the bug is that the last test batch does not have enough inputs to fill all the GPUs. @glenn-jocher, this bug can be reproduced when (number of test samples % batch size) < number of GPUs. For example: number of test samples = 25, batch size = 24 (3x8), number of GPUs = 3. Since the last batch only has 1 image, forward() will be missing its positional argument on the other two GPUs. Fix: if the GPU count is low, simply add a few samples to fit the GPU count, or delete a few samples. If the GPU count is high, well...
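
A small helper expressing that condition (the function name and numbers are illustrative, not from the repo):

def last_batch_triggers_bug(n_samples: int, batch_size: int, n_gpus: int) -> bool:
    """True when the final batch holds fewer images than there are GPUs,
    leaving some DataParallel replicas with no positional input."""
    remainder = n_samples % batch_size
    return 0 < remainder < n_gpus

# The example above: 25 test samples, batch size 24 (3x8), 3 GPUs.
print(last_batch_triggers_bug(25, 24, 3))  # True: the last batch has only 1 image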

@glenn-jocher
Member

@tommyma0402 thanks for sharing your findings! This seems like a valid hypothesis and a practical workaround, and your experiments and proposed fix can help others who encounter the same problem. Keep up the great work!
