TypeError: forward() missing 1 required positional argument: 'x' #1074
Hello @junglezhao, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments. If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
@junglezhao I would make sure your code is up to date using git pull, and if the issue persists please provide a minimum reproducible example.
@qtw1998 @junglezhao Yes, an augment boolean can be passed to the model() forward method to conduct augmented inference for higher recall and better mAP, but it is not a required argument, as it has a default value. From the command line you can run:
python3 test.py --augment
python3 detect.py --augment
(see Line 232 in 4c4f4f4)
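For context on why this exact TypeError message appears: a minimal, hedged sketch (the class and return values are invented for illustration, not taken from the repo's code) of a forward() where x is a required positional argument while augment is optional:

```python
class TinyModel:
    """Hypothetical stand-in for a model's forward() signature."""

    def forward(self, x, augment=False):
        # 'x' is required; 'augment' merely toggles extra inference passes.
        return {"input": x, "augmented": augment}

    # Make instances callable like nn.Module objects.
    __call__ = forward


m = TinyModel()
print(m("img", augment=True))  # {'input': 'img', 'augmented': True}

# Calling with no input reproduces the error in this thread's title:
#   m()  ->  TypeError: forward() missing 1 required positional argument: 'x'
```

This is why passing augment is always safe, but any code path that ends up invoking forward() with no input at all (as happens later in this thread on multi-GPU setups) raises the error.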
OK, thanks. I chose to re-download the repo and reset the config to solve this problem.
I am also getting a similar error. I followed the instructions given to train yolov3 on custom dataset. I have prepared my custom dataset according to the required format. When I start training, I get the following error:
I think the error occurs during testing, but I have no idea why. What can be the reason for this error?
I encountered the same bug when testing (on 8 GPUs), in the last minibatch to be precise.
@leoll2 @Rajat-Mehta your code may be out of date, I would advise a git pull.
@glenn-jocher I already tried to pull the latest code. That did not solve my problem. This error is encountered while training and testing on multiple GPUs; I tried to train on a single GPU and that resolved the error.
@Rajat-Mehta OK, thank you. Are you able to reproduce the error on an open dataset like coco64.data? If so, please send us exact code to reproduce and we can get started debugging it.
I updated PyTorch from 1.4 to 1.5 and now the training process is not working on multiple GPUs, even on the coco dataset. But training works fine on a single GPU.
Reproduce Our Environment: To access an up-to-date working environment (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled), consider a:
@glenn-jocher, @junglezhao, @leoll2 I can confirm that this bug still exists. We are using an up-to-date repo and we get exactly the same error using 4 GPUs, at exactly the same point, when testing the last minibatch. The problem does not exist when using one or two GPUs. Here is the relevant part of the trace:
Class Images Targets P R mAP@0.5 F1: 100% 1548/1549 [22:49<00:01, 1.01s/it]
Traceback (most recent call last):
File "/root/.trains/venvs-builds/3.6/task_repository/yolov3_training.git/test.py", line 98, in test
Any other suggestion other than @leoll2's skipping the last iteration?
@berkerlogoglu thanks. Can you reproduce this error in a common environment (i.e. the docker image or a GCP VM) on an open dataset like coco64.data? Without this we cannot debug.
Hi @glenn-jocher, I am working with @berkerlogoglu. I have tried your docker image. Here are my observations: the comment written by @berkerlogoglu, #1074 (comment), was using a custom validation set with approximately 99k images, in a different docker. After your suggestion, I tried it with your docker image and the error occurred again. After that, I tried with coco64.data and nothing happened. I thought the error occurs on very big datasets, so I tried a custom coco validation set with approximately 2k images. First, I used a batch size of 16, which makes 125 batches to process; no error occurred. The error log is:
I hope we can find a way to solve this problem.
@kaanakan I don't think the dataset size plays a big role. I got the error on a relatively small dataset (~2500 images, 400 validation).
@kaanakan @leoll2 OK, thanks. I need to be able to reproduce the issue with a common dataset, otherwise we cannot debug it. From what I gather above, the error only appears on 4-GPU testing of specific datasets. It is not reproducible on coco64.data. Is it reproducible with coco2017.data or coco2014.data?
@joel5638 as I've mentioned to the others, if you can reproduce this error in a reproducible environment with a reproducible dataset then we can debug, i.e. send us a Google Colab notebook producing the error on coco if you can.
So the fix is just to add a check on the batch size. I tried to push it but was not able to. For some reason the batch size in the last batch is not equal to the original batch_size; that's why the error occurs.
@joel5638 can you please make the changes in the code?
@Hidayat722 it's normal for batch sizes to vary; it should not cause a bug. We cannot implement your proposed fix, as it would omit mAP computations on the last batch. If you can reproduce this error, please do so in a Colab notebook on coco so that we may run it ourselves and debug.
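To see why a shrunken final batch can still matter on multi-GPU even though varying batch sizes are normal, here is a rough, pure-Python sketch of how a batch gets split across device replicas. It assumes chunk sizes of ceil(batch / num_gpus), loosely modeled on torch.chunk; it is not the actual scatter code, just an illustration.

```python
import math

def scatter_sizes(batch_size, num_gpus):
    """Return the per-replica chunk sizes for a batch split across GPUs.
    Chunks are of size ceil(batch_size / num_gpus); once the batch is
    exhausted, any remaining replicas receive no input chunk at all."""
    if batch_size <= 0:
        return []
    step = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(step, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

# A full batch of 16 on 4 GPUs: every replica gets 4 images.
print(scatter_sizes(16, 4))  # [4, 4, 4, 4]

# A final batch of only 2 on 4 GPUs: just 2 chunks exist, so two
# replicas are invoked with no input, i.e. forward() is missing 'x'.
print(scatter_sizes(2, 4))   # [1, 1]
```

Under this model a varying batch size is indeed harmless as long as every replica still receives at least one image; the failure mode is specifically a last batch smaller than the GPU count.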
Hello, I also encountered a similar problem. This error occurs when using multiple GPUs for training and testing. Is it caused by different kinds of GPUs? These are the details of my device:
@leiyuncong1202 it is not recommended to use different types of GPUs together. In your case you might want to use --device 0,3, for example, or --device 1,2.
According to your suggestion, my problem has been solved. Thank you~ |
I also came across this error today when testing using 3 GPUs. Edit: |
@linzzzzzz best practice is to use an even number of GPUs whenever you use more than one.
@glenn-jocher Thanks for the suggestion :) |
I don't know if this is still relevant. I am currently working on a project and need to use the archive branch. I ran into this problem and found a workaround: you can just reconstruct your train.txt and test.txt files (with a different ratio, or just randomize them again).
Hypothesis: I haven't done any controlled experiment with the coco dataset yet, but from reading the comments and some experiments with my own data, it might have to do with the number of images in the last batch of the test dataset. This comment might not be relevant since this could be solved in the master branch. If the hypothesis is true, the workaround could be as simple as deleting one or two image lines from test.txt or train.txt.
Edit: After some more experiments, I think the source of the bug is that the last test batch does not have enough inputs to fill all the GPUs. @glenn-jocher This bug can be reproduced when (number of test samples % batch size < number of GPUs). For example: number of test samples = 25, batch size = 24 (3x8), number of GPUs = 3. Since the last batch only has 1 image, forward() will be missing its parameter on the other two GPUs.
Fix: if the GPU count is low, simply add a few samples to fit the GPU count, or delete a few samples. If the GPU count is high, well...
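The reproduction condition above can be sketched as a quick sanity check. The function name and interface are mine, not code from the repo; it simply mirrors the formula in the comment (a non-empty last batch smaller than the GPU count):

```python
def last_batch_starves_gpus(num_samples, batch_size, num_gpus):
    """True when the final batch is non-empty but has fewer images than
    there are GPU replicas -- the suspected trigger for the TypeError."""
    last = num_samples % batch_size
    return 0 < last < num_gpus

# The example from the comment above: 25 samples, batch 24 (3x8), 3 GPUs.
print(last_batch_starves_gpus(25, 24, 3))  # True

# Dropping one sample makes the dataset divide evenly, so every batch
# (including the last) can feed all three replicas:
print(last_batch_starves_gpus(24, 24, 3))  # False
```

Running this over your own dataset size before a multi-GPU test run would tell you in advance whether trimming or padding test.txt by a sample or two is worth trying as a workaround.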
@tommyma0402 thanks for sharing your findings! Your investigation and insights are valuable for the community. This indeed seems like a valid hypothesis and a practical workaround for this issue. Your thorough experiment and proposed fix can help others who encounter the same problem. Keep up the great work! |
🐛 Bug
Hi guys, when test.py runs at 99%, an error like the following occurs:
(I didn't change the file...)