Fix the device inconsistency error in yolov7 training #397

Merged

hhaAndroid merged 6 commits into open-mmlab:dev from zbc/fix_yolov7_assigner on Dec 27, 2022

Conversation

zhubochao
Contributor

Motivation

When training on custom data with YOLOv7, training fails with `RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)`. The same problem was reported on Stack Overflow and in the official YOLOv7 issue #1224.
The cause seems to be that at line 309 of mmyolo/models/task_modules/assigners/batch_yolov7_assigner.py, `_from_which_layer = _from_which_layer[fg_mask_inboxes]`, the index `fg_mask_inboxes` and the indexed tensor `_from_which_layer` are not on the same device.
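For context, the failure mode can be reproduced in isolation by indexing a CPU tensor with a CUDA boolean mask. A minimal sketch (variable names borrowed from the assigner; shapes and values are illustrative, and a CUDA device is assumed):

```python
import torch

# Hypothetical stand-ins for the assigner's tensors; only the device
# placement matters for reproducing the error.
_from_which_layer = torch.tensor([0, 1, 2, 0, 1])             # lives on CPU
fg_mask_inboxes = torch.tensor([True, False, True, True, False],
                               device='cuda')                 # lives on GPU

# On recent PyTorch (e.g. 1.13) this raises:
#   RuntimeError: indices should be either on cpu or on the same device
#   as the indexed tensor (cpu)
_from_which_layer = _from_which_layer[fg_mask_inboxes]
```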

sys.platform: linux
Python: 3.9.15 (main, Nov 24 2022, 14:31:59) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2: Tesla T4
CUDA_HOME: /home/zbc/miniconda3/envs/test
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
PyTorch: 1.13.1
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.6
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.3.2 (built against CUDA 11.5)
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.14.1
OpenCV: 4.6.0
MMEngine: 0.3.2
MMCV: 2.0.0rc3
MMDetection: 3.0.0rc4
MMYOLO: 0.2.0+27487fd

In the comments of this blog, one commenter says that simply downgrading PyTorch and CUDA from PyTorch 1.13 cu117 to PyTorch 1.7 cu110 could solve this problem.

Modification

Move `matching_matrix` to the same device as `_from_which_layer`.
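A minimal sketch of the idea (not the exact diff from this PR; it assumes `fg_mask_inboxes` is derived from `matching_matrix` before the indexing at line 309, and the dummy tensors here are hypothetical):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Illustrative stand-ins for the assigner's intermediate tensors.
matching_matrix = torch.zeros(3, 5)                           # may live on CPU
_from_which_layer = torch.tensor([0, 1, 2, 0, 1], device=device)

# The fix: align matching_matrix with _from_which_layer first, so the
# boolean mask derived from it shares a device with the tensor it indexes.
matching_matrix = matching_matrix.to(_from_which_layer.device)
fg_mask_inboxes = matching_matrix.sum(0) > 0.0
_from_which_layer = _from_which_layer[fg_mask_inboxes]        # no device mismatch
```

This keeps the indexing on whatever device the model runs on, instead of forcing a copy back to CPU.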

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


zhubochao seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@zhubochao zhubochao changed the base branch from main to dev December 20, 2022 15:23
@hhaAndroid
Collaborator

@zhubochao please fix lint and sign the CLA.

@zhubochao
Contributor Author

pre-commit fixed the lint errors.

@PeterH0323 PeterH0323 changed the title [FIX] Indices should be either on cpu or on the same device as the in… [Fix] Indices should be either on cpu or on the same device as the in… Dec 21, 2022
@RangeKing
Collaborator

Hi @zhubochao, thanks for your kind PR. Could you please click the badge to sign the CLA so that we can merge this PR?

@zhubochao zhubochao requested a review from hhaAndroid December 27, 2022 03:47
@hhaAndroid hhaAndroid changed the title [Fix] Indices should be either on cpu or on the same device as the in… Fix the device inconsistency error in yolov7 training Dec 27, 2022
@hhaAndroid hhaAndroid merged commit 6c5acd2 into open-mmlab:dev Dec 27, 2022
@zhubochao zhubochao deleted the zbc/fix_yolov7_assigner branch January 6, 2023 08:43