Fix the device inconsistency error in yolov7 training #397

Merged

hhaAndroid merged 6 commits into open-mmlab:dev from zbc/fix_yolov7_assigner on Dec 27, 2022

Conversation

zhubochao
Contributor

Motivation

When training on custom data with YOLOv7, training fails with `RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)`. The same problem was reported on Stack Overflow and in the official YOLOv7 issue #1224.
The cause seems to be that at line 309 of mmyolo/models/task_modules/assigners/batch_yolov7_assigner.py, `_from_which_layer = _from_which_layer[fg_mask_inboxes]`, the index `fg_mask_inboxes` and the indexed tensor `_from_which_layer` are not on the same device.
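For context, the failure mode can be reproduced in isolation by indexing a CPU tensor with a CUDA boolean mask. A minimal sketch (variable names borrowed from the assigner; shapes and values are illustrative, and a CUDA device is assumed):

```python
import torch

# Hypothetical stand-ins for the assigner's tensors; only the device
# placement matters for reproducing the error.
_from_which_layer = torch.tensor([0, 1, 2, 0, 1])             # lives on CPU
fg_mask_inboxes = torch.tensor([True, False, True, True, False],
                               device='cuda')                 # lives on GPU

# On recent PyTorch (e.g. 1.13) this raises:
#   RuntimeError: indices should be either on cpu or on the same device
#   as the indexed tensor (cpu)
_from_which_layer = _from_which_layer[fg_mask_inboxes]
```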

sys.platform: linux
Python: 3.9.15 (main, Nov 24 2022, 14:31:59) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2: Tesla T4
CUDA_HOME: /home/zbc/miniconda3/envs/test
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
PyTorch: 1.13.1
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.6
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.3.2 (built against CUDA 11.5)
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.14.1
OpenCV: 4.6.0
MMEngine: 0.3.2
MMCV: 2.0.0rc3
MMDetection: 3.0.0rc4
MMYOLO: 0.2.0+27487fd

In the comments of this blog, one commenter says that simply downgrading PyTorch and CUDA from PyTorch 1.13 cu117 to PyTorch 1.7 cu110 could solve this problem.

Modification

Move `matching_matrix` to the same device as `_from_which_layer`.
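A minimal sketch of the idea (not the exact diff from this PR; it assumes `fg_mask_inboxes` is derived from `matching_matrix` before the indexing at line 309, and the dummy tensors here are hypothetical):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Illustrative stand-ins for the assigner's intermediate tensors.
matching_matrix = torch.zeros(3, 5)                           # may live on CPU
_from_which_layer = torch.tensor([0, 1, 2, 0, 1], device=device)

# The fix: align matching_matrix with _from_which_layer first, so the
# boolean mask derived from it shares a device with the tensor it indexes.
matching_matrix = matching_matrix.to(_from_which_layer.device)
fg_mask_inboxes = matching_matrix.sum(0) > 0.0
_from_which_layer = _from_which_layer[fg_mask_inboxes]        # no device mismatch
```

This keeps the indexing on whatever device the model runs on, instead of forcing a copy back to CPU.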

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


zhubochao seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@zhubochao zhubochao changed the base branch from main to dev December 20, 2022 15:23
@hhaAndroid
Collaborator

@zhubochao please fix lint and sign the CLA.

@zhubochao
Contributor Author

pre-commit fixed the lint errors.

@PeterH0323 PeterH0323 changed the title [FIX] Indices should be either on cpu or on the same device as the in… [Fix] Indices should be either on cpu or on the same device as the in… Dec 21, 2022
@RangeKing
Collaborator

Hi @zhubochao, thanks for your kind PR. Could you please click the badge to sign the CLA so that we can merge this PR?

@zhubochao zhubochao requested a review from hhaAndroid December 27, 2022 03:47
@hhaAndroid hhaAndroid changed the title [Fix] Indices should be either on cpu or on the same device as the in… Fix the device inconsistency error in yolov7 training Dec 27, 2022
@hhaAndroid hhaAndroid merged commit 6c5acd2 into open-mmlab:dev Dec 27, 2022
@zhubochao zhubochao deleted the zbc/fix_yolov7_assigner branch January 6, 2023 08:43