
Starting BevFormer training in Paddle3D fails with OSError: (External) CUBLAS error(15). #422

Open
jjrCN opened this issue Nov 1, 2023 · 1 comment

jjrCN commented Nov 1, 2023

Starting BevFormer training in Paddle3D fails with
OSError: (External) CUBLAS error(15).

Full log:
File "/root/jiajinrang/Paddle3D/paddle3d/models/detection/bevformer/bevformer.py", line 149, in obtain_history_bev
prev_bev = self.pts_bbox_head(
File "/root/miniconda3/envs/paddle_env_1/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 948, in call
return self.forward(*inputs, **kwargs)
File "/root/jiajinrang/Paddle3D/paddle3d/models/detection/bevformer/bevformer_head.py", line 255, in forward
bev_embed = self.transformer.get_bev_features(
File "/root/jiajinrang/Paddle3D/paddle3d/models/transformers/transformer.py", line 221, in get_bev_features
tmp_prev_bev = rotate(
File "/root/jiajinrang/Paddle3D/paddle3d/models/transformers/utils.py", line 217, in rotate
img = _rotate(img, matrix=matrix, interpolation=interpolation)
File "/root/jiajinrang/Paddle3D/paddle3d/models/transformers/utils.py", line 192, in _rotate
grid = _gen_affine_grid(theta, w=w, h=h, ow=ow, oh=oh)
File "/root/jiajinrang/Paddle3D/paddle3d/models/transformers/utils.py", line 127, in _gen_affine_grid
output_grid = base_grid.reshape([1, oh * ow, 3]).bmm(rescaled_theta)
File "/root/miniconda3/envs/paddle_env_1/lib/python3.8/site-packages/paddle/tensor/linalg.py", line 1730, in bmm
return _C_ops.bmm(x, y)
OSError: (External) CUBLAS error(15).
[Hint: Please search for the error code(15) on website (https://docs.nvidia.com/cuda/cublas/index.html#cublasstatus_t) to get Nvidia's official solution and advice about CUBLAS Error.] (at /paddle/paddle/phi/kernels/funcs/blas/blas_impl.cu.h:62)

I1101 12:54:37.573464 20238 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
LAUNCH INFO 2023-11-01 12:54:39,363 Exit code 1
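
For anyone trying to reproduce this: the trace fails inside paddle.bmm (called via _C_ops.bmm), so a minimal check of that op in isolation, outside the training script, looks like the sketch below (shapes are arbitrary and not taken from BevFormer):

# sanity check for the failing cuBLAS batched matmul (illustrative only)
import paddle

paddle.set_device("gpu:0")        # run on the first GPU
x = paddle.rand([1, 4, 3])        # batch of 1, 4x3 matrix
y = paddle.rand([1, 3, 2])        # batch of 1, 3x2 matrix
print(paddle.bmm(x, y).shape)     # expected: [1, 4, 2]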

On a single machine, this error occurs whether training runs on one GPU or on multiple GPUs.
One clue, though: in the single-GPU case, adding paddle.utils.run_check() to the main entry of train.py lets single-GPU training start normally, but the multi-GPU case still fails.
The single-GPU workaround looks like this:
if __name__ == '__main__':
    paddle.utils.run_check()
    args = parse_args()
    main(args)
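
The multi-GPU run is started through paddle.distributed.launch (see the LAUNCH INFO line in the log). A sketch of the same placement there, which still fails for me, is shown below; the init_parallel_env() call is the standard one and is not copied from the actual Paddle3D train.py:

# sketch only: same workaround placement in a multi-GPU worker
import paddle
import paddle.distributed as dist

if __name__ == '__main__':
    paddle.utils.run_check()    # the single-GPU workaround from above
    dist.init_parallel_env()    # standard NCCL setup for launched workers
    args = parse_args()         # as in train.py
    main(args)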

How can this be resolved for the multi-GPU case?

Additional supplementary information
Environment information:

Paddle version: N/A
Paddle With CUDA: N/A

OS: CentOS 7
GCC version: (GCC) 7.5.0
Clang version: N/A
CMake version: version 2.8.12.2
Libc version: glibc 2.17
Python version: 3.8.18

CUDA version: 11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0
cuDNN version: 8.8.0
Nvidia driver version: 520.61.05
Nvidia driver List:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

LielinJiang (Collaborator) commented:

If you run paddle.utils.run_check() by itself, does it report any problem with multi-GPU?
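
For example, a minimal standalone check could look like the sketch below (the script name and extra prints are just illustrative), run directly in the same conda env:

# check_paddle.py -- standalone sanity check (sketch)
import paddle

print("Paddle version:", paddle.__version__)
print("Compiled with CUDA:", paddle.device.is_compiled_with_cuda())
print("Visible GPUs:", paddle.device.cuda.device_count())

# run_check() runs a small program on the GPU and, when several GPUs are
# visible, also exercises multi-GPU communication, reporting any failure.
paddle.utils.run_check()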
