Distributed multi-node multi-GPU training hangs and errors out after a timeout #8
Comments
The reasons for a distributed training hang are fairly complex, but the main cause is usually that during distributed communication (for example the AllReduce operation here), the data on different nodes does not match and cannot be synchronized, so some processes keep waiting for data that never arrives until they time out and exit. The data that typically has to be synchronized across processes is mainly the dataset batches and the model gradients.
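As an illustration, here is a minimal sketch of that failure mode (my own example, not code from this repository; it assumes a `torchrun --nproc_per_node=2` launch with the NCCL backend): a rank-dependent code path means some ranks enter a collective that another rank never reaches, and they block until the watchdog timeout shown later in this issue.

```python
# Minimal hang-reproduction sketch (assumes: torchrun --nproc_per_node=2 repro.py,
# NCCL backend, >= 2 GPUs). Illustration only, not from this repository.
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")  # torchrun provides the env:// rendezvous
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    t = torch.ones(1, device="cuda")

    # Every rank must reach the same collectives in the same order.
    # If rank 0 skips this all_reduce (e.g. its dataloader yields one batch fewer),
    # the remaining ranks block here until the NCCL watchdog kills the job.
    if rank != 0:  # BUG: rank-dependent collective call
        dist.all_reduce(t)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```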
When I tested the code with single-node multi-GPU training I did not run into this hang, and at the moment I don't have a good way to pinpoint the cause. Could you provide the dataset you ran, the training configuration, the log file, the shell output, and which of the following statements the code was executing when it hung:
I'd suggest trying the following workarounds:
You can also refer to similar issues in other projects: This problem is hard to solve, though; I will keep debugging the code and see whether I can reproduce it. If convenient, you can send your contact information to my email ([email protected]) so we can discuss further.
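One workaround that is often suggested for this kind of watchdog timeout (added here as an illustration under my own assumptions, not necessarily part of the maintainer's original list) is to raise the NCCL collective timeout through Accelerate's `InitProcessGroupKwargs`, so that a slow-but-correct step, such as a long end-of-epoch evaluation, does not trip the 600000 ms default seen in the log later in this issue. It only buys time; it does not fix a real desync.

```python
# Hedged sketch: extend the default 30-minute collective timeout via Accelerate.
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
)
```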
I ran into the same problem. I was training with 2-GPU parallelism and got the same error at the start of the second epoch; it can only run one epoch.
Did the original poster follow up with you? Were you able to resolve this? I reproduced the COCO experiments with 2-GPU parallelism and got the same error.
The original poster never contacted me, and I haven't been able to reproduce the problem either 😢 Could you provide more details, such as the error output, the train_config.py file, and your PyTorch version? The PyTorch versions I use are 1.12.0 and 2.1.1, and I haven't hit this problem on either of them, so you could try running with one of those versions.
My PyTorch version is 1.11.0, and the only change in train_config.py is the COCO dataset path. The error output is the same as the original poster's, but since I'm using a tmux window I can't copy all of it over. Thanks for the help! I'll try pytorch==1.12.0 and should know by tomorrow whether that works.
Hello! I tried 2-GPU parallel training with PyTorch 2.1.1 and still got the same error, so I suspect it is not the PyTorch version. If needed, I can run it again with PyTorch 1.12.0. Because I'm using a tmux window, the only error lines I could capture are the following (identical to the original poster's):
in launch_agent
You can scroll back in a tmux window: first press Ctrl-b and then [ to enter copy mode, then scroll up. Does the error occur in the second epoch every time? Does it always fail at the same training step? Also, would you mind adding me as a contact so we can discuss further?
Thanks for the tip! I'm now running with pytorch 1.12.0; if it errors again I'll copy the error output using the method you described. According to training.log, the failure indeed happens right at the start of the second epoch every time (it breaks before the epoch even begins). My WeChat ID is LCWHU-0823; you're very welcome to contact me for further discussion!
For some reason I can't find that WeChat ID 😧 maybe you've disabled being found via search; could you email me a WeChat QR code so I can add you? If it always breaks before the second epoch even starts, it probably isn't the dataset and model gradients getting out of sync, because then the processes would hang at a random epoch. I searched around and found a few similar reports where the hang also happens right at the start of an epoch; you can take a look: bubbliiiing/faster-rcnn-pytorch#9 (comment)
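To narrow down which collective and which rank diverge, one hedged option (my suggestion, not from the thread) is to turn on PyTorch's distributed debug output before the process group is created; the variable names below follow PyTorch 1.12–2.1 and can equally be exported in the shell before `accelerate launch`.

```python
# Hedged debugging sketch: enable verbose diagnostics so the failure names the
# mismatched collective/rank instead of only the generic watchdog timeout.
# Set these before Accelerator()/init_process_group(), or export them in the shell.
import os

os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # per-collective consistency checks
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # verbose NCCL transport logs
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")     # surface NCCL errors instead of hanging
```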
Hi, I made my own additions and modifications on top of your code and then ran single-node multi-GPU training. It also suddenly hangs, at accelerator.reduce(loss_dict, reduction="mean"). When I run on a single GPU without distributed training (I tried batch_size=1 and 2), it runs fine. Where is the problem most likely to be?
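Since the hang is at `accelerator.reduce(loss_dict, reduction="mean")`, one common cause worth checking (a hedged guess based only on the description above, with illustrative names such as `loss_aux` and `outputs`) is that the modified code produces `loss_dict`s whose keys, shapes, or call counts differ across ranks:

```python
# Hedged sketch of a typical pitfall behind a hang in accelerator.reduce(loss_dict):
# every rank must call reduce() the same number of times, with dicts that have the
# same keys and tensor shapes. Names here are illustrative, not from the repository.
import torch


def build_loss_dict(outputs, device):
    loss_dict = {
        "loss_cls": outputs["loss_cls"],
        "loss_bbox": outputs["loss_bbox"],
    }
    # PITFALL: adding a key only on ranks where some condition holds makes the
    # dicts diverge, and the collective inside reduce() never completes.
    # Always emit the key, using a zero tensor when it does not apply.
    loss_dict["loss_aux"] = outputs.get("loss_aux", torch.zeros((), device=device))
    return loss_dict
```

Likewise, if one rank hits a `continue` (for example on an image with no annotations) while the others still call `accelerator.reduce`, the collective counts drift apart and the job hangs in exactly this way.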
After the program finishes 1 epoch, it hangs during training in the second epoch and errors out after a timeout.
Roughly where does this problem occur?
[2024-05-09 01:12:34 accelerate.tracking]: Successfully logged to TensorBoard
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601336 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601338 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 600851 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 600851 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403382592/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7657580d87 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f7604ac04d6 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f7604ac3a2d in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f7604ac4629 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f76506dbbf4 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f7659e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f7659f26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601338 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403382592/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb489380d87 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fb4386c04d6 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fb4386c3a2d in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fb4386c4629 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7fb4842dbbf4 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb48da94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fb48db26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601336 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403382592/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff8f5980d87 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7ff8a2ec04d6 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7ff8a2ec3a2d in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7ff8a2ec4629 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7ff8eeadbbf4 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ff8f8294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7ff8f8326850 in /lib/x86_64-linux-gnu/libc.so.6)
[2024-05-09 01:28:29,323] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4024 closing signal SIGTERM
[2024-05-09 01:28:31,494] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 4022) of binary: /home/ubuntu/anaconda3/envs/salience_detr/bin/python
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/salience_detr/bin/accelerate", line 8, in
sys.exit(main())
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
multi_gpu_launcher(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main.py FAILED
Failures:
[1]:
time : 2024-05-09_01:28:29
host : ubuntu-X640-G30
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 4023)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 4023
[2]:
time : 2024-05-09_01:28:29
host : ubuntu-X640-G30
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 4025)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 4025
Root Cause (first observed failure):
[0]:
time : 2024-05-09_01:28:29
host : ubuntu-X640-G30
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 4022)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 4022