Distributed training hangs due to missing keys in mmseg.segmentors.base.BaseSegmentor._parse_losses
#1030
Comments
Distributed training with a DeepLabv3 config works fine, so I think it has something to do with my model. My workload differs from batch to batch: some batches take a bit more time than others depending on the data itself. The code is complicated, so I cannot give you a simple example to reproduce it. Any idea how to debug this? Thanks in advance!
Hi, based on your limited description I cannot give suggestions. I think it is caused by your customized model settings.
Basically, the model does the following:
Everything works fine with a single GPU. When starting distributed training, it first raises an error that some parameters are unused. I then multiply all outputs of the forward function by zero and add them to the final loss. That gets rid of the unused-parameter error, but training gets stuck at backprop when one of the GPUs encounters the case where y equals 0 for all RoIs.
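To make that concrete, here is a rough, self-contained sketch of the multiply-by-zero workaround; the toy module and all names below are invented for illustration and are not taken from the issue:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Toy model with an optional RoI branch, illustrating the
    multiply-by-zero trick described above (all names are made up)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.roi_head = nn.Linear(8, 1)  # only meaningful when positive RoIs exist

    def forward(self, x, y):
        feats = self.backbone(x)
        roi_out = self.roi_head(feats)
        loss = feats.pow(2).mean()  # stand-in for the main segmentation loss
        if y.any():
            loss = loss + (roi_out.squeeze(-1) - y.float()).pow(2).mean()
        else:
            # keep roi_head in the autograd graph with a zero contribution,
            # so DDP does not report its parameters as unused on this rank
            loss = loss + 0.0 * roi_out.sum()
        return loss

# usage: loss = ToyModel()(torch.randn(4, 8), torch.zeros(4)); loss.backward()
```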
At the beginning, I wrote something like:

```python
if y.any():
    my_seg_head.forward_train(x, y)
```

I thought this might cause the case where one GPU enters this branch while another does not. This may be a PyTorch issue.
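For context, here is a standalone toy script (not from the issue) showing the kind of collective mismatch that produces such hangs; it deadlocks on purpose because rank 1 never joins the `all_reduce` that rank 0 issues:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29501'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    t = torch.ones(1)
    if rank == 0:
        # pretend only rank 0 produced the extra key and reduces it;
        # this call blocks forever because rank 1 never participates
        dist.all_reduce(t)
    dist.destroy_process_group()

if __name__ == '__main__':
    # expected to hang, demonstrating the mismatch; kill with Ctrl+C
    mp.spawn(worker, args=(2,), nprocs=2)
```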
Some logs (I added the === lines manually):
We can see that the forward_train pass finishes. I also tried downgrading PyTorch to 1.8, with no luck.
Now I have found the reason: I should always keep the calculation workflow identical across the GPUs, including the accuracy terms. I have code similar to the following:

```python
if mask.any():
    loss['roi_acc'] = roi_acc(feats)
```

The reason is that, for images that do not have any positive RoIs, the `roi_acc` key is never added to the loss dict on that GPU.
@MengzhangLI In `_parse_losses` there is:

```python
for loss_name, loss_value in log_vars.items():
    # reduce loss when distributed training
    if dist.is_available() and dist.is_initialized():
        loss_value = loss_value.data.clone()
        dist.all_reduce(loss_value.div_(dist.get_world_size()))
    log_vars[loss_name] = loss_value.item()
```

If one GPU, say GPU A, does not have a `roi_acc` key, it never issues the corresponding `all_reduce`, while the other GPUs do and wait for it forever. A quick fix is to change this loop to:

```python
# iterate over a copy of the items so keys can be deleted inside the loop;
# a '<key>_dist_counter' entry is assumed to be inserted after its '<key>' entry
for loss_name, loss_value in list(log_vars.items()):
    if loss_name.endswith('_dist_counter'):  # e.g. roi_acc_dist_counter -> roi_acc
        if dist.is_available() and dist.is_initialized():
            dist_count = loss_value.data.clone()
            dist.all_reduce(dist_count)
            key = loss_name.replace('_dist_counter', '')
            log_vars[key] *= dist.get_world_size() / dist_count.item()
        del log_vars[loss_name]
    else:
        # reduce loss when distributed training
        if dist.is_available() and dist.is_initialized():
            loss_value = loss_value.data.clone()
            dist.all_reduce(loss_value.div_(dist.get_world_size()))
        log_vars[loss_name] = loss_value.item()
```

For those GPUs without a `roi_acc`, I add `roi_acc = 0` together with `roi_acc_dist_counter = 0`; on the GPUs that do compute it, I add `roi_acc_dist_counter = 1`. Every rank then carries the same keys, and the averaged `roi_acc` is rescaled to the mean over only the ranks that actually computed it.
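To illustrate, here is a hypothetical sketch (names assumed, not from the issue) of how a head could populate the loss dict so that every rank carries the same keys for the modified loop above:

```python
import torch

def roi_metric_terms(feats, mask, roi_acc_fn):
    """Hypothetical helper: always emit the same keys on every rank, so the
    modified _parse_losses loop can rescale roi_acc by how many ranks computed it."""
    terms = {}
    if mask.any():
        terms['roi_acc'] = roi_acc_fn(feats)
        terms['roi_acc_dist_counter'] = torch.ones((), device=feats.device)
    else:
        terms['roi_acc'] = torch.zeros((), device=feats.device)
        terms['roi_acc_dist_counter'] = torch.zeros((), device=feats.device)
    return terms
```

With this, `dist.all_reduce` is called the same number of times on every rank, so nothing hangs.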
This solution is not perfect: when none of the batches on any GPU has this key, `dist_count` is zero and the rescaling divides by zero.
@MengzhangLI Can I pass a customized `_parse_losses` to replace `mmseg.segmentors.base.BaseSegmentor._parse_losses`?
Hi, sorry for the late reply. Very happy to see that you have basically fixed your problem. Hope you like our codebase. You can try renaming the key.
Frankly, I am not sure. Could you give it a try? We would be glad to get your feedback!
I know that if a key has "loss" in its name, it will be involved in backprop. However, I think …
Actually, I can do this; maybe when I have time I'll create a PR. Closing this.
That would be very cool; can't wait to work together with you. Best,
Also, we have noted this potential bug and will try to fix it. Thank you very much for your excellent issue.
I opened another issue, #1034, reporting this bug, and I will also try to provide a fix for it.
Also, there is an error in your tutorial here. You said only the losses with the `loss_` prefix would be summed, but the code

```python
loss = sum(_value for _key, _value in log_vars.items()
           if 'loss' in _key)
```

suggests that any losses with `'loss'` anywhere in the key are summed.
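For illustration (the key names below are invented), that means any key containing the substring `loss` contributes to the total loss, not only keys starting with `loss_`:

```python
# Toy log_vars dict with made-up key names:
log_vars = {'decode.loss_ce': 0.25, 'aux.loss_ce': 0.125,
            'my_roi_loss': 0.5, 'acc_seg': 0.9}
loss = sum(_value for _key, _value in log_vars.items() if 'loss' in _key)
print(loss)  # 0.875 -> 'my_roi_loss' is counted, 'acc_seg' is not
```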
You are right. Sorry for my misleading statement.
I fixed this too. Should I also create a PR for this feature? I have pushed a PR to fix the infinite waiting.
When training on multiple GPUs, my customized model gets stuck. When training on only one GPU, it works fine. Ctrl+C gives me the following error stack:
I cannot find much useful information online. Any advice on how to debug this further?
Environment: