Long training time and no speed up with amp #1383

Closed
gathierry opened this issue Jun 16, 2022 · 2 comments

@gathierry

gathierry commented Jun 16, 2022

I'm using 8 V100 GPUs to reproduce YOLOX-s COCO training, and training is much slower than in your shared log here. I also find that --fp16 doesn't speed up training. Is this because my training is CPU-bound?

python -m yolox.tools.train -n yolox-s -d 8 -b 64 --fp16

2022-06-16 05:32:01 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 100/1849, mem: 5612Mb, iter_time: 0.764s, data_time: 0.007s, total_loss: 14.5, iou_loss: 4.8, l1_loss: 0.0, conf_loss: 7.9, cls_loss: 1.9, lr: 1.170e-06, size: 608, ETA: 10 days, 1:22:32
2022-06-16 05:32:09 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 110/1849, mem: 5612Mb, iter_time: 0.762s, data_time: 0.007s, total_loss: 14.8, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 8.1, cls_loss: 2.0, lr: 1.416e-06, size: 672, ETA: 9 days, 14:06:04
2022-06-16 05:32:16 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 120/1849, mem: 5612Mb, iter_time: 0.765s, data_time: 0.007s, total_loss: 15.0, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 8.4, cls_loss: 2.0, lr: 1.685e-06, size: 480, ETA: 9 days, 4:44:32
2022-06-16 05:32:28 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 130/1849, mem: 6030Mb, iter_time: 1.143s, data_time: 0.031s, total_loss: 21.1, iou_loss: 4.8, l1_loss: 0.0, conf_loss: 14.4, cls_loss: 1.9, lr: 1.977e-06, size: 800, ETA: 9 days, 1:17:53
2022-06-16 05:32:39 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 140/1849, mem: 6030Mb, iter_time: 1.142s, data_time: 0.015s, total_loss: 17.7, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 11.2, cls_loss: 1.8, lr: 2.293e-06, size: 768, ETA: 8 days, 22:20:01
2022-06-16 05:32:50 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 150/1849, mem: 6030Mb, iter_time: 1.106s, data_time: 0.009s, total_loss: 16.3, iou_loss: 4.8, l1_loss: 0.0, conf_loss: 9.5, cls_loss: 2.0, lr: 2.633e-06, size: 736, ETA: 8 days, 19:24:10
2022-06-16 05:32:58 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 160/1849, mem: 6030Mb, iter_time: 0.763s, data_time: 0.037s, total_loss: 14.4, iou_loss: 4.8, l1_loss: 0.0, conf_loss: 7.7, cls_loss: 1.9, lr: 2.995e-06, size: 480, ETA: 8 days, 13:32:08
2022-06-16 05:33:09 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 170/1849, mem: 6030Mb, iter_time: 1.122s, data_time: 0.039s, total_loss: 21.1, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 14.2, cls_loss: 2.2, lr: 3.381e-06, size: 704, ETA: 8 days, 11:36:45
2022-06-16 05:33:18 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 180/1849, mem: 6030Mb, iter_time: 0.874s, data_time: 0.024s, total_loss: 15.0, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 8.4, cls_loss: 1.8, lr: 3.791e-06, size: 512, ETA: 8 days, 7:46:40
2022-06-16 05:33:26 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 190/1849, mem: 6030Mb, iter_time: 0.792s, data_time: 0.038s, total_loss: 17.8, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 11.0, cls_loss: 2.2, lr: 4.224e-06, size: 736, ETA: 8 days, 3:40:59
2022-06-16 05:33:33 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 200/1849, mem: 6030Mb, iter_time: 0.762s, data_time: 0.048s, total_loss: 14.2, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 7.7, cls_loss: 1.8, lr: 4.680e-06, size: 544, ETA: 7 days, 23:45:56

For comparison, the same run without --fp16:

python -m yolox.tools.train -n yolox-s -d 8 -b 64

2022-06-16 05:51:34 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 100/1849, mem: 7420Mb, iter_time: 0.759s, data_time: 0.011s, total_loss: 13.5, iou_loss: 4.6, l1_loss: 0.0, conf_loss: 6.6, cls_loss: 2.2, lr: 1.170e-06, size: 480, ETA: 9 days, 18:18:25
2022-06-16 05:51:45 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 110/1849, mem: 7420Mb, iter_time: 1.013s, data_time: 0.029s, total_loss: 22.6, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 16.0, cls_loss: 1.8, lr: 1.416e-06, size: 768, ETA: 9 days, 11:11:40
2022-06-16 05:51:53 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 120/1849, mem: 7420Mb, iter_time: 0.828s, data_time: 0.038s, total_loss: 19.7, iou_loss: 4.6, l1_loss: 0.0, conf_loss: 12.3, cls_loss: 2.8, lr: 1.685e-06, size: 768, ETA: 9 days, 2:53:32
2022-06-16 05:52:01 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 130/1849, mem: 7420Mb, iter_time: 0.820s, data_time: 0.004s, total_loss: 18.6, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 12.0, cls_loss: 1.8, lr: 1.977e-06, size: 640, ETA: 8 days, 19:46:08
2022-06-16 05:52:11 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 140/1849, mem: 7420Mb, iter_time: 0.974s, data_time: 0.032s, total_loss: 14.7, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 8.0, cls_loss: 2.0, lr: 2.293e-06, size: 672, ETA: 8 days, 15:21:03
2022-06-16 05:52:19 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 150/1849, mem: 7420Mb, iter_time: 0.776s, data_time: 0.043s, total_loss: 15.0, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 8.4, cls_loss: 1.9, lr: 2.633e-06, size: 576, ETA: 8 days, 9:29:18
2022-06-16 05:52:27 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 160/1849, mem: 7420Mb, iter_time: 0.807s, data_time: 0.026s, total_loss: 18.7, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 11.6, cls_loss: 2.5, lr: 2.995e-06, size: 800, ETA: 8 days, 4:39:29
2022-06-16 05:52:35 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 170/1849, mem: 7420Mb, iter_time: 0.803s, data_time: 0.035s, total_loss: 17.5, iou_loss: 4.8, l1_loss: 0.0, conf_loss: 11.0, cls_loss: 1.7, lr: 3.381e-06, size: 672, ETA: 8 days, 0:21:52
2022-06-16 05:52:43 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 180/1849, mem: 7420Mb, iter_time: 0.811s, data_time: 0.033s, total_loss: 14.7, iou_loss: 4.6, l1_loss: 0.0, conf_loss: 7.9, cls_loss: 2.2, lr: 3.791e-06, size: 576, ETA: 7 days, 20:37:03
2022-06-16 05:52:50 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 190/1849, mem: 7420Mb, iter_time: 0.765s, data_time: 0.020s, total_loss: 14.0, iou_loss: 4.6, l1_loss: 0.0, conf_loss: 7.2, cls_loss: 2.2, lr: 4.224e-06, size: 480, ETA: 7 days, 16:53:34
2022-06-16 05:52:59 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 200/1849, mem: 7420Mb, iter_time: 0.808s, data_time: 0.011s, total_loss: 16.5, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 10.0, cls_loss: 1.9, lr: 4.680e-06, size: 640, ETA: 7 days, 13:51:54
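
To put a number on the comparison between the two 8-GPU runs, here is a minimal sketch that averages the logged iter_time values. The numbers are copied from the log lines above (iter 100 to 200 of each run); the variable names are illustrative only. Because the input size changes between log lines, this is a rough comparison over a short window, not a controlled benchmark.

```python
from statistics import mean

# iter_time values in seconds, copied from the two 8-GPU logs above.
fp16_iter_times = [0.764, 0.762, 0.765, 1.143, 1.142, 1.106,
                   0.763, 1.122, 0.874, 0.792, 0.762]
fp32_iter_times = [0.759, 1.013, 0.828, 0.820, 0.974, 0.776,
                   0.807, 0.803, 0.811, 0.765, 0.808]

fp16_mean = mean(fp16_iter_times)
fp32_mean = mean(fp32_iter_times)

# A value above 1.0 would mean --fp16 is faster over this window.
speedup = fp32_mean / fp16_mean

print(f"fp16 mean iter_time: {fp16_mean:.3f}s")
print(f"fp32 mean iter_time: {fp32_mean:.3f}s")
print(f"speedup from --fp16: {speedup:.2f}x")
```

Over this window the --fp16 run is not faster on average, even though its logged memory (5612 to 6030 Mb) is lower than the fp32 run's 7420 Mb, which suggests mixed precision is active but is not the limiting factor here.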

I also tested with 4 GPUs:

python -m yolox.tools.train -n yolox-s -d 4 -b 64 --fp16

2022-06-16 05:43:34 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 100/1849, mem: 7527Mb, iter_time: 0.832s, data_time: 0.004s, total_loss: 19.0, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 12.2, cls_loss: 2.1, lr: 1.170e-06, size: 736, ETA: 10 days, 5:40:25
2022-06-16 05:43:43 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 110/1849, mem: 7527Mb, iter_time: 0.873s, data_time: 0.036s, total_loss: 18.7, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 11.8, cls_loss: 2.2, lr: 1.416e-06, size: 800, ETA: 9 days, 19:34:05
2022-06-16 05:43:53 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 120/1849, mem: 7527Mb, iter_time: 0.973s, data_time: 0.036s, total_loss: 15.0, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 8.3, cls_loss: 2.0, lr: 1.685e-06, size: 544, ETA: 9 days, 12:25:19
2022-06-16 05:44:02 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 130/1849, mem: 7527Mb, iter_time: 0.886s, data_time: 0.032s, total_loss: 17.6, iou_loss: 4.6, l1_loss: 0.0, conf_loss: 10.6, cls_loss: 2.4, lr: 1.977e-06, size: 800, ETA: 9 days, 5:21:01
2022-06-16 05:44:10 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 140/1849, mem: 7527Mb, iter_time: 0.811s, data_time: 0.031s, total_loss: 16.0, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 9.3, cls_loss: 2.0, lr: 2.293e-06, size: 704, ETA: 8 days, 22:27:48
2022-06-16 05:44:18 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 150/1849, mem: 7527Mb, iter_time: 0.845s, data_time: 0.036s, total_loss: 16.2, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 9.4, cls_loss: 2.1, lr: 2.633e-06, size: 800, ETA: 8 days, 16:50:28
2022-06-16 05:44:26 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 160/1849, mem: 7527Mb, iter_time: 0.758s, data_time: 0.026s, total_loss: 14.7, iou_loss: 4.6, l1_loss: 0.0, conf_loss: 8.1, cls_loss: 2.0, lr: 2.995e-06, size: 544, ETA: 8 days, 11:05:04
2022-06-16 05:44:34 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 170/1849, mem: 7527Mb, iter_time: 0.821s, data_time: 0.021s, total_loss: 18.5, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 11.6, cls_loss: 2.3, lr: 3.381e-06, size: 704, ETA: 8 days, 6:34:39
2022-06-16 05:44:42 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 180/1849, mem: 7527Mb, iter_time: 0.820s, data_time: 0.020s, total_loss: 16.0, iou_loss: 4.6, l1_loss: 0.0, conf_loss: 9.1, cls_loss: 2.3, lr: 3.791e-06, size: 640, ETA: 8 days, 2:33:38
2022-06-16 05:44:50 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 190/1849, mem: 7527Mb, iter_time: 0.820s, data_time: 0.024s, total_loss: 15.6, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 8.8, cls_loss: 2.1, lr: 4.224e-06, size: 736, ETA: 7 days, 22:58:02
2022-06-16 05:44:58 | INFO     | yolox.core.trainer:261 - epoch: 1/300, iter: 200/1849, mem: 7527Mb, iter_time: 0.707s, data_time: 0.022s, total_loss: 13.6, iou_loss: 4.7, l1_loss: 0.0, conf_loss: 7.0, cls_loss: 2.0, lr: 4.680e-06, size: 480, ETA: 7 days, 18:51:24
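
As a rough check on the ETA values in the logs above, the sketch below projects total training time from a fixed per-iteration time. The epoch count (300) and iterations per epoch (1849) are taken from the log lines; the function name and the sample iter_time values are illustrative choices, and the projection ignores evaluation, checkpointing, and any change in speed later in training.

```python
EPOCHS = 300            # from "epoch: 1/300" in the logs
ITERS_PER_EPOCH = 1849  # from "iter: 100/1849" in the logs
SECONDS_PER_DAY = 24 * 60 * 60


def projected_days(iter_time_s: float) -> float:
    """Total training time in days if every iteration took iter_time_s seconds."""
    total_iters = EPOCHS * ITERS_PER_EPOCH
    return total_iters * iter_time_s / SECONDS_PER_DAY


# Sample values spanning the range of iter_time seen in the logs.
for iter_time in (0.76, 0.85, 1.14):
    print(f"{iter_time:.2f}s/iter -> {projected_days(iter_time):.1f} days")
```

At about 0.76 s per iteration this gives a little under 5 days, which is lower than the logged ETA. The logged ETA also keeps falling from line to line, so it may be averaging in slower iterations from the start of the run; that is a guess, not something the logs confirm.
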
@Joker316701882
Member

@gathierry Your data_time is too long. Try our latest PR #1584 with the updated --cache function!
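
One way to check how much of each iteration is spent waiting on data is to parse the iter_time and data_time fields from the trainer log. The sketch below does that for a few excerpts of the log lines posted in this thread (only the fields used here are kept). The regular expression and the helper name `data_time_shares` are illustrative, not part of YOLOX, and the sketch assumes data_time is counted inside iter_time.

```python
import re

# Excerpts of log lines from this thread; loss and lr fields are omitted.
LOG_EXCERPT = """
iter: 100/1849, mem: 5612Mb, iter_time: 0.764s, data_time: 0.007s
iter: 130/1849, mem: 6030Mb, iter_time: 1.143s, data_time: 0.031s
iter: 200/1849, mem: 6030Mb, iter_time: 0.762s, data_time: 0.048s
"""

# Captures (iter_time, data_time) in seconds from one log line.
PATTERN = re.compile(r"iter_time: ([\d.]+)s, data_time: ([\d.]+)s")


def data_time_shares(log_text: str) -> list[float]:
    """Return data_time / iter_time for every line that has both fields."""
    return [float(data) / float(it) for it, data in PATTERN.findall(log_text)]


shares = data_time_shares(LOG_EXCERPT)
for share in shares:
    print(f"data loading share of iteration: {share:.1%}")
```

For these excerpts the share ranges from under 1% to about 6% of the iteration time. Running the same helper over a full log file would show whether the share grows later in training.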

@LordonCN

I'm using the latest code, but training is still too slow.
[Screenshot attached, 2023-10-20]
