
How to resume training from a checkpoint #7

Open
Xiaofei-Kevin-Yang opened this issue May 10, 2024 · 2 comments

Comments

@Xiaofei-Kevin-Yang

Hello, and thank you very much for open-sourcing this work. I am currently training Salience-DETR with FocalNet as the backbone and have run into two problems:

1. CUDA out of memory. I am training on an RTX 3090 with batch_size=2.
2. How do I resume training from a checkpoint? I tried setting resume_from_checkpoint = /hy-tmp/2024-05-09-22_34_34/best_ap.pth in the train_config file, but it reports a syntax error.

Thanks for your help.

@xiuqhou
Owner

xiuqhou commented May 10, 2024

Hello, the FocalNet-large model itself is quite large, and the 24 GB of a 3090 is not enough to train FocalNet-large-based Salience-DETR with batch_size=2. You can try the following:

  1. Use fp16 or bf16 mixed-precision training to reduce memory usage; just append --mixed-precision to the training command:

accelerate launch main.py --mixed-precision bf16  # bf16 mixed precision; requires GPU and PyTorch support
# or
accelerate launch main.py --mixed-precision fp16  # fp16 mixed precision

  2. Set batch_size to 1 (and ideally scale the learning rate proportionally to 5e-5 to keep performance stable); see the config sketch after this list.

  3. Use a Salience-DETR variant with a smaller backbone, e.g. salience_detr_resnet50_800_1333.py, salience_detr_swin_l_800_1333.py, or salience_detr_convnext_l_800_1333.py.

  4. If possible, switch to a GPU with more memory and solve the out-of-memory problem at the hardware level.
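For option 2, the corresponding change in the training config might look like the following minimal sketch; the field names batch_size and learning_rate are assumptions for illustration and may be spelled differently in the actual train_config:

# Hypothetical train_config snippet -- adjust names to match the real config file
batch_size = 1          # halved from 2 to cut per-step GPU memory
learning_rate = 5e-5    # scaled down proportionally with batch_size to keep training stable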

@xiuqhou
Owner

xiuqhou commented May 10, 2024

Based on your description, the syntax error when resuming training is probably because the path is not wrapped in quotes:

resume_from_checkpoint = "/hy-tmp/2024-05-09-22_34_34/best_ap.pth"

The path passed here can be either the output folder produced during training or a model weight file; they behave differently, as the following examples show:

  • resume_from_checkpoint = "/hy-tmp/2024-05-09-22_34_34/"
    When the path is the output folder of a previous run, training resumes from the epoch where it was last interrupted (checkpoint resumption). The training records stay in the original /hy-tmp/2024-05-09-22_34_34 directory, and the logs are appended to the existing ones.

  • resume_from_checkpoint = "/hy-tmp/2024-05-09-22_34_34/best_ap.pth"
    When the path is a model weight file, the weights are loaded, but training starts over from epoch=0 and produces a new output folder and new logs.

Therefore, if you want to resume the run that was interrupted, use resume_from_checkpoint = "/hy-tmp/2024-05-09-22_34_34/" (see the config snippet below).
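Putting the two cases together, the relevant train_config entry might look like this minimal sketch (the paths are the ones from your run; keep whichever line matches the behavior you want):

# Resume the interrupted run from its last epoch, reusing the original output directory and logs:
resume_from_checkpoint = "/hy-tmp/2024-05-09-22_34_34/"
# Or: load only the weights and start a fresh run from epoch 0 (new output folder and logs):
# resume_from_checkpoint = "/hy-tmp/2024-05-09-22_34_34/best_ap.pth"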

Thank you for your interest in this project. Feel free to ask if you have any further questions!
