
Learning rate for training #21

Open
jacksonsc007 opened this issue Jul 10, 2024 · 3 comments
Labels: question (Further information is requested)

Comments

@jacksonsc007

Question

Hi, @xiuqhou
Thanks for your enlightening work. I ran into some questions while reproducing it.

  1. How many GPUs did you use to train the model?

  2. Do I need to change the initial learning rate if I use a different total batch size (num_gpus * batchsize_per_gpu)? Is there a policy that makes the final performance insensitive to the total batch size?

In my own experiments, model performance is not consistent across different values of total_batch_size. I experimented with 1x2 (1 GPU, 2 images per GPU) and 4x4 (4 GPUs, 4 images per GPU) settings with the same initial learning rate, and the results show a non-trivial gap between them (the 4x4 setting lags behind the 1x2 setting by 2 AP).

Best regards


@jacksonsc007 jacksonsc007 added the question Further information is requested label Jul 10, 2024
@xiuqhou
Owner

xiuqhou commented Jul 10, 2024

Hello @jacksonsc007, thank you for your question.

  1. We use 2 * A800 GPUs to train the model. The batch_size on each GPU is 5, so the total batch_size is 10. The learning rate is set to 1e-4.
  2. There are two common policies for adjusting the learning rate according to the total batch size. If the batch_size increases by a factor of K, you can multiply lr by sqrt(K) to keep the gradient variance unchanged, or multiply it by K according to the linear scaling rule. In practice, the latter is more commonly used.

We use lr=1e-4 for total_batch_size=10, so with the linear scaling rule you should use lr=1.6e-4 for total_batch_size=16 and lr=2e-5 for total_batch_size=2 to achieve similar performance.
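For concreteness, here is a minimal sketch of how the two scaling policies translate into numbers, using the base setting above (lr=1e-4 at total_batch_size=10). The helper `scaled_lr` is just an illustration, not something from this repository:

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Scale a base learning rate to a new total batch size.

    rule="linear": lr grows proportionally to the batch-size ratio (linear scaling rule).
    rule="sqrt":   lr grows with the square root of the ratio (keeps gradient variance roughly constant).
    """
    k = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * k
    if rule == "sqrt":
        return base_lr * k ** 0.5
    raise ValueError(f"unknown rule: {rule!r}")


# Base setting from this thread: lr=1e-4 at total_batch_size=10 (2 GPUs x 5 images).
print(f"{scaled_lr(1e-4, 10, 16):.1e}")  # 1.6e-04 (4 GPUs x 4 images per GPU)
print(f"{scaled_lr(1e-4, 10, 2):.1e}")   # 2.0e-05 (1 GPU x 2 images per GPU)
```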

@jacksonsc007
Author

Thanks for your prompt reply. I will try your suggestions and report the results later.

By the way, could you point me to the relevant papers for the learning rate rules you just mentioned?

@xiuqhou
Owner

xiuqhou commented Jul 10, 2024

The linear scaling rule is described in this paper: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (https://arxiv.org/abs/1706.02677)
