Add logging for learning rates in MetricsProcessor #1413
tianyu-l merged 4 commits into pytorch:main
Conversation
@wwwjn I think this can be super useful for researchers, as convergence curves and the learning rate are highly correlated, and this helps visualize the learning rate. Are there any additional checks you would like me to run?
Thanks for making this PR! It looks good to me; I'll let @tianyu-l take another look before merging.
tianyu-l left a comment
Sorry for the delay in reviewing.

Let's just log `self.lr_schedulers.schedulers[0].get_last_lr()[0]`. Currently in torchtitan:

- `schedulers[0]` gives the LR scheduler for the model part from the first PP stage (if there are multiple model parts).
- `get_last_lr()[0]` gives the first LR, corresponding to the first optimizer group within an optimizer, of which each model part currently has only one.
As you have noted, across all optimizers they are always the same, so let's keep it simple -- users can always extend https://github.com/pytorch/torchtitan/blob/main/torchtitan/protocols/train_spec.py#L55 to do fancier things.
This is also consistent with current checkpoint behavior on LR scheduler:
https://github.com/pytorch/torchtitan/blob/main/torchtitan/components/lr_scheduler.py#L68
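To spell out the indexing, here is an illustrative breakdown (attribute names are taken from the discussion above, not verified against torchtitan internals):

```python
# Illustrative breakdown of the suggested expression:
first_stage_scheduler = self.lr_schedulers.schedulers[0]  # model part from the first PP stage
lr = first_stage_scheduler.get_last_lr()[0]               # first optimizer param group's LR
```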
One caution: since the logging step happens after the `schedulers.step()` call, my understanding is that on step i, what gets logged is step i+1's learning rate rather than step i's.
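Here is a tiny standalone repro of that off-by-one using a plain PyTorch scheduler (illustrative, not torchtitan code):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# After scheduler.step() at step i, get_last_lr() already reflects the LR
# that step i+1 will use.
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=1.0)
sched = LambdaLR(opt, lr_lambda=lambda step: 0.5 ** step)  # LR halves each step

for i in range(3):
    opt.step()                          # this update used the LR for step i
    sched.step()                        # advances the schedule to step i+1
    print(i, sched.get_last_lr()[0])    # i=0 prints 0.5, not the 1.0 just used
```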
Hey @tianyu-l, I have updated the PR by simplifying the learning rate logging to only the first one. It's a good point regarding the off-by-one from `schedulers.step()`. As for the logging itself, I believe it makes more sense to keep the `lr_schedulers` on `MetricsProcessor`. Which logging design would you prefer? If we go with adding arguments to `log`, we would also need to update its existing call sites.
tianyu-l left a comment
LGTM, please add a note in the comment.
While thinking about adding a new argument to `log`, I found that there is already another call site in the code that uses this log function, and it is apparently already broken because it is missing the recently added `grad_norm` argument. So I think these additional arguments should just be passed via `extra_metrics`. The tradeoff is that logging for the progress outputs then becomes less straightforward.
ignore that
```diff
@@ -290,6 +290,7 @@ def __init__(self, job_config: JobConfig):
             )
         )
+        self.metrics_processor.optimizers = self.optimizers
```
Sorry, our ask was to revert this line, not the others.
I think we should still log the LR on step i. You can just update this line to do that, without adding `lr_schedulers` to `MetricsProcessor`:
https://github.com/pytorch/torchtitan/blob/main/torchtitan/train.py#L501
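For illustration, the suggested update could look roughly like this (a hedged sketch; the actual `log` signature and surrounding locals in train.py may differ):

```python
# Hypothetical sketch of the train.py call site: read the LR at logging time
# and pass it through, instead of storing lr_schedulers on MetricsProcessor.
# `extra_metrics` is assumed from the discussion above.
lr = self.lr_schedulers.schedulers[0].get_last_lr()[0]
self.metrics_processor.log(
    self.step,
    global_avg_loss,  # assumed existing local at the call site
    global_max_loss,  # assumed existing local at the call site
    grad_norm,
    extra_metrics={"lr": lr},
)
```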
Done ✅
Let me know if you would like me to remove `lr_schedulers` from metrics.py, as it is not being used and probably will not be in the near future.
This PR adds learning rate logging. There was a previous attempt to implement this in an [earlier PR](pytorch#937), but that one was ultimately **closed**. This version ensures that LR logging works properly; I verified it using the WSD scheduler that was recently added in [another PR](pytorch#938).

<img width="1842" height="730" alt="image" src="https://github.com/user-attachments/assets/8f23674a-d689-4cc2-9d9b-30bff4e63f3b" />

One design consideration here is that torchtitan supports multiple optimizers and learning rate schedulers, each potentially having its own LR. However, in practice, I believe that 99.9999% of use cases will use a single LR. Given that, the logging works as follows:

- If there is only one learning rate, it gets logged directly under the main charts as `lr`.
- If there are multiple learning rates, they are logged under a separate section, each with its corresponding label.

Alternatively, we could have ignored the multi-LR case and always logged a single LR, but I prefer this approach since it handles both scenarios robustly with minimal extra code. Happy to adjust if others have a strong preference for simplicity over robustness.
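For reference, a minimal sketch of the single-vs-multi LR branching described above (a hypothetical helper, not the exact code merged in this PR):

```python
def lr_metrics(lr_schedulers) -> dict[str, float]:
    """Hypothetical helper mirroring the design above: a single LR is logged
    under the main charts as "lr"; multiple LRs each get a labeled entry."""
    lrs = [s.get_last_lr()[0] for s in lr_schedulers.schedulers]
    if len(lrs) == 1:
        return {"lr": lrs[0]}
    return {f"lr/scheduler_{i}": lr for i, lr in enumerate(lrs)}
```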