
Add logging for learning rates in MetricsProcessor #1413

Merged
tianyu-l merged 4 commits into pytorch:main from idoh:log-lr on Jul 31, 2025

Conversation

idoh (Contributor) commented Jul 18, 2025

This PR adds learning rate logging. There was a previous attempt to implement this in an earlier PR (pytorch#937), but that one was ultimately closed. This version ensures that LR logging works properly; I verified it using the WSD scheduler that was recently added in another PR (pytorch#938).
[Screenshot: learning rate chart, https://github.com/user-attachments/assets/8f23674a-d689-4cc2-9d9b-30bff4e63f3b]

One design consideration here is that torchtitan supports multiple optimizers and learning rate schedules, each potentially having its own LR. However, in practice, I believe that 99.9999% of use cases will use a single LR.

Given that, the logging works as follows:

  • If there is only one learning rate, it gets logged directly under the main charts as lr.
  • If there are multiple learning rates, they are logged under a separate section, each with its corresponding label.

Alternatively, we could have ignored the multi-LR case and always logged a single LR, but I prefer this approach since it handles both scenarios robustly with minimal extra code.

Happy to adjust if others have a strong preference for simplicity over robustness.
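For illustration, here is a minimal sketch of that rule; the helper name and the label-to-scheduler mapping are assumptions for this example, not torchtitan's actual API:

```python
# Sketch only: `lr_metrics` and the label -> scheduler mapping are hypothetical.
from torch.optim.lr_scheduler import LRScheduler


def lr_metrics(schedulers: dict[str, LRScheduler]) -> dict[str, float]:
    lrs = {label: sched.get_last_lr()[0] for label, sched in schedulers.items()}
    if len(lrs) == 1:
        # Single-LR case: log it under the main charts as plain "lr".
        return {"lr": next(iter(lrs.values()))}
    # Multi-LR case: one labeled entry per scheduler, under a separate section.
    return {f"lr/{label}": lr for label, lr in lrs.items()}
```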

idoh (Contributor, Author) commented Jul 21, 2025

@wwwjn I think this can be super useful for researchers, since convergence curves and the learning rate are highly correlated, and logging the LR makes that relationship easy to visualize. Are there any additional checks you would like me to run?

wwwjn (Contributor) commented Jul 21, 2025

Thanks for making this PR! It looks good to me; I'll let @tianyu-l take another look before merging.

idoh (Contributor, Author) commented Jul 26, 2025

Hey @tianyu-l, I noticed there is a new PR addressing this same issue, so I'm guessing many users need this feature. Your feedback and help in pushing this PR forward would be greatly appreciated.

tianyu-l (Contributor) left a comment

sorry for the delay in reviewing

Let's just log self.lr_schedulers.schedulers[0].get_last_lr()[0].
Currently in torchtitan:

  • schedulers[0] gives the lr scheduler for the model part from the first PP stage (if there are multiple model parts).
  • get_last_lr()[0] gives the first lr, corresponding to the first optimizer param group within an optimizer, of which each model part currently has only one.

As you have noted, across all optimizers they are always the same, so let's keep it simple -- users can always extend https://github.com/pytorch/torchtitan/blob/main/torchtitan/protocols/train_spec.py#L55 to do fancier things.

This is also consistent with current checkpoint behavior on LR scheduler:
https://github.com/pytorch/torchtitan/blob/main/torchtitan/components/lr_scheduler.py#L68

One caution: since the logging happens after the schedulers.step() call, my understanding is that on step i, what's being logged is step i+1's learning rate, instead of step i's.
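A standalone way to see both the `get_last_lr()[0]` semantics and that off-by-one, using plain PyTorch rather than torchtitan code (the model, optimizer, and scheduler here are arbitrary stand-ins):

```python
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.1, total_iters=10
)

for step in range(3):
    # get_last_lr() returns one value per param group; [0] picks the first.
    lr_used = scheduler.get_last_lr()[0]
    optimizer.step()
    scheduler.step()
    # After scheduler.step(), get_last_lr() already reflects the next step,
    # so logging at this point records step i+1's lr while running step i.
    lr_next = scheduler.get_last_lr()[0]
    print(f"step {step}: lr used = {lr_used:.6f}, lr after step() = {lr_next:.6f}")
```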

idoh (Contributor, Author) commented Jul 28, 2025

Hey @tianyu-l, I have updated the PR, simplifying the learning rate logging to only the first one. Good point regarding the i+1 vs i step; I did not address it in this commit, so let me know if you think it should be.

As for the logging itself, I believe it makes more sense to keep the log function minimal, without adding new arguments to it. While considering whether to add a new argument to log, I found another call site in the code that uses this log function, and it is apparently already broken because it is missing the recently added grad_norm argument. So I think these additional arguments should just be passed through extra_metrics. The tradeoff is that the progress-output logging becomes less straightforward.

Which logging design would you prefer? If we go with adding arguments to log, every call site across the codebase would need to be updated. However, if the intent is to have only a single place that calls log and to deprecate all the other train files, then I think we should pass the arguments to log directly. Let me know what you prefer and I can update the code accordingly across the repository.
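To make the tradeoff concrete, here is a hedged sketch of the extra_metrics route; the log signature below is illustrative, not the exact one in metrics.py:

```python
# Illustrative only: not torchtitan's real MetricsProcessor.log signature.
def log(step: int, loss: float, grad_norm: float,
        extra_metrics: dict[str, float] | None = None) -> None:
    metrics = {"loss": loss, "grad_norm": grad_norm}
    if extra_metrics:
        metrics.update(extra_metrics)
    # In the real processor these would go to the metric writers and the
    # progress output; here we just print them.
    print(f"step={step} " + " ".join(f"{k}={v:.4g}" for k, v in metrics.items()))


# Existing call sites stay unchanged; only callers with an lr pass it along.
log(step=10, loss=2.31, grad_norm=1.7, extra_metrics={"lr": 3e-4})
```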

tianyu-l (Contributor) left a comment

LGTM, please add a note in the comment.

> While considering whether to add a new argument to log, I found another call site in the code that uses this log function, and it is apparently already broken because it is missing the recently added grad_norm argument. So I think these additional arguments should just be passed through extra_metrics. The tradeoff is that the progress-output logging becomes less straightforward.

Ignore that.

@@ -290,6 +290,7 @@ def __init__(self, job_config: JobConfig):
             )
         )
+        self.metrics_processor.optimizers = self.optimizers
Contributor commented on the diff:
Sorry our ask was to revert this line, not the others.
I think we should still log the lr on step i.
You can just update this line to do that
https://github.com/pytorch/torchtitan/blob/main/torchtitan/train.py#L501
without adding lr_schedulers to MetricsProcessor.
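In standalone form (plain PyTorch with a placeholder log function, not the actual train.py call), the suggested shape is roughly:

```python
# Sketch: read the lr at the log call site so step i's value is logged,
# without storing lr_schedulers on the metrics processor.
import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)


def log_metrics(step: int, **metrics: float) -> None:
    # Placeholder for the real metrics logging call.
    print(f"step {step}: " + ", ".join(f"{k}={v:.4f}" for k, v in metrics.items()))


for step in range(3):
    loss = model(torch.randn(8, 2)).square().mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Log step i's learning rate before advancing the scheduler.
    log_metrics(step, loss=loss.item(), lr=scheduler.get_last_lr()[0])
    scheduler.step()
```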

Contributor Author replied:

Done ✅
Let me know if you would like me to remove lr_schedulers from metrics.py, as it is not being used and probably will not be in the near future.

tianyu-l (Contributor) left a comment
LGTM!

@tianyu-l tianyu-l merged commit cf30b29 into pytorch:main Jul 31, 2025
8 checks passed
bentherien pushed a commit to bentherien/torchtitan_ that referenced this pull request Aug 5, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
