Add logging for learning rates in MetricsProcessor #1413
tianyu-l merged 4 commits into pytorch:main
Conversation
@wwwjn I think this can be super useful for researchers, as convergence curves and the learning rate are highly correlated, and this helps visualize the learning rate. Are there any additional checks you would like me to run?
Thanks for making this PR! It looks good to me; I'll let @tianyu-l take another look before merging.
tianyu-l left a comment
Sorry for the delay in reviewing.

Let's just log `self.lr_schedulers.schedulers[0].get_last_lr()[0]`. Currently in torchtitan:

- `schedulers[0]` gives the LR scheduler for the model part from the first PP stage (if there are multiple model parts).
- `get_last_lr()[0]` gives the first LR, corresponding to the first optimizer group within an optimizer, of which each model part currently has only one.
As you have noted, across all optimizers they are always the same, so let's keep it simple -- users can always extend https://github.com/pytorch/torchtitan/blob/main/torchtitan/protocols/train_spec.py#L55 to do fancier things.
This is also consistent with current checkpoint behavior on LR scheduler:
https://github.com/pytorch/torchtitan/blob/main/torchtitan/components/lr_scheduler.py#L68
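To spell out the indexing, here is an illustrative breakdown (attribute names are taken from the discussion above, not verified against torchtitan internals):

```python
# Illustrative breakdown of the suggested expression:
first_stage_scheduler = self.lr_schedulers.schedulers[0]  # model part from the first PP stage
lr = first_stage_scheduler.get_last_lr()[0]               # first optimizer param group's LR
```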
One caution: since the logging step happens after the `schedulers.step()` call, my understanding is that on step i, what gets logged is step i+1's learning rate rather than step i's.
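Here is a tiny standalone repro of that off-by-one using a plain PyTorch scheduler (illustrative, not torchtitan code):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# After scheduler.step() at step i, get_last_lr() already reflects the LR
# that step i+1 will use.
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=1.0)
sched = LambdaLR(opt, lr_lambda=lambda step: 0.5 ** step)  # LR halves each step

for i in range(3):
    opt.step()                          # this update used the LR for step i
    sched.step()                        # advances the schedule to step i+1
    print(i, sched.get_last_lr()[0])    # i=0 prints 0.5, not the 1.0 just used
```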
Hey @tianyu-l, I have updated the PR by simplifying the learning rate logging to only the first one. It's a good point regarding the off-by-one from `schedulers.step()`. As for the logging itself, I believe it makes more sense to keep the `lr_schedulers` on `MetricsProcessor`. Which logging design would you prefer? If we go with adding arguments to `log`, we would also need to update its existing call sites.
tianyu-l left a comment
LGTM, please add a note in the comment.
While thinking about adding a new argument to `log`, I found that there is already another call site in the code that uses this log function, and it is apparently already broken because it is missing the recently added `grad_norm` argument. So I think these additional arguments should just be passed via `extra_metrics`. The tradeoff is that logging for the progress outputs then becomes less straightforward.
ignore that
```diff
@@ -290,6 +290,7 @@ def __init__(self, job_config: JobConfig):
             )
         )
+        self.metrics_processor.optimizers = self.optimizers
```
Sorry, our ask was to revert this line, not the others.
I think we should still log the LR on step i. You can just update this line to do that, without adding `lr_schedulers` to `MetricsProcessor`:
https://github.com/pytorch/torchtitan/blob/main/torchtitan/train.py#L501
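For illustration, the suggested update could look roughly like this (a hedged sketch; the actual `log` signature and surrounding locals in train.py may differ):

```python
# Hypothetical sketch of the train.py call site: read the LR at logging time
# and pass it through, instead of storing lr_schedulers on MetricsProcessor.
# `extra_metrics` is assumed from the discussion above.
lr = self.lr_schedulers.schedulers[0].get_last_lr()[0]
self.metrics_processor.log(
    self.step,
    global_avg_loss,  # assumed existing local at the call site
    global_max_loss,  # assumed existing local at the call site
    grad_norm,
    extra_metrics={"lr": lr},
)
```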
Done ✅
Let me know if you would like me to remove `lr_schedulers` from metrics.py, as it is not being used and probably will not be in the near future.
This PR adds learning rate logging. There was a previous attempt to implement this in an [earlier PR](pytorch#937), but that one was ultimately **closed**. This version ensures that LR logging works properly; I verified it using the WSD scheduler that was recently added in [another PR](pytorch#938).

<img width="1842" height="730" alt="image" src="https://github.com/user-attachments/assets/8f23674a-d689-4cc2-9d9b-30bff4e63f3b" />

One design consideration here is that torchtitan supports multiple optimizers and learning rate schedulers, each potentially having its own LR. However, in practice, I believe that 99.9999% of use cases will use a single LR. Given that, the logging works as follows:

- If there is only one learning rate, it gets logged directly under the main charts as `lr`.
- If there are multiple learning rates, they are logged under a separate section, each with its corresponding label.

Alternatively, we could have ignored the multi-LR case and always logged a single LR, but I prefer this approach since it handles both scenarios robustly with minimal extra code. Happy to adjust if others have a strong preference for simplicity over robustness.
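For reference, a minimal sketch of the single-vs-multi LR branching described above (a hypothetical helper, not the exact code merged in this PR):

```python
def lr_metrics(lr_schedulers) -> dict[str, float]:
    """Hypothetical helper mirroring the design above: a single LR is logged
    under the main charts as "lr"; multiple LRs each get a labeled entry."""
    lrs = [s.get_last_lr()[0] for s in lr_schedulers.schedulers]
    if len(lrs) == 1:
        return {"lr": lrs[0]}
    return {f"lr/scheduler_{i}": lr for i, lr in enumerate(lrs)}
```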