[RFC] Add lr into metric logging and also rename the loss name #939
Conversation
tianyu-l left a comment
FYI, it seems similar things (with a bigger change) are being done in #938. Maybe we can suggest changes over there?
@tianyu-l this is different. Mostly this is a metrics improvement. We also need a cosine lr schedule (I just haven't had time to do that, but that's OK). To me these are separate topics, and the other PR's author is using lr_min.
return device_memory_monitor
def metric_processor(metrics: Dict[str, Any]):
Hmm, maybe we can do this offline in a util if we want to rename metrics? (I assume that's what this is for?)
Well, this is meant to update them offline, but we need a hook in train.py.
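For illustration only, a minimal sketch of what such a hook could look like; the rename table and the call site shown in the comments are assumptions, not torchtitan's actual API:

```python
from typing import Any, Dict

# Hypothetical rename table; the entry below is just the example from this thread.
METRIC_RENAMES = {
    "loss_metrics/global_avg_loss": "train/avg_loss_AVG",
}


def metric_processor(metrics: Dict[str, Any]) -> Dict[str, Any]:
    """Rename metric keys right before they are handed to the logger."""
    return {METRIC_RENAMES.get(name, name): value for name, value in metrics.items()}


# Assumed call site inside the training loop (train.py), just before logging:
#   metrics = metric_processor(metrics)
#   metric_logger.log(metrics, step=train_state.step)
```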
What I meant by offline is: run a script that processes the .tb file. I don't know if renaming the written metrics is a common enough use case to justify adding a hook into torchtitan. Perhaps it's OK to just use some less convenient method to generate the comparison plots between torchtitan and an incompatible trainer? I don't know... I do see why you want this, though.
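For reference, a rough sketch of that offline approach using TensorBoard's EventAccumulator; the tag mapping and directory names here are assumptions for illustration:

```python
# Sketch only: rewrite scalar tags from an existing TensorBoard log directory into a new one.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
from torch.utils.tensorboard import SummaryWriter

TAG_MAP = {"loss_metrics/global_avg_loss": "train/avg_loss_AVG"}  # example mapping


def rewrite_tb_scalars(src_dir: str, dst_dir: str) -> None:
    acc = EventAccumulator(src_dir)
    acc.Reload()  # parse the existing .tfevents files
    with SummaryWriter(dst_dir) as writer:
        for tag in acc.Tags()["scalars"]:
            new_tag = TAG_MAP.get(tag, tag)
            for event in acc.Scalars(tag):
                writer.add_scalar(new_tag, event.value, event.step, walltime=event.wall_time)


# rewrite_tb_scalars("outputs/tb/torchtitan_run", "outputs/tb/torchtitan_run_renamed")
```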
Will adopt what is done here: #945
During some experiments we realized that we need to log the lr into the metrics so that we can make sure things are correct.
Also, the name of the loss metric does not match the run we want to compare against. So I am wondering if we can build a metric-update util so that we can set and change the names for different workloads.
For example, changing loss_metrics/global_avg_loss to train/avg_loss_AVG.
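As a rough sketch of the lr-logging part, assuming a standard PyTorch optimizer/scheduler and a plain metrics dict (the key names here are examples, not torchtitan's actual schema):

```python
import torch

# Toy model/optimizer/scheduler just to keep the snippet self-contained.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 1.0)

metrics = {"loss_metrics/global_avg_loss": 2.31}  # placeholder loss value

# Record the current learning rate alongside the loss so runs can be sanity-checked.
metrics["optimizer/lr"] = scheduler.get_last_lr()[0]  # the "optimizer/lr" key is an assumption
```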