[RFC] Add lr into metric logging and also rename the loss name #939
Conversation
tianyu-l left a comment
FYI, it seems similar things (with a bigger change) are being done in #938. Maybe we can suggest changes over there?
@tianyu-l this is different. Mostly this is a metrics improvement. We also need a cosine lr schedule (I just haven't had time to do that, but that's OK). To me these are separate topics, and the other PR's author is using lr_min.
return device_memory_monitor
def metric_processor(metrics: Dict[str, Any]):
Hmm, maybe we can do this offline in a util if we want to rename metrics? (I assume that's what this is for?)
Well, this is meant to update them offline, but we need a hook in train.py.
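For illustration only, a minimal sketch of what such a hook could look like; the rename table and the call site shown in the comments are assumptions, not torchtitan's actual API:

```python
from typing import Any, Dict

# Hypothetical rename table; the entry below is just the example from this thread.
METRIC_RENAMES = {
    "loss_metrics/global_avg_loss": "train/avg_loss_AVG",
}


def metric_processor(metrics: Dict[str, Any]) -> Dict[str, Any]:
    """Rename metric keys right before they are handed to the logger."""
    return {METRIC_RENAMES.get(name, name): value for name, value in metrics.items()}


# Assumed call site inside the training loop (train.py), just before logging:
#   metrics = metric_processor(metrics)
#   metric_logger.log(metrics, step=train_state.step)
```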
What I meant by offline is: run a script that processes the .tb file. I don't know if renaming the written metrics is a common enough use case to justify adding a hook into torchtitan. Perhaps it's OK to just use some less convenient method to generate the comparison plots between torchtitan and an incompatible trainer? I don't know... I do see why you want this, though.
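For reference, a rough sketch of that offline approach using TensorBoard's EventAccumulator; the tag mapping and directory names here are assumptions for illustration:

```python
# Sketch only: rewrite scalar tags from an existing TensorBoard log directory into a new one.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
from torch.utils.tensorboard import SummaryWriter

TAG_MAP = {"loss_metrics/global_avg_loss": "train/avg_loss_AVG"}  # example mapping


def rewrite_tb_scalars(src_dir: str, dst_dir: str) -> None:
    acc = EventAccumulator(src_dir)
    acc.Reload()  # parse the existing .tfevents files
    with SummaryWriter(dst_dir) as writer:
        for tag in acc.Tags()["scalars"]:
            new_tag = TAG_MAP.get(tag, tag)
            for event in acc.Scalars(tag):
                writer.add_scalar(new_tag, event.value, event.step, walltime=event.wall_time)


# rewrite_tb_scalars("outputs/tb/torchtitan_run", "outputs/tb/torchtitan_run_renamed")
```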
Will adopt what is done here: #945
During some experiments we realized that we need to log the lr into the metrics so that we can make sure things are correct.
Also, the name of the loss metric does not match the run we want to compare against. So I am wondering if we can build a metric-update util so that we can set and change the names for different workloads.
For example, changing loss_metrics/global_avg_loss to train/avg_loss_AVG.
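As a rough sketch of the lr-logging part, assuming a standard PyTorch optimizer/scheduler and a plain metrics dict (the key names here are examples, not torchtitan's actual schema):

```python
import torch

# Toy model/optimizer/scheduler just to keep the snippet self-contained.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 1.0)

metrics = {"loss_metrics/global_avg_loss": 2.31}  # placeholder loss value

# Record the current learning rate alongside the loss so runs can be sanity-checked.
metrics["optimizer/lr"] = scheduler.get_last_lr()[0]  # the "optimizer/lr" key is an assumption
```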