Conversation

fduwjj (Contributor) commented on Mar 6, 2025:

During some experiments we realized that we need to log the learning rate (lr) into the metrics so that we can verify things are correct.

Also, the name of the loss metric does not match the run we want to compare against, so I am wondering if we can build a metric-update util that lets us set and change metric names for different workloads.

For example, changing loss_metrics/global_avg_loss to train/avg_loss_AVG.
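To make the idea concrete, here is a minimal sketch of what such a metric-update util could look like. The function name, the rename map, and the train/lr key are hypothetical illustrations, not part of this PR:

from typing import Any, Dict

# Hypothetical mapping from torchtitan's metric names to the names used
# by the workload we want to compare against.
METRIC_RENAME_MAP = {
    "loss_metrics/global_avg_loss": "train/avg_loss_AVG",
}


def update_metrics(metrics: Dict[str, Any], lr: float) -> Dict[str, Any]:
    """Rename metrics to the target naming scheme and attach the current lr."""
    renamed = {METRIC_RENAME_MAP.get(name, name): value for name, value in metrics.items()}
    renamed["train/lr"] = lr  # hypothetical key for the logged learning rate
    return renamed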

facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Mar 6, 2025.
tianyu-l (Contributor) left a comment:


FYI, it seems similar things (with a bigger change) are being done in #938.

Maybe we can suggest changes over there?

fduwjj (Contributor, Author) commented on Mar 6, 2025:

@tianyu-l this is different: it is mostly a metrics improvement. We also need a cosine lr schedule (I just haven't had time to do that yet, which is okay). To me these are separate topics, and that author is using lr_min.
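(As a side note on the cosine lr point, here is a minimal sketch of a cosine schedule with a minimum-lr floor, using PyTorch's built-in CosineAnnealingLR. The model, optimizer, and hyperparameters below are placeholders, not what #938 implements.)

import torch

model = torch.nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Decays the lr from its initial value down to eta_min (the "lr_min" floor)
# over T_max scheduler steps, following a cosine curve.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=1000, eta_min=3e-5
)

for _ in range(1000):
    optimizer.step()   # actual training step elided
    scheduler.step()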

fduwjj marked this pull request as ready for review on March 6, 2025 at 19:57.
Review thread on the diff lines that add the metric_processor hook:

return device_memory_monitor


def metric_processor(metrics: Dict[str, Any]):
Contributor: Hmm, maybe we can do this offline in a util if we want to rename metrics? (I assume that's what this is for?)

Contributor Author (fduwjj): Well, this is to update it offline, but we need a hook in train.py.

Contributor: What I meant by offline is: run a script that processes the .tb file. I don't know whether changing the names of the written metrics is a common enough use case to justify adding a hook into torchtitan. Perhaps it's OK to just use some less convenient method to generate the comparison plots between torchtitan and an incompatible trainer? I do see why you want this, though.
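For reference, here is a minimal sketch of that offline approach, assuming the run was logged as TensorBoard scalars. The paths and the rename map are hypothetical; only the source metric name comes from this thread:

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
from torch.utils.tensorboard import SummaryWriter

SRC_LOGDIR = "outputs/tb/original"   # hypothetical source run directory
DST_LOGDIR = "outputs/tb/renamed"    # hypothetical destination directory
RENAME_MAP = {"loss_metrics/global_avg_loss": "train/avg_loss_AVG"}

# Load every scalar series from the source event (.tb) files.
acc = EventAccumulator(SRC_LOGDIR)
acc.Reload()

# Re-emit each scalar under its (possibly renamed) tag, preserving step and
# wall time so the curves line up with the other trainer's run.
writer = SummaryWriter(DST_LOGDIR)
for tag in acc.Tags()["scalars"]:
    new_tag = RENAME_MAP.get(tag, tag)
    for event in acc.Scalars(tag):
        writer.add_scalar(new_tag, event.value, global_step=event.step, walltime=event.wall_time)
writer.close()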

fduwjj (Contributor, Author) commented on Mar 7, 2025:

Will adopt what is done here: #945

fduwjj closed this pull request on Mar 7, 2025.