
[training_utils] feat: impl a metrics class for recording metrics and support worker sync [WIP]#2308

Draft
0x404 wants to merge 2 commits into verl-project:main from 0x404:metrics

Conversation

Collaborator

0x404 commented Jul 1, 2025

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@0x404
Collaborator Author

0x404 commented Jul 1, 2025

Hi @vermouth1992, I found the approach we discussed in #2259 (comment) somewhat tricky. Supporting aggregated_data for different dispatch_fn and collect_fn combinations needs more discussion, and since DataProto is going to be deprecated in the upcoming refactor, I don't think we should add aggregated_data support to it anymore.

Instead, I think we could introduce a generic Metrics class with a per-metric config. The config would specify the reduce type each metric should use, and also its display format for console logging (currently every metric is formatted with the default "{:.3f}", so the learning rate always shows as 0.000). The Metrics class could then support synchronization and reduction across ranks, and this utility could replace the plain dict we currently use for recording metrics.
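To make the proposal concrete, here is a minimal sketch of what such a class could look like. Every name in it (ReduceType, MetricConfig, Metrics, record, report) is a hypothetical illustration of the idea, not the PR's actual code:

```python
from typing import Dict, List, Optional
from dataclasses import dataclass, field
from enum import Enum


class ReduceType(Enum):
    """How a metric's recorded values are reduced when reported or synced."""
    MEAN = "mean"
    SUM = "sum"
    MAX = "max"
    MIN = "min"


@dataclass
class MetricConfig:
    """Per-metric settings: reduce behavior plus console display format."""
    reduce: ReduceType = ReduceType.MEAN
    fmt: str = "{:.3f}"  # e.g. "{:.2e}" so a learning rate is not shown as 0.000


class Metrics:
    """Records scalar metrics and reduces them according to per-metric config."""

    def __init__(self, configs: Optional[Dict[str, MetricConfig]] = None):
        self.configs = configs or {}
        self.values: Dict[str, List[float]] = {}

    def record(self, name: str, value: float) -> None:
        self.values.setdefault(name, []).append(value)

    def merge(self, other: "Metrics") -> None:
        """Fold another Metrics object's raw values into this one."""
        for name, vals in other.values.items():
            self.values.setdefault(name, []).extend(vals)

    def _reduce(self, name: str) -> float:
        vals = self.values[name]
        reduce = self.configs.get(name, MetricConfig()).reduce
        if reduce is ReduceType.MEAN:
            return sum(vals) / len(vals)
        if reduce is ReduceType.SUM:
            return sum(vals)
        if reduce is ReduceType.MAX:
            return max(vals)
        return min(vals)

    def report(self) -> Dict[str, str]:
        """Reduced, formatted values, e.g. for the console logger."""
        return {
            name: self.configs.get(name, MetricConfig()).fmt.format(self._reduce(name))
            for name in self.values
        }
```

With this shape, a trainer could register `{"actor/lr": MetricConfig(fmt="{:.2e}")}` once, keep calling `record()` inside the loop, and get correctly reduced, correctly formatted values out of `report()`.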

This PR is still work in progress. Do you think this approach is reasonable? It would be great to hear your insights.

@@ -0,0 +1,342 @@
from collections import defaultdict
Collaborator


How do we want to aggregate among dp?

Collaborator


I guess we need a function to merge them? Actually, allgather is a good choice because that's what's typically done in pretraining.

Collaborator Author


Yes, I was thinking of adding an allgather_across_group(self, group) method to the Metrics class, which would gather every rank's Metrics object onto each dp rank and then merge them into a single Metrics via Metrics.merge.

What do you think? This PR is still a draft demo, if this makes sense, I will keep working on this.
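The merge step of that proposal could be sketched as below. This is a hypothetical illustration, not the PR's code: the `gathered` list stands in for the output of a real collective (e.g. `torch.distributed.all_gather_object`), and the Metrics stub holds only raw per-metric value lists:

```python
from typing import Dict, List


class Metrics:
    """Minimal stand-in: per-metric lists of recorded values plus a merge."""

    def __init__(self) -> None:
        self.values: Dict[str, List[float]] = {}

    def record(self, name: str, value: float) -> None:
        self.values.setdefault(name, []).append(value)

    def merge(self, other: "Metrics") -> None:
        """Fold another rank's raw values into this object."""
        for name, vals in other.values.items():
            self.values.setdefault(name, []).extend(vals)


def merge_gathered(gathered: List[Metrics]) -> Metrics:
    """Merge the Metrics objects gathered from every rank into one.

    In the real allgather_across_group, `gathered` would come from a
    collective such as torch.distributed.all_gather_object, so each
    rank ends up holding the same merged Metrics.
    """
    merged = Metrics()
    for m in gathered:
        merged.merge(m)
    return merged


# Simulate two dp ranks, each recording its own loss value.
rank0, rank1 = Metrics(), Metrics()
rank0.record("loss", 0.5)
rank1.record("loss", 0.7)
merged = merge_gathered([rank0, rank1])
print(merged.values["loss"])  # [0.5, 0.7]
```

Since the raw values (rather than pre-reduced scalars) are gathered, the reduce (mean/sum/max/min) can still be applied correctly after the merge.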
