[training_utils] feat: impl a metrics class for recording metrics and support worker sync [WIP] #2308
0x404 wants to merge 2 commits into verl-project:main
Hi @vermouth1992, I found the approach we discussed in #2259 (comment) is somewhat tricky. We need more discussion to support I think we could support a generic This PR is still work in progress. Do you think this approach is reasonable? It would be great to hear your insights.
@@ -0,0 +1,342 @@
from collections import defaultdict
How do we want to aggregate across DP ranks?
I guess we need a function to merge them? Actually, I think allgather is a good choice, because this is what is typically done in pretraining.
Yes, I was thinking of adding a function allgather_across_group(self, group) to the Metrics class. It would gather the Metrics objects from all other ranks onto each DP rank, then merge them into a single Metrics through Metrics.merge.
What do you think? This PR is still a draft demo. If this makes sense, I will keep working on it.
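To make the proposal above concrete, here is a minimal sketch of what merge and allgather_across_group could look like. It is not the code in this PR: the internal `_values` storage, the `record` and `mean` helpers, and the single-process fallback are assumptions made for illustration. Only `torch.distributed.all_gather_object` is an existing PyTorch call, and with the NCCL backend it expects the current CUDA device to be set before it is called.

```python
from collections import defaultdict


class Metrics:
    """Hypothetical sketch: stores a list of scalar values per metric name."""

    def __init__(self):
        self._values = defaultdict(list)

    def record(self, name, value):
        # Append one observation for the given metric name.
        self._values[name].append(float(value))

    @classmethod
    def merge(cls, parts):
        # Concatenate the per-name value lists of several Metrics objects.
        merged = cls()
        for part in parts:
            for name, values in part._values.items():
                merged._values[name].extend(values)
        return merged

    def allgather_across_group(self, group=None):
        # Imported lazily so record/merge work without torch installed.
        import torch.distributed as dist

        # Assumed fallback: without an initialized process group,
        # return a copy that contains only the local values.
        if not (dist.is_available() and dist.is_initialized()):
            return Metrics.merge([self])

        # Every rank receives the pickled Metrics object of every rank
        # in the group, then merges them locally.
        world_size = dist.get_world_size(group=group)
        gathered = [None] * world_size
        dist.all_gather_object(gathered, self, group=group)
        return Metrics.merge(gathered)

    def mean(self, name):
        values = self._values.get(name)
        if not values:
            raise KeyError(name)
        return sum(values) / len(values)


# Local usage, simulating the objects gathered from two DP ranks.
rank0, rank1 = Metrics(), Metrics()
rank0.record("loss", 1.0)
rank1.record("loss", 3.0)
merged = Metrics.merge([rank0, rank1])
print(merged.mean("loss"))  # 2.0
```

Keeping the raw values and reducing only after the merge means a mean computed on the merged object is weighted by the number of samples on each rank, which averaging per-rank means would not guarantee.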
What does this PR do?
Checklist Before Starting
[{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
[BREAKING] to the beginning of the title.
[BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
High-Level Design
Specific Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
ci-request channel in the verl Slack workspace.