[megatron] chore: clean legacy code path part 3, make megatron_worker use mbridge #4530
ISEEKYAN wants to merge 6 commits into verl-project:main
Code Review
This pull request is a significant refactoring to remove a legacy code path and enforce the use of mbridge for Megatron workers. The changes are extensive, touching documentation, test configurations, and core worker logic. The legacy path, which involved per_tensor_generator and hf_to_mcore_config, has been systematically removed from megatron_worker.py, transformer_impl.py, and other related files. Assertions have been added to ensure use_mbridge is always true. The code has been simplified by removing conditional logic for the old and new paths. Additionally, some utility functions have been refactored to use upstream versions from megatron.core, improving maintainability. Overall, the changes are well-aligned with the PR's goal and improve the codebase. I have one suggestion for improving robustness in one of the new assertions.
        trust_remote_code=False,
        megatron_config=None,
    ):
        assert megatron_config.use_mbridge, "use_mbridge must be True"
The function signature for _init_hf_config_and_tf_config allows megatron_config to be None by default. However, the assertion on this line directly accesses megatron_config.use_mbridge, which will raise an AttributeError if megatron_config is None. This can lead to an unhandled crash. It's more robust to check for None before accessing its attributes to provide a clearer error message.
Suggested change:
-        assert megatron_config.use_mbridge, "use_mbridge must be True"
+        assert megatron_config is not None and megatron_config.use_mbridge, "megatron_config must be provided and use_mbridge must be True"
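The difference between the two assertions can be reproduced in isolation. The following is a minimal sketch, not the code from this PR: `check_unguarded`, `check_guarded`, and `error_type` are hypothetical helpers, and a `SimpleNamespace` stands in for the real Megatron config object.

```python
from types import SimpleNamespace


def check_unguarded(megatron_config=None):
    # Mirrors the original assertion: the attribute is read before any None check.
    assert megatron_config.use_mbridge, "use_mbridge must be True"


def check_guarded(megatron_config=None):
    # Mirrors the suggested assertion: None is rejected before the attribute is read.
    assert megatron_config is not None and megatron_config.use_mbridge, (
        "megatron_config must be provided and use_mbridge must be True"
    )


def error_type(check, config):
    """Return the name of the exception raised by check(config), or None."""
    try:
        check(config)
    except Exception as exc:
        return type(exc).__name__
    return None


# Hypothetical stand-in for a config with mbridge enabled.
enabled = SimpleNamespace(use_mbridge=True)

print(error_type(check_unguarded, None))    # AttributeError
print(error_type(check_guarded, None))      # AssertionError
print(error_type(check_guarded, enabled))   # None
```

With the guarded form, a missing config fails with the intended message instead of an unrelated AttributeError. Note that `assert` statements are skipped when Python runs with `-O`, so an explicit `raise ValueError(...)` would be an alternative if the check must always run.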
What does this PR do?
This is one of a series of PRs to clean up the legacy Megatron code path and make bridge the default path for Megatron (see #4496).
This PR makes sure that megatron_worker always uses mbridge.
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
If this PR involves multiple modules, separate them with commas, like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
If this PR introduces a breaking change, add [BREAKING] to the beginning of the title.
Example: [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Request CI via the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)