Skip to content

[megatron] chore: clean legacy code path part 3, make megatron_worker use mbridge#4530

Open
ISEEKYAN wants to merge 6 commits intoverl-project:mainfrom
ISEEKYAN:mcore_clean_part3
Open

[megatron] chore: clean legacy code path part 3, make megatron_worker use mbridge#4530
ISEEKYAN wants to merge 6 commits intoverl-project:mainfrom
ISEEKYAN:mcore_clean_part3

Conversation

@ISEEKYAN
Copy link
Collaborator

What does this PR do?

this is one of a series PRs to clean the legacy megatron code path and make bridge the default path for megatron. #4496

This PR make sure the megatron_worker must use mbridge

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a significant refactoring to remove a legacy code path and enforce the use of mbridge for Megatron workers. The changes are extensive, touching documentation, test configurations, and core worker logic. The legacy path, which involved per_tensor_generator and hf_to_mcore_config, has been systematically removed from megatron_worker.py, transformer_impl.py, and other related files. Assertions have been added to ensure use_mbridge is always true. The code has been simplified by removing conditional logic for the old and new paths. Additionally, some utility functions have been refactored to use upstream versions from megatron.core, improving maintainability. Overall, the changes are well-aligned with the PR's goal and improve the codebase. I have one suggestion for improving robustness in one of the new assertions.

trust_remote_code=False,
megatron_config=None,
):
assert megatron_config.use_mbridge, "use_mbridge must be True"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The function signature for _init_hf_config_and_tf_config allows megatron_config to be None by default. However, the assertion on this line directly accesses megatron_config.use_mbridge, which will raise an AttributeError if megatron_config is None. This can lead to an unhandled crash. It's more robust to check for None before accessing its attributes to provide a clearer error message.

Suggested change
assert megatron_config.use_mbridge, "use_mbridge must be True"
assert megatron_config is not None and megatron_config.use_mbridge, "megatron_config must be provided and use_mbridge must be True"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant