[megatron] support megatron expert parallel #1467
Merged
ETOgaosion merged 8 commits into verl-project:main on May 21, 2025
Conversation
Collaborator
Please add support for the SGLang part. We will also help to evaluate.
Collaborator
Author
The SGLang part is added; please help to evaluate, thanks.
Collaborator
@GeLee-Q is working on validation. Thanks!
ETOgaosion approved these changes May 21, 2025
cedricbeta pushed a commit to cedricbeta/verl that referenced this pull request May 21, 2025
### Checklist Before Starting
### What does this PR do?
Support expert parallelism in Megatron.
### High-Level Design
Introduce EP size and ETP size.
ETP size is the TP size for the MoE parts; the recommended setting is 1,
meaning the MoE parts do not use TP.
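For intuition, here is a minimal sanity-check sketch (not part of this PR) of how the sizes in the usage example below can fit together. It assumes the usual Megatron-Core factorization, where dense layers are sharded as TP × PP × DP and the non-pipeline ranks are re-grouped as ETP × EP × expert-DP for the MoE layers; the function and the exact divisibility rules are illustrative assumptions, not verl's API.

```python
def check_parallel_sizes(world_size: int, tp: int, pp: int, ep: int, etp: int) -> None:
    """Illustrative divisibility checks for EP/ETP settings.

    Dense layers are sharded as TP x PP x DP; MoE layers re-group the
    non-pipeline ranks as ETP x EP x expert-DP. The exact rules in
    Megatron-Core/verl may differ; this is a mental model, not their API.
    """
    assert world_size % (tp * pp) == 0, "world size must be divisible by TP * PP"
    dp = world_size // (tp * pp)
    # The tp * dp non-pipeline ranks are re-partitioned for the expert layers.
    assert (tp * dp) % (etp * ep) == 0, "TP * DP must be divisible by ETP * EP"
    expert_dp = (tp * dp) // (etp * ep)
    print(f"DP={dp}, expert-DP={expert_dp}")


# The usage example below: 1 node x 8 GPUs, TP=4, PP=2, EP=4, ETP=1.
check_parallel_sizes(world_size=8, tp=4, pp=2, ep=4, etp=1)  # DP=1, expert-DP=1
```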
### Specific Changes
1. mcore model initialization
2. Megatron-to-vLLM parameter transfer (see the sketch after this list)
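To illustrate the second point, below is a hedged sketch of what transferring expert weights to vLLM can involve when EP > 1: each rank holds only its local experts, so shards must be all-gathered over the expert-parallel group before a full tensor can be handed to the rollout engine. The helper and its layout assumptions are hypothetical; verl's actual per-tensor mapping between Megatron and vLLM naming is more involved.

```python
import torch
import torch.distributed as dist

def gather_expert_weights(local_experts: torch.Tensor, ep_group) -> torch.Tensor:
    """All-gather per-rank expert shards into one full expert weight tensor.

    local_experts: [num_local_experts, ...], this rank's slice of the experts.
    Hypothetical helper for illustration only.
    """
    ep_size = dist.get_world_size(group=ep_group)
    shards = [torch.empty_like(local_experts) for _ in range(ep_size)]
    dist.all_gather(shards, local_experts.contiguous(), group=ep_group)
    # Assumes experts are laid out contiguously by EP rank: rank r owns experts
    # [r * num_local_experts, (r + 1) * num_local_experts).
    return torch.cat(shards, dim=0)
```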
### API
### Usage Example
```bash
LLM=models/Qwen1.5-MoE-A2.7B-Chat
NODES=1
PP=2
TP=4
VLLM_TP=4
EP=4
ETP=1
python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer' \
algorithm.adv_estimator=gae \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=128 \
data.max_prompt_length=1024 \
data.max_response_length=512 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=$LLM \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
critic.optim.lr=1e-5 \
critic.model.path=$LLM \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=1 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_megatron_gsm8k_examples' \
trainer.experiment_name='qwen_moe_instruct_1node_ep' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=$NODES \
trainer.save_freq=-1 \
trainer.test_freq=5 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$VLLM_TP \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=$PP \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=$PP \
critic.megatron.pipeline_model_parallel_size=$PP \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=$TP \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=$TP \
critic.megatron.tensor_model_parallel_size=$TP \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=$EP \
actor_rollout_ref.ref.megatron.expert_model_parallel_size=$EP \
critic.megatron.expert_model_parallel_size=$EP \
actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ETP \
actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$ETP \
critic.megatron.expert_tensor_parallel_size=$ETP \
actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
critic.megatron.use_dist_checkpointing=True \
actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
critic.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
actor_rollout_ref.actor.megatron.param_offload=True \
actor_rollout_ref.ref.megatron.param_offload=True \
critic.megatron.param_offload=True \
trainer.total_epochs=100 "$@"
```
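Note on the example sizes: with 8 GPUs and TP=4, PP=2, the dense layers get DP = 8 / (4 × 2) = 1; with EP=4 and ETP=1, the four ranks that form a dense TP group are, roughly speaking, reused as the expert-parallel group for the MoE layers.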
### Test
### Additional Info.
- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: Megatron
- **Inference**: vLLM, SGLang
### Checklist Before Submitting
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
---------
Co-authored-by: gaoziyuan <gaoziyuan.955@bytedance.com>
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request May 22, 2025
ETOgaosion added a commit that referenced this pull request May 24, 2025
### Checklist Before Starting
- [x] Search for similar PR(s).
### What does this PR do?
This simple PR adds support for [ChainedOptimizer](https://github.com/NVIDIA/Megatron-LM/blob/75b1ca13618bded85c81fb572f58df83ba095dc9/megatron/core/optimizer/optimizer.py#L938) offloading in the Megatron-LM training environment. In Megatron-LM, ChainedOptimizer is used when expert parallelism (expert_parallel > 1, related to #1467) is enabled, commonly in Mixture-of-Experts (MoE) models. This has been tested and validated with the Qwen3-235B-A22B model configuration.
### High-Level Design
> Demonstrate the high-level design if this PR is complex.
### Specific Changes
> List the specific changes.
### API
> Demonstrate how the API changes if any.
### Usage Example
```bash
...
actor_rollout_ref.actor.megatron.optimizer_offload=True \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=16 \
...
```
### Test
> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
### Additional Info.
- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Megatron]
- **Inference**: [none]
### Checklist Before Submitting
- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
---------
Co-authored-by: charlie.cs <charlie.cs@kakaocorp.com>
Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn>
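As a rough illustration of what such offloading involves, here is a hedged sketch (not verl's actual code) that moves each sub-optimizer's state to CPU. It assumes Megatron-Core's `ChainedOptimizer` exposes its sub-optimizers as a `chained_optimizers` list whose members wrap a standard `torch.optim` optimizer as `.optimizer`, per the source linked above; attribute names are assumptions.

```python
import torch

def offload_chained_optimizer_state(chained_optimizer) -> None:
    """Move every sub-optimizer's state tensors to CPU (illustrative sketch).

    Assumes `chained_optimizer.chained_optimizers` is a list of optimizer
    wrappers, each holding a torch.optim optimizer as `.optimizer`;
    verl's real offload logic differs.
    """
    for sub_opt in chained_optimizer.chained_optimizers:
        inner = getattr(sub_opt, "optimizer", sub_opt)
        for state in inner.state.values():  # per-parameter state dicts
            for key, value in state.items():
                if torch.is_tensor(value) and value.is_cuda:
                    state[key] = value.detach().cpu()  # e.g. exp_avg, exp_avg_sq
    torch.cuda.empty_cache()  # return the freed blocks to the CUDA allocator
```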
ETOgaosion added a commit to Jianbing-D/verl that referenced this pull request Jun 8, 2025
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026