[megatron] fix: support hybrid dense/MoE models in router replay with PP/VPP#5452
Conversation
Code Review
This pull request addresses a critical issue with router replay in hybrid dense/MoE models, particularly when pipeline parallelism is enabled. The changes correctly distinguish MoE layers from dense layers and adjust layer indexing and offset calculations accordingly. The refactoring to remove unnecessary layer number tracking is also a good cleanup. Overall, the changes are well-implemented and address the core problem described. I've identified one critical issue in a new helper function that could lead to a crash and have provided a suggestion to fix it.
```python
def is_moe_layer(tf_config, layer_idx):
    moe_layer_freq = getattr(tf_config, "moe_layer_freq", None)
    if isinstance(moe_layer_freq, int):
        return layer_idx % moe_layer_freq == 0
    elif isinstance(moe_layer_freq, list):
        return moe_layer_freq[layer_idx] == 1
    else:
        raise ValueError(f"Unsupported moe_layer_freq type: {type(moe_layer_freq)}")
```
The `is_moe_layer` function has two potential issues that could lead to runtime errors:

- Unset `moe_layer_freq`: The docstring for `get_moe_num_layers_to_build` specifies that if `moe_layer_freq` is unset, all layers should be treated as MoE layers. However, the current implementation `getattr(tf_config, "moe_layer_freq", None)` will cause `moe_layer_freq` to be `None`, which then leads to a `ValueError`. This will cause a crash when router replay is used with a model that doesn't explicitly define `moe_layer_freq`.
- `moe_layer_freq` is zero: If `moe_layer_freq` is set to `0`, the expression `layer_idx % moe_layer_freq` will raise a `ZeroDivisionError`.

To make the function more robust, it should default to `1` for an unset `moe_layer_freq` and explicitly handle non-positive values.
Suggested change:

```python
def is_moe_layer(tf_config, layer_idx):
    moe_layer_freq = getattr(tf_config, "moe_layer_freq", 1)
    if isinstance(moe_layer_freq, int):
        if moe_layer_freq <= 0:
            return False
        return layer_idx % moe_layer_freq == 0
    elif isinstance(moe_layer_freq, list):
        return moe_layer_freq[layer_idx] == 1
    else:
        raise ValueError(f"Unsupported moe_layer_freq type: {type(moe_layer_freq)}")
```
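A quick sanity check of the suggested helper. The `SimpleNamespace` stand-in for `tf_config` and the layer-0-dense pattern below are illustrative assumptions, not taken from a real model config:

```python
from types import SimpleNamespace

def is_moe_layer(tf_config, layer_idx):
    # Suggested fix: default to 1 (all layers MoE) and treat non-positive
    # frequencies as "no MoE layers" instead of raising ZeroDivisionError.
    moe_layer_freq = getattr(tf_config, "moe_layer_freq", 1)
    if isinstance(moe_layer_freq, int):
        if moe_layer_freq <= 0:
            return False
        return layer_idx % moe_layer_freq == 0
    elif isinstance(moe_layer_freq, list):
        return moe_layer_freq[layer_idx] == 1
    else:
        raise ValueError(f"Unsupported moe_layer_freq type: {type(moe_layer_freq)}")

# Unset: every layer is treated as MoE.
cfg = SimpleNamespace()
assert all(is_moe_layer(cfg, i) for i in range(4))

# Integer frequency 2: even-indexed layers are MoE.
cfg = SimpleNamespace(moe_layer_freq=2)
assert [is_moe_layer(cfg, i) for i in range(4)] == [True, False, True, False]

# Explicit 0/1 pattern: dense first layer, MoE afterwards.
cfg = SimpleNamespace(moe_layer_freq=[0, 1, 1, 1])
assert [is_moe_layer(cfg, i) for i in range(4)] == [False, True, True, True]

# Non-positive frequency no longer crashes.
cfg = SimpleNamespace(moe_layer_freq=0)
assert not is_moe_layer(cfg, 3)
```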
What does this PR do?
Router replay previously assumed all transformer layers are MoE layers, which caused incorrect layer indexing for hybrid models (e.g., models with both dense and MoE layers determined by moe_layer_freq).
This led to bugs when using pipeline parallelism (PP) and virtual pipeline parallelism (VPP), as layer offset calculations did not account for dense layers.
Although #5037 introduced the router replay mechanism by patching Megatron's TopKRouter, it did not fully handle hybrid (dense + MoE) models under VPP.
Specifically:
- Bug 1 — Incorrect VPP offset (root cause): In `verl/utils/megatron/router_replay_utils.py` (line 422 at commit c179476), `get_num_layers_to_build()` was used to compute the offset across prior VPP stages. This returns the count of all transformer layers (including dense layers), but `RouterReplay` instances only exist on MoE layers. For hybrid models this over-counts the offset, causing the wrong slice of router instances to be selected.
- Bug 2 — Replay data not set correctly (consequence): Because Bug 1 returns the wrong router instance list, the assignment in `verl/utils/megatron/router_replay_utils.py` (line 256 at commit c179476) either assigns `target_indices` to the wrong router or goes out of bounds, so replay data is never correctly dispatched to the corresponding MoE layers.

The same issue also exists in `pp_gather()`, where the VPP offset calculation must slice gathered data by MoE layer count rather than total layer count.
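As a minimal sketch of the offset fix (the layer layout and helper names below are hypothetical, not the actual Megatron VPP partitioning code): with a hybrid `moe_layer_freq` pattern, the offset into the per-rank router list must count only the MoE layers in prior virtual pipeline chunks.

```python
# Hypothetical layout: 8 layers split across 4 VPP chunks of 2 layers each,
# with a hybrid pattern where some layers are dense (moe_layer_freq as a
# 0/1 list, as in the PR description).
moe_pattern = [0, 1, 1, 1, 0, 1, 1, 1]  # 1 = MoE layer, 0 = dense
chunk_size = 2

def total_offset(chunk_idx):
    # Buggy behavior: counts every transformer layer in prior chunks,
    # like get_num_layers_to_build() does.
    return chunk_idx * chunk_size

def moe_offset(chunk_idx):
    # Fixed behavior: counts only MoE layers in prior chunks, since
    # RouterReplay instances exist only on MoE layers.
    return sum(moe_pattern[: chunk_idx * chunk_size])

# For chunk 2 (layers 4-5), the buggy offset over-counts by the number
# of dense layers seen so far (one here), selecting the wrong router slice.
print(total_offset(2), moe_offset(2))
```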
Key changes:
- Add `is_moe_layer()` and `get_moe_num_layers_to_build()` helpers to distinguish MoE layers from dense layers based on `moe_layer_freq`
- Rewrite `set_router_replay_data()` to correctly index router replay data by MoE-layer ordinal for R2 mode with mixed dense/MoE models
- Fix VPP offset calculation in `pp_gather()` and `RouterReplayHelper` to count only MoE layers instead of all transformer layers
- Remove unnecessary `layer_number` tracking from the `RouterReplay` patch to minimize intrusive changes to Megatron
Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

Design & Code Changes
> Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.