[megatron] fix: support hybrid dense/MoE models in router replay with PP/VPP#5452
Conversation
Code Review
This pull request addresses a critical issue with router replay in hybrid dense/MoE models, particularly when pipeline parallelism is enabled. The changes correctly distinguish MoE layers from dense layers and adjust layer indexing and offset calculations accordingly. The refactoring to remove unnecessary layer number tracking is also a good cleanup. Overall, the changes are well-implemented and address the core problem described. I've identified one critical issue in a new helper function that could lead to a crash and have provided a suggestion to fix it.
```python
def is_moe_layer(tf_config, layer_idx):
    moe_layer_freq = getattr(tf_config, "moe_layer_freq", None)
    if isinstance(moe_layer_freq, int):
        return layer_idx % moe_layer_freq == 0
    elif isinstance(moe_layer_freq, list):
        return moe_layer_freq[layer_idx] == 1
    else:
        raise ValueError(f"Unsupported moe_layer_freq type: {type(moe_layer_freq)}")
```
The `is_moe_layer` function has two potential issues that could lead to runtime errors:

- Unset `moe_layer_freq`: The docstring for `get_moe_num_layers_to_build` specifies that if `moe_layer_freq` is unset, all layers should be treated as MoE layers. However, the current implementation `getattr(tf_config, "moe_layer_freq", None)` will cause `moe_layer_freq` to be `None`, which then leads to a `ValueError`. This will cause a crash when router replay is used with a model that doesn't explicitly define `moe_layer_freq`.
- `moe_layer_freq` is zero: If `moe_layer_freq` is set to `0`, the expression `layer_idx % moe_layer_freq` will raise a `ZeroDivisionError`.

To make the function more robust, it should default to `1` for an unset `moe_layer_freq` and explicitly handle non-positive values.
Suggested change:

```python
def is_moe_layer(tf_config, layer_idx):
    moe_layer_freq = getattr(tf_config, "moe_layer_freq", 1)
    if isinstance(moe_layer_freq, int):
        if moe_layer_freq <= 0:
            return False
        return layer_idx % moe_layer_freq == 0
    elif isinstance(moe_layer_freq, list):
        return moe_layer_freq[layer_idx] == 1
    else:
        raise ValueError(f"Unsupported moe_layer_freq type: {type(moe_layer_freq)}")
```
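A quick sanity check of the suggested helper. The `SimpleNamespace` stand-in for `tf_config` and the layer-0-dense pattern below are illustrative assumptions, not taken from a real model config:

```python
from types import SimpleNamespace

def is_moe_layer(tf_config, layer_idx):
    # Suggested fix: default to 1 (all layers MoE) and treat non-positive
    # frequencies as "no MoE layers" instead of raising ZeroDivisionError.
    moe_layer_freq = getattr(tf_config, "moe_layer_freq", 1)
    if isinstance(moe_layer_freq, int):
        if moe_layer_freq <= 0:
            return False
        return layer_idx % moe_layer_freq == 0
    elif isinstance(moe_layer_freq, list):
        return moe_layer_freq[layer_idx] == 1
    else:
        raise ValueError(f"Unsupported moe_layer_freq type: {type(moe_layer_freq)}")

# Unset: every layer is treated as MoE.
cfg = SimpleNamespace()
assert all(is_moe_layer(cfg, i) for i in range(4))

# Integer frequency 2: even-indexed layers are MoE.
cfg = SimpleNamespace(moe_layer_freq=2)
assert [is_moe_layer(cfg, i) for i in range(4)] == [True, False, True, False]

# Explicit 0/1 pattern: dense first layer, MoE afterwards.
cfg = SimpleNamespace(moe_layer_freq=[0, 1, 1, 1])
assert [is_moe_layer(cfg, i) for i in range(4)] == [False, True, True, True]

# Non-positive frequency no longer crashes.
cfg = SimpleNamespace(moe_layer_freq=0)
assert not is_moe_layer(cfg, 3)
```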
What does this PR do?
Router replay previously assumed all transformer layers are MoE layers, which caused incorrect layer indexing for hybrid models (e.g., models with both dense and MoE layers determined by moe_layer_freq).
This led to bugs when using pipeline parallelism (PP) and virtual pipeline parallelism (VPP), as layer offset calculations did not account for dense layers.
Although #5037 introduced the router replay mechanism by patching Megatron's TopKRouter, it did not fully handle hybrid (dense + MoE) models under VPP.
Specifically:
- Bug 1 — Incorrect VPP offset (root cause): In `verl/utils/megatron/router_replay_utils.py` (line 422 at commit c179476), `get_num_layers_to_build()` was used to compute the offset across prior VPP stages. This returns the count of all transformer layers (including dense layers), but `RouterReplay` instances only exist on MoE layers. For hybrid models this over-counts the offset, causing the wrong slice of router instances to be selected.
- Bug 2 — Replay data not set correctly (consequence): Because Bug 1 returns the wrong router instance list, the assignment in `verl/utils/megatron/router_replay_utils.py` (line 256 at commit c179476) either assigns `target_indices` to the wrong router or goes out of bounds, so replay data is never correctly dispatched to the corresponding MoE layers.

The same issue also exists in `pp_gather()`, where the VPP offset calculation must slice gathered data by MoE layer count rather than total layer count.
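As a minimal sketch of the offset fix (the layer layout and helper names below are hypothetical, not the actual Megatron VPP partitioning code): with a hybrid `moe_layer_freq` pattern, the offset into the per-rank router list must count only the MoE layers in prior virtual pipeline chunks.

```python
# Hypothetical layout: 8 layers split across 4 VPP chunks of 2 layers each,
# with a hybrid pattern where some layers are dense (moe_layer_freq as a
# 0/1 list, as in the PR description).
moe_pattern = [0, 1, 1, 1, 0, 1, 1, 1]  # 1 = MoE layer, 0 = dense
chunk_size = 2

def total_offset(chunk_idx):
    # Buggy behavior: counts every transformer layer in prior chunks,
    # like get_num_layers_to_build() does.
    return chunk_idx * chunk_size

def moe_offset(chunk_idx):
    # Fixed behavior: counts only MoE layers in prior chunks, since
    # RouterReplay instances exist only on MoE layers.
    return sum(moe_pattern[: chunk_idx * chunk_size])

# For chunk 2 (layers 4-5), the buggy offset over-counts by the number
# of dense layers seen so far (one here), selecting the wrong router slice.
print(total_offset(2), moe_offset(2))
```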
Key changes:
- Add `is_moe_layer()` and `get_moe_num_layers_to_build()` helpers to distinguish MoE layers from dense layers based on `moe_layer_freq`
- Rewrite `set_router_replay_data()` to correctly index router replay data by MoE-layer ordinal for R2 mode with mixed dense/MoE models
- Fix VPP offset calculation in `pp_gather()` and `RouterReplayHelper` to count only MoE layers instead of all transformer layers
- Remove unnecessary `layer_number` tracking from the `RouterReplay` patch to minimize intrusive changes to Megatron
Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

Design & Code Changes
> Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.