[BugFix][Router Replay] Capture Logical Experts with EPLB#33013
robertgshaw2-redhat merged 2 commits into vllm-project:main
Conversation
Code Review
This pull request effectively addresses the issue of capturing logical routed expert IDs for replay, especially when EPLB is enabled. The introduction of _select_experts_for_forward centralizes the expert selection logic for both chunked and non-chunked forward paths, which is a good refactoring that improves code clarity and maintainability. The lazy initialization of the routed-experts capturer is also a good practice. The changes appear correct and well-implemented. I don't have any concerns with this PR.
Pull request overview
This PR fixes a bug in the capture of routed experts for replay functionality. The main issue was that when Expert Parallelism Load Balancing (EPLB) is enabled, the system was capturing physical expert IDs after EPLB mapping instead of logical expert IDs before mapping, which caused incorrect replay behavior.
Changes:
- Introduces lazy binding for the routed-experts capturer to handle cases where the capturer is not available during initialization
- Adds a shared helper method `_select_experts_for_forward()` that captures logical expert IDs before EPLB mapping
- Refactors both chunked and non-chunked forward paths to use the new helper for consistent capture behavior
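The shape of the helper described above can be sketched roughly as follows. This is an illustrative simplification, not the actual vLLM implementation: `BaseRouterSketch`, `eplb_map`, and `capturer` are stand-in names, and real FusedMoE routing involves weight normalization and dispatch logic omitted here.

```python
import torch

class BaseRouterSketch:
    def __init__(self, eplb_map=None, capturer=None):
        self.eplb_map = eplb_map  # logical -> physical expert mapping (EPLB)
        self.capturer = capturer  # bound lazily; may still be None here

    def _select_experts_for_forward(self, router_logits, top_k):
        # 1. Select top-k *logical* expert IDs from the router logits.
        topk_weights, topk_ids = torch.topk(router_logits, k=top_k, dim=-1)
        # 2. Capture the logical IDs before any EPLB remapping, so replay
        #    sees the router's actual decision.
        if self.capturer is not None:
            self.capturer.capture(topk_ids)
        # 3. Only then remap logical -> physical IDs when EPLB is enabled.
        if self.eplb_map is not None:
            topk_ids = self.eplb_map[topk_ids]
        return topk_weights, topk_ids
```

Both the chunked and non-chunked forward paths would call this single helper, so the capture point cannot drift between them.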
…data with global layer indices (#5037) ### What does this PR do? DeepSeek-V3-style MoE employs a hybrid architecture with the first three layers as dense FFN blocks before switching to MoE layers, which means not every layer has a router. This PR fixes DeepSeek V3 architecture for router replay R3, as vLLM reports routed_experts across all transformer layers (including dense). Megatron only has routers for MoE layers. Mapping with i + offset silently shifts every MoE layer after a dense layer. So, when routed‑experts tensors include dense layers (full `num_layers`), we should map replay data by each router’s global layer_number; Otherwise, we should fall back to local offset indexing and validate bounds to catch mismatches. We also patch TopKRouter.set_layer_number to store the global layer number in each RouterReplay instance so global alignment is reliable with VPP/PP. Dependent on vllm-project/vllm#33013 ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. 
Without this fix: <img width="3646" height="1132" alt="image" src="https://github.com/user-attachments/assets/d2400f03-4e25-4f52-8717-a23b58cc23ce" /> With this fix, it looks good now: <img width="3668" height="1210" alt="image" src="https://github.com/user-attachments/assets/7a9b4818-861f-4a52-8c13-90e6ed6f9530" /> ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) 
- [X] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`. <sub>✨ Presented to you with <a href="https://macaron.im/mindlab">Mind Lab</a> - A Lab for Experiential Intelligence.</sub> Signed-off-by: Hollow Man <hollowman@opensuse.org>
Thanks for this PR! A quick question: can we move this functionality into the Router?
Thank you for your suggestions @robertgshaw2-redhat! I have refactored the fix accordingly and have tested it together with verl-project/verl#5093. Everything works fine!
Hi @HollowMan6, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Code Review
This pull request addresses a bug in capturing routed experts for replay, which was caused by incorrect timing of capturer initialization and capturing physical instead of logical expert IDs. The changes introduce a lazy binding mechanism for the RoutedExpertsCapturer and move the capture logic to the BaseRouter to ensure logical IDs are captured before EPLB mapping. This also centralizes the capture logic, making it consistent for both chunked and non-chunked forward paths. The implementation is clean and effectively resolves the described issues. The use of a default argument in the lambda to capture the layer ID is a good pattern to avoid late binding issues. Overall, this is a solid fix.
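The default-argument pattern the review refers to can be shown in isolation. A lambda created in a loop closes over the loop variable by reference, so every closure sees the final value; freezing the value as a default argument binds it at creation time. The names below are illustrative, not from the PR:

```python
# Late binding: every lambda reads layer_id when *called*, after the loop
# has finished, so all of them see the final value.
hooks_late = [lambda: layer_id for layer_id in range(3)]

# Default-argument binding: layer_id=layer_id evaluates the current value
# at lambda-creation time, so each hook keeps its own layer's ID.
hooks_bound = [lambda layer_id=layer_id: layer_id for layer_id in range(3)]

assert [h() for h in hooks_late] == [2, 2, 2]
assert [h() for h in hooks_bound] == [0, 1, 2]
```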
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
Finally, I've managed to fix the mypy issue 😅. This is now ready again @robertgshaw2-redhat
Thank you for making this simplification! I think the code quality looks great, and I will proceed with merging the PR. Separately, I'm wondering if there is any CI test coverage for this capture feature? It would be nice to have so we can ensure the correctness of the implementation over time. This can be done in a separate PR.
robertgshaw2-redhat left a comment:
Waiting until Bill's comment is addressed.
Head branch was pushed to by a user without write access
In the latest vLLM code, the routed experts capture logic is broken: `RoutedExpertsCapturer.create()` runs after model construction (during KV-cache init), but `FusedMoE` only binds `self.capture` in `__init__`. The result is that capture never happens and the routed-experts buffer remains all zeros. When EPLB is enabled, vLLM maps logical expert IDs to physical IDs, while Megatron expects logical IDs for replay, so capturing post-EPLB IDs breaks replay even if the values are non-zero. To fix these, lazily bind the routed-experts capturer in `FusedMoE` and capture logical top-k IDs pre-EPLB via `BaseRouter`; use a shared selection helper for chunked and non-chunked forward paths to keep capture consistent. Signed-off-by: Hollow Man <hollowman@opensuse.org>
I also added some test cases here; this PR should be ready for merge. cc: @robertgshaw2-redhat
…ct#33013) Signed-off-by: Hollow Man <hollowman@opensuse.org> Signed-off-by: Pai <416932041@qq.com>
Purpose
In the latest vLLM code, the routed experts capture logic is broken: `RoutedExpertsCapturer.create()` runs after model construction (during KV-cache init), but `FusedMoE` only binds `self.capture` in `__init__`. The result is that capture never happens and the routed-experts buffer remains all zeros.
When EPLB is enabled, vLLM maps logical expert IDs to physical IDs, while Megatron expects logical IDs for replay. Capturing post-EPLB IDs therefore breaks replay even if the values are non-zero.
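An illustrative example of why the two ID spaces differ. The mapping below is made up for demonstration; real EPLB mappings are per-layer tensors, and the names are hypothetical:

```python
# Hypothetical logical -> physical placement chosen by EPLB.
logical_to_physical = {0: 5, 1: 2, 2: 7, 3: 0}

logical_topk = [1, 3]  # the router's actual decision: what replay must store
physical_topk = [logical_to_physical[e] for e in logical_topk]

assert logical_topk == [1, 3]
assert physical_topk == [2, 0]  # a deployment detail, meaningless to Megatron
```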
To fix these, lazily bind the routed-experts capturer in `FusedMoE` and capture logical top-k IDs pre-EPLB via `BaseRouter`; use a shared selection helper for chunked and non-chunked forward paths to keep capture consistent.

Test Plan
Tested together with Verl router replay R3:
Test Result
Without this fix: (screenshot omitted)

With this fix, it looks good now: (screenshot omitted)