[BugFix][Router Replay] Capture Logical Experts with EPLB #33013

Merged

robertgshaw2-redhat merged 2 commits into vllm-project:main from HollowMan6:r3_replay on Jan 31, 2026

Conversation

@HollowMan6 (Contributor) commented Jan 24, 2026

Purpose

In the latest vLLM code, the routed-experts capture logic is broken: `RoutedExpertsCapturer.create()` runs after model construction (during KV-cache init), but `FusedMoE` only binds `self.capture` in `__init__`. The result is that capture never happens and the routed-experts buffer remains all zeros.

When EPLB is enabled, vLLM maps logical expert IDs to physical IDs, while Megatron expects logical IDs for replay. Capturing post-EPLB IDs therefore breaks replay even when the captured values are non-zero.

To fix both issues, lazily bind the routed-experts capturer in `FusedMoE` and capture logical top-k IDs pre-EPLB via `BaseRouter`; use a shared selection helper for the chunked and non-chunked forward paths to keep capture consistent. A sketch of the idea is shown below.
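As a rough illustration of that flow (not the actual vLLM implementation; the stub `RoutedExpertsCapturer`, its `get_instance()`/`capture()` methods, and the `select_experts()` signature are assumptions made for this sketch), lazy binding plus pre-EPLB capture could look like this:

```python
import torch


class RoutedExpertsCapturer:
    """Stub standing in for the real capturer, which is created only after
    model construction (during KV-cache init)."""

    _instance = None

    @classmethod
    def get_instance(cls):
        return cls._instance  # None until the engine creates the capturer

    def capture(self, layer_id: int, topk_ids: torch.Tensor) -> None:
        print(f"layer {layer_id}: captured logical experts {topk_ids.tolist()}")


class BaseRouter:
    def __init__(self, layer_id: int):
        self.layer_id = layer_id
        self._capturer = None  # not bound in __init__; resolved lazily

    def _get_capturer(self):
        # Lazy binding: look the capturer up on use, so it is found even
        # though it did not exist when this router was constructed.
        if self._capturer is None:
            self._capturer = RoutedExpertsCapturer.get_instance()
        return self._capturer

    def select_experts(self, router_logits: torch.Tensor, top_k: int):
        # Logical top-k selection happens here, before any EPLB remapping.
        topk_weights, topk_ids = torch.topk(router_logits, top_k, dim=-1)
        capturer = self._get_capturer()
        if capturer is not None:
            # Record *logical* IDs so Megatron-side replay sees the IDs it
            # expects, regardless of the EPLB physical expert layout.
            capturer.capture(self.layer_id, topk_ids)
        # EPLB (if enabled) would map logical -> physical IDs only after this.
        return topk_weights, topk_ids
```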

Test Plan

Tested together with verl router replay (R3).

Test Result

Without this fix: [screenshot]

With this fix, it looks good now: [screenshot]


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

✨ Presented to you with Mind Lab - A Lab for Experiential Intelligence.

Copilot AI review requested due to automatic review settings January 24, 2026 20:59
mergify Bot added the `bug` (Something isn't working) label on Jan 24, 2026
gemini-code-assist Bot left a comment


Code Review

This pull request effectively addresses the issue of capturing logical routed expert IDs for replay, especially when EPLB is enabled. The introduction of _select_experts_for_forward centralizes the expert selection logic for both chunked and non-chunked forward paths, which is a good refactoring that improves code clarity and maintainability. The lazy initialization of the routed-experts capturer is also a good practice. The changes appear correct and well-implemented. I don't have any concerns with this PR.

Copilot AI left a comment


Pull request overview

This PR fixes a bug in the capture of routed experts for replay functionality. The main issue was that when Expert Parallelism Load Balancing (EPLB) is enabled, the system was capturing physical expert IDs after EPLB mapping instead of logical expert IDs before mapping, which caused incorrect replay behavior.

Changes:

  • Introduces lazy binding for the routed-experts capturer to handle cases where the capturer is not available during initialization
  • Adds a shared helper method _select_experts_for_forward() that captures logical expert IDs before EPLB mapping
  • Refactors both chunked and non-chunked forward paths to use the new helper for consistent capture behavior
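
A toy sketch of that shared-helper pattern (names and signatures here are assumptions, not vLLM's actual `FusedMoE` API; `router` is any object exposing a `select_experts()` method, such as the stub shown earlier):

```python
import torch


class ToyMoELayer:
    """Illustrative only: both forward paths funnel through one helper, so
    logical expert IDs are selected (and captured) in exactly one place."""

    def __init__(self, router, top_k: int):
        self.router = router
        self.top_k = top_k

    def _select_experts_for_forward(self, router_logits: torch.Tensor):
        # Single selection point: capture happens inside the router, before
        # any EPLB logical->physical remapping further downstream.
        return self.router.select_experts(router_logits, self.top_k)

    def forward(self, hidden_states, router_logits):
        weights, ids = self._select_experts_for_forward(router_logits)
        return self._dispatch(hidden_states, weights, ids)

    def forward_chunked(self, hidden_states, router_logits, chunk_size: int):
        outs = []
        for h, r in zip(hidden_states.split(chunk_size), router_logits.split(chunk_size)):
            # Same helper per chunk, so the chunked path captures identically.
            weights, ids = self._select_experts_for_forward(r)
            outs.append(self._dispatch(h, weights, ids))
        return torch.cat(outs, dim=0)

    def _dispatch(self, hidden_states, weights, ids):
        # Placeholder for the actual expert computation.
        return hidden_states
```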


Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
wuxibin89 pushed a commit to verl-project/verl that referenced this pull request Jan 28, 2026
…data with global layer indices (#5037)

### What does this PR do?

DeepSeek-V3-style MoE employs a hybrid architecture with the first three
layers as dense FFN blocks before switching to MoE layers, which means
not every layer has a router.

This PR fixes DeepSeek V3 architecture for router replay R3, as vLLM
reports routed_experts across all transformer layers (including dense).
Megatron only has routers for MoE layers. Mapping with i + offset
silently shifts every MoE layer after a dense layer. So, when
routed‑experts tensors include dense layers (full `num_layers`), we
should map replay data by each router’s global layer_number; otherwise,
we should fall back to local offset indexing and validate bounds to
catch mismatches. We also patch TopKRouter.set_layer_number to store the
global layer number in each RouterReplay instance so global alignment is
reliable with VPP/PP.

Dependent on vllm-project/vllm#33013
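
The mapping rule described above, as a hedged standalone sketch (the function and variable names are invented for illustration; the actual verl/Megatron code differs):

```python
def replay_slice_for_router(routed_experts, router_layer_number: int,
                            num_layers: int, local_index: int, offset: int):
    """Pick the replay data for one MoE router.

    routed_experts: per-layer replay tensors reported by vLLM.
    router_layer_number: the router's global layer number (dense layers counted).
    num_layers: total transformer layers in the model.
    local_index / offset: fallback local indexing for the partial case.
    """
    if len(routed_experts) == num_layers:
        # Full per-layer data: dense layers are included, so global indexing
        # keeps every MoE layer aligned even after the leading dense blocks.
        return routed_experts[router_layer_number]
    # Partial data: fall back to local offset indexing, validating bounds so a
    # silent shift surfaces as an error instead of wrong replay data.
    idx = local_index + offset
    if not 0 <= idx < len(routed_experts):
        raise IndexError(
            f"replay index {idx} out of range for {len(routed_experts)} layers")
    return routed_experts[idx]
```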

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`,
`rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`,
`deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`,
`model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

Without this fix:
<img width="3646" height="1132" alt="image"
src="https://github.com/user-attachments/assets/d2400f03-4e25-4f52-8717-a23b58cc23ce"
/>

With this fix, it looks good now:
<img width="3668" height="1210" alt="image"
src="https://github.com/user-attachments/assets/7a9b4818-861f-4a52-8c13-90e6ed6f9530"
/>

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [X] If your PR is related to the `recipe` submodule, please also
update the reference to the submodule commit via `git submodule update
--remote` or `cd recipe && git pull origin main`.

<sub>✨ Presented to you with <a href="https://macaron.im/mindlab">Mind
Lab</a> - A Lab for Experiential Intelligence.</sub>

Signed-off-by: Hollow Man <hollowman@opensuse.org>
@robertgshaw2-redhat (Collaborator)

Thanks for this PR!

A quick question: can we move this functionality into the Router?

Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
@HollowMan6 (Contributor, Author)

Thank you for your suggestions, @robertgshaw2-redhat! I have refactored the fix accordingly and tested it together with verl-project/verl#5093. Everything works fine!

mergify Bot commented Jan 28, 2026

Hi @HollowMan6, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

gemini-code-assist Bot left a comment


Code Review

This pull request addresses a bug in capturing routed experts for replay, which was caused by incorrect timing of capturer initialization and capturing physical instead of logical expert IDs. The changes introduce a lazy binding mechanism for the RoutedExpertsCapturer and move the capture logic to the BaseRouter to ensure logical IDs are captured before EPLB mapping. This also centralizes the capture logic, making it consistent for both chunked and non-chunked forward paths. The implementation is clean and effectively resolves the described issues. The use of a default argument in the lambda to capture the layer ID is a good pattern to avoid late binding issues. Overall, this is a solid fix.
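
For reference, the late-binding pitfall that pattern avoids, in miniature (generic Python, not copied from any vLLM code):

```python
# Closures capture variables, not values: every lambda built in the loop sees
# the final layer_id unless it is frozen via a default argument.
callbacks_buggy = [lambda: layer_id for layer_id in range(3)]
print([cb() for cb in callbacks_buggy])   # [2, 2, 2] -- all see the last value

callbacks_fixed = [lambda lid=layer_id: lid for layer_id in range(3)]
print([cb() for cb in callbacks_fixed])   # [0, 1, 2] -- each keeps its own layer_id
```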

mergify Bot commented Jan 28, 2026

Hi @HollowMan6, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.



mergify Bot commented Jan 28, 2026

Hi @HollowMan6, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@HollowMan6 (Contributor, Author)

Finally, I've managed to fix the mypy issue 😅. This is now ready again, @robertgshaw2-redhat.

@robertgshaw2-redhat (Collaborator)

Thank you for making this simplification! I think the code quality looks great.

I will proceed with merging the PR.

Separately, I'm wondering if there is any CI test coverage for this capture feature? It would be nice to have so that we can ensure the correctness of the implementation over time.

This can be done in a separate PR.

robertgshaw2-redhat changed the title from "[BugFix] Capture logical routed experts reliably for replay" to "[BugFix][Router Replay] Capture Logical Experts with EPLB" on Jan 29, 2026
@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) January 29, 2026 18:40
github-actions Bot added the `ready` (ONLY add when PR is ready to merge/full CI is needed) label on Jan 29, 2026
Comment thread vllm/v1/worker/gpu_model_runner.py Outdated
robertgshaw2-redhat (Collaborator) left a comment


Waiting until bill's comment is addressed.

auto-merge was automatically disabled January 30, 2026 09:22

Head branch was pushed to by a user without write access

mergify Bot commented Jan 30, 2026

Hi @HollowMan6, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@HollowMan6 (Contributor, Author)

I also added some test cases here, so this PR should be ready to merge. cc: @robertgshaw2-redhat

@robertgshaw2-redhat robertgshaw2-redhat merged commit 13b842f into vllm-project:main Jan 31, 2026
49 checks passed
@HollowMan6 HollowMan6 deleted the r3_replay branch January 31, 2026 15:16
JacobHelwig pushed a commit to JacobHelwig/verl that referenced this pull request Feb 3, 2026
…data with global layer indices (verl-project#5037)

PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…ct#33013)

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Pai <416932041@qq.com>
DaizeDong pushed a commit to DaizeDong/verl that referenced this pull request Apr 19, 2026
…data with global layer indices (verl-project#5037)


Labels

bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed), v1
