[MoE Refactor] EPLB refactoring for FusedMoE #41055
Signed-off-by: Bill Nell <bnell@redhat.com>
Code Review
This pull request introduces the `EplbManager` class to centralize Expert Parallelism Load Balancing (EPLB) logic, refactoring MoE layers and routers to delegate state management and weight collection. Feedback recommends using `p.detach()` instead of `p.data` for safer tensor access and replacing assertions with explicit `RuntimeError`s for critical validation to prevent issues in optimized Python environments.
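The second recommendation in the review summary can be illustrated with a minimal sketch. The function names and the expert-count check are hypothetical, chosen only for illustration; they are not taken from the PR. The point is that `assert` statements are removed when Python runs with `-O`, while an explicit `raise` always executes.

```python
def check_experts_with_assert(num_local: int, num_global: int) -> None:
    # Hypothetical check. Under `python -O` this statement is compiled out,
    # so invalid input would pass silently.
    assert num_local <= num_global, "local experts exceed global experts"


def check_experts_explicit(num_local: int, num_global: int) -> None:
    # Hypothetical check. This branch runs regardless of optimization flags.
    if num_local > num_global:
        raise RuntimeError(
            f"local experts ({num_local}) exceed global experts ({num_global})"
        )


if __name__ == "__main__":
    check_experts_explicit(2, 8)  # valid input, no error
    try:
        check_experts_explicit(9, 8)
    except RuntimeError as exc:
        print(f"rejected: {exc}")
```

The explicit form also lets the caller catch a specific exception type rather than `AssertionError`.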
https://github.com/neuralmagic/vllm/blob/2bc4adcc0e4d5a58cf2b69cab9d6126ba8882641/vllm/distributed/eplb/eplb_state.py#L642-L647
https://github.com/neuralmagic/vllm/blob/2bc4adcc0e4d5a58cf2b69cab9d6126ba8882641/vllm/model_executor/layers/quantization/modelopt.py#L1934-L1937
AI also found:
Not caused by this refactor but likely a bug: …
Instead of creating a …
Good point. I've redone the PR so that there's still only …
        )

-        if self.enable_eplb and not self.quant_method.supports_eplb:
+        if enable_eplb and not self.quant_method.supports_eplb:
Note to self: it seems weird that it would be the quant method that determines this.
Shouldn't it be the kernel?
Hi @bnellnm, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, Tip Is
…oject#41055 API

PR vllm-project#41055 ([MoE Refactor] EPLB refactoring for FusedMoE) removed the `enable_eplb` parameter from `BaseRouter.__init__`; the new API uses `eplb_state=None` (disabled) vs. populated `eplb_state` (enabled). Reverting vllm-project#39917 restored the pre-vllm-project#39917 test file that still passed `enable_eplb=False`, causing TypeError on import/instantiation.

Align the test helper with the current API: `_make_router` now takes an optional `eplb_state` (defaults to None), and the EPLB-enabled test builds a fully-populated state and passes it in.

Signed-off-by: Ao Shen <aoshen@inferact.ai>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
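The flag-to-optional-state change described in this commit message can be sketched as follows. `EplbStateSketch`, `RouterSketch`, and `make_router` are hypothetical stand-ins written for illustration; they do not reproduce vLLM's `BaseRouter` or its EPLB state class, and their fields are assumptions. The sketch shows only the pattern: EPLB is enabled exactly when a state object is supplied, and passing the removed `enable_eplb` keyword raises `TypeError`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EplbStateSketch:
    """Hypothetical stand-in for per-layer EPLB state (not vLLM's class)."""
    logical_to_physical: List[int] = field(default_factory=list)


class RouterSketch:
    """Hypothetical router: no `enable_eplb` flag, only an optional state."""

    def __init__(self, top_k: int,
                 eplb_state: Optional[EplbStateSketch] = None) -> None:
        self.top_k = top_k
        self.eplb_state = eplb_state

    @property
    def eplb_enabled(self) -> bool:
        # Enabled exactly when a state object was passed in.
        return self.eplb_state is not None


def make_router(eplb_state: Optional[EplbStateSketch] = None) -> RouterSketch:
    # Test-helper shape: the state defaults to None, meaning EPLB is disabled.
    return RouterSketch(top_k=2, eplb_state=eplb_state)


if __name__ == "__main__":
    print(make_router().eplb_enabled)
    print(make_router(EplbStateSketch([0, 1, 2])).eplb_enabled)
    try:
        RouterSketch(top_k=2, enable_eplb=False)  # the removed keyword
    except TypeError as exc:
        print(f"TypeError: {exc}")
```

Folding the flag into the optional argument removes the inconsistent combinations (flag set without state, or state without flag) that two separate parameters allow.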
### What this PR does / why we need it?

1. Fix vllm-project/vllm#33322: override `gpu_modelrunner.sync_and_gather_intermediate_tensors`; for the `pp+sp+tp` scenario, skip scattering the residual on Ascend.
2. vllm-project/vllm#35520: adapt to the `ModelRunner v2` modifications for hybrid attention at the interface level. Todo: add support for Mamba in ModelRunner on Ascend. Any pull request is welcome.
3. vllm-project/vllm#40711
4. vllm-project/vllm#42121
5. vllm-project/vllm#41706
6. vllm-project/vllm#39917: disable `async_schedule` when `enable_return_routed_experts=True`.
7. vllm-project/vllm#41046
8. vllm-project/vllm#41055
9. vllm-project/vllm#41035
10. vllm-project/vllm#42434

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.20.1
- vLLM main: vllm-project/vllm@c7aa186

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
…ltiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring (#1436)

Fix upstream regressions affecting hourly CI:

1. **MultiModelEngineClient**: Added missing `notify_kv_transfer_request_rejected` abstract method (upstream PR vllm-project/vllm#41269)
2. **Qwen3.5 test harness**: Updated `test_common.py` to read `enforce_eager` from model card config (with env var override), enabling per-model compilation control
3. **EPLB refactoring**: Removed `EMPTY_EPLB_STATE` import and `enable_eplb` parameter from `patched_create_fused_moe_router` after upstream MoE refactor (upstream PR vllm-project/vllm#41055)

Note: The `enforce_eager: true` workaround for Qwen3.5 compilation has been removed; the root cause (mamba_type str-vs-Enum comparison in hybrid cache allocation) is properly fixed by #1449, which should merge first.

Verified on HPU: unit tests pass on Gaudi 3 (MoE, FP8, compressed tensors).

---------

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
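The "model card config with env var override" behavior mentioned for the Qwen3.5 test harness could look roughly like the sketch below. This is not the code from `test_common.py`: the function name `resolve_enforce_eager`, the environment variable name `ENFORCE_EAGER`, the accepted truthy strings, and the default of `False` are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

# Assumed set of strings treated as "true" for the override.
_TRUTHY = {"1", "true", "yes", "on"}


def resolve_enforce_eager(
    model_card: Mapping[str, object],
    env: Optional[Mapping[str, str]] = None,
    env_var: str = "ENFORCE_EAGER",  # hypothetical variable name
) -> bool:
    """Return the env override if set, else the model card value."""
    env = os.environ if env is None else env
    override = env.get(env_var)
    if override is not None:
        # The environment variable wins over the per-model setting.
        return override.strip().lower() in _TRUTHY
    # Fall back to the model card; assume eager mode is off when unspecified.
    return bool(model_card.get("enforce_eager", False))


if __name__ == "__main__":
    card = {"enforce_eager": True}
    print(resolve_enforce_eager(card, env={}))
    print(resolve_enforce_eager(card, env={"ENFORCE_EAGER": "0"}))
```

Passing `env` explicitly keeps the helper testable without mutating the process environment.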
Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Purpose
- Use `eplb_state | None` instead of `enable_eplb` flag + `eplb_state` in `FusedMoE` and router classes.
- Add `set` method to `EplbLayerState`.

Test Plan
CI
Test Result
cc @yzong-rh