[MoE Refactor] Move the shared/fused expert output sum into MoERunnerBase #35949

robertgshaw2-redhat merged 126 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a significant and well-designed refactoring of the Mixture-of-Experts (MoE) implementation. The introduction of the MoERunner abstraction effectively separates the MoE execution logic from the layer definition, improving modularity and maintainability. The changes across various model files to adapt to this new abstraction are consistent and simplify the model-specific code. The addition of support for zero experts and the corresponding tests are also valuable.
However, I've identified a critical issue in vllm/model_executor/models/transformers/moe.py where a removed parameter is still being passed, which will lead to a runtime error. Additionally, there's a potential regression in the same file regarding the detection of shared experts, which now relies solely on parameter names instead of also checking the model configuration. Please see the detailed comments for suggestions on how to address these issues.
Note: Security Review did not run due to the size of the PR.
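The second issue could be addressed along the lines of the sketch below. The names are illustrative, not the actual code in `vllm/model_executor/models/transformers/moe.py`, and the config attribute used is an assumption (it exists on Qwen2-MoE-style configs):

```python
# Hypothetical sketch of the reviewer's point: detect shared experts from the
# model config as well, not only from parameter names.
from typing import Any


def has_shared_experts(param_names: set[str], text_config: Any) -> bool:
    # Name-based detection alone is the potential regression the review flags.
    by_name = any("shared_expert" in name for name in param_names)
    # Config-based fallback: Qwen2-MoE-style configs expose
    # shared_expert_intermediate_size; treat a positive value as "present".
    by_config = getattr(text_config, "shared_expert_intermediate_size", 0) or 0
    return by_name or by_config > 0
```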
### What this PR does / why we need it?
For the fused MoE:
- vllm-project/vllm#33049 and vllm-project/vllm#35949: FusedMoE refactor.

For qwen3_vl:
- vllm-project/vllm#34539: a new Triton kernel was added for fast RoPE position encoding. A patch has been added to fall back to the native implementation; registering custom operators and an Ascend implementation will be considered later.
- vllm-project/vllm#38361

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: main (vllm-project/vllm@29e4870)

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: guxin108 <1252896542@qq.com>
…stream breakages: NIXL connector, TpKVTopology rename, MoE refactor, transformers v5 (#1377)

## Summary
Compatibility fixes for the vLLM bump to `3975eb6de6`. Addresses breakages from multiple upstream PRs affecting NIXL connectors, the MoE runner refactor, offloading tests, Qwen3 MoE models, and the transformers v5 upgrade.

## Root Cause
1. **NIXL import gate**: Upstream PR vllm-project/vllm#39529 (commit `cc3993b05d`) moved NIXL imports to `vllm/distributed/nixl_utils.py` and changed the platform gate from `if not is_rocm()` to `if is_cuda()`. HPU is neither CUDA nor ROCm, so it falls into the `else` branch → tries `rixl._api` (ROCm-only) → fails → `NixlWrapper = None` → `RuntimeError("NIXL is not available")`.
2. **TpKVTopology rename**: The same upstream PR #39529 unified `TpKVTopology` + `HeteroTPTransferConfig` into `TransferTopology`, breaking vllm-gaudi NIXL connector imports.
3. **Offloading tests**: Upstream PR vllm-project/vllm#36645 changed the `OffloadingManager.lookup()` API.
4. **MoE runner refactor**: Upstream PR vllm-project/vllm#35949 (commit `726efe177b`) moved the reduce logic into `MoERunnerBase`, removing `reduce_results` and renaming `forward_dispatch` → `_forward_dispatch`, `forward_entry` → `_forward_entry`, and `_maybe_reduce_output` → `_maybe_reduce_final_output`. A follow-up PR moved `MoERunnerBase` and `get_layer_from_name` to `moe_runner_base.py`.
5. **Qwen3 MoE**: `SharedFusedMoE` returns a combined tensor (not a tuple), and the MoE runner now handles TP reduction internally, causing a double reduce in `qwen3_moe.py` / `qwen3_next.py`.
6. **Transformers v5 (granite tokenizer)**: Upstream PR vllm-project/vllm#30566 updated transformers to allow v5. GPT2Tokenizer in v5 now respects `add_bos_token=True` (silently ignored in v4), causing degenerate outputs and 0.0 GSM8K accuracy on granite models.
7. **Transformers v5.6.x (DeepSeek-V2-Lite tokenizer)**: In transformers v5.6.x, `LlamaTokenizerFast` was unified into `LlamaTokenizer`, which does not apply the ByteLevel BPE decoder declared in `tokenizer.json`. DeepSeek-V2-Lite-Chat's tokenizer decoding strips all spaces (Ġ chars are not converted back), producing garbled output and 0.0 accuracy on GSM8K. Fixed natively in transformers v5.7.0.

## Fix
1. **NIXL import patch**: Add `patch_nixl_utils_for_hpu()` in `register_utils()` to monkey-patch `vllm.distributed.nixl_utils` so it imports from `nixl._api` instead of `rixl._api` on HPU. Update `hetero_hpu_nixl_connector.py` to import from `vllm.distributed.nixl_utils` instead of the hardcoded `nixl._api`.
2. **TpKVTopology → TransferTopology**: Rename in NIXL connector imports and monkey-patches.
3. **Offloading tests**: Replace `runner.manager.lookup.return_value` with `connector_scheduler._maximal_prefix_lookup`.
4. **MoE refactor**: Update imports (`MoERunnerBase` from `moe_runner_base`) and method names (`_forward_dispatch`, `_forward_entry`, `_maybe_reduce_final_output`); remove the dead `reduce_results` / `reduce_output()`.
5. **Qwen3 MoE**: Remove the incorrect shared_expert tuple indexing and the double TP reduction (see the sketch below).
6. **Transformers v5 (granite)**: Remove the hardcoded `add_bos_token=True` from the lm-eval model_args to fix the GSM8K accuracy regression.
7. **Transformers v5.6.x (DeepSeek-V2-Lite)**: Exclude `transformers 5.6.*` in `requirements.txt` to prevent installation of versions with broken ByteLevel BPE tokenizer decoding.

Verified on Gaudi2: gsm8k accuracy 0.65 (expected 0.66, within tolerance) with transformers 5.7.0.

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
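The Qwen3 MoE double-reduce fix (item 5 above) comes down to the pattern below. This is a hedged before/after illustration with hypothetical, simplified model code, not the actual vllm-gaudi diff:

```python
# After the refactor, SharedFusedMoE returns one combined tensor and the MoE
# runner performs the TP all-reduce itself, so model code must not reduce again.
import torch


class Qwen3MoeBlockSketch(torch.nn.Module):
    def __init__(self, experts: torch.nn.Module):
        super().__init__()
        self.experts = experts  # a SharedFusedMoE-style layer

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Before the refactor (now wrong):
        #   shared_out, routed_out = self.experts(hidden_states)  # no longer a tuple
        #   out = shared_out + routed_out
        #   out = tensor_model_parallel_all_reduce(out)           # double reduce
        # After: the layer returns the combined, already-reduced tensor.
        return self.experts(hidden_states)
```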
Purpose
Move the shared/fused expert output sum into MoERunnerBase and clean up the final shared/fused all-reduce code. Remove the corresponding code from models that use SharedFusedMoE and MoERunnerBase.
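As a rough illustration of where the sum now lives, here is a minimal sketch with hypothetical, simplified names; the real class has many more responsibilities, and this is not the actual vLLM code:

```python
# Sketch of the idea: the runner base, not each model, adds the shared-expert
# output to the routed-expert output and then performs the final TP all-reduce.
from typing import Optional

import torch


class MoERunnerBaseSketch:
    def __init__(self, tp_size: int = 1):
        self.tp_size = tp_size

    def _maybe_reduce_final_output(
        self,
        routed_output: torch.Tensor,
        shared_output: Optional[torch.Tensor],
    ) -> torch.Tensor:
        # Sum the shared ("fused") expert output here, so SharedFusedMoE
        # models no longer have to do it themselves.
        if shared_output is not None:
            routed_output = routed_output + shared_output
        if self.tp_size > 1:
            # In vLLM this would be a tensor-parallel all-reduce, e.g.
            # tensor_model_parallel_all_reduce(routed_output).
            torch.distributed.all_reduce(routed_output)
        return routed_output
```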
Test Plan

CI MoE refactor tests
Run gsm8k evals on the following models:
Test Result
lm-eval
RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8 with SP

main:
PR:
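For reference, an eval like the ones above can be reproduced roughly as follows. This is a sketch assuming the lm-eval harness's Python API and its vLLM backend; the exact arguments used for this PR's runs are not recorded here:

```python
# Rough reproduction sketch (assumed lm-eval-harness API and flags).
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8,"
        "tensor_parallel_size=2"  # assumed parallelism, not from the PR
    ),
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```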
cc @robertgshaw2-redhat, @tlrmchlsmth, @yzong-rh