[MoE Refactor] Refactor ZeroExpertFusedMoE into new framework #35549
robertgshaw2-redhat merged 93 commits into vllm-project:main from
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request is a major and well-executed refactoring of the Mixture of Experts (MoE) implementation. It successfully removes the ZeroExpertFusedMoE class by introducing a more modular design with a new ZeroExpertRouter, a MoERunner abstraction, and a dedicated SharedExperts class. This significantly improves the structure and extensibility of the MoE framework. The changes are consistent across the codebase and are supported by a comprehensive new test suite for the zero-expert functionality. I've identified one critical issue in the new ChunkingMoERunner that could cause a crash when handling empty inputs, and I've provided a suggestion to fix it. Overall, this is an excellent refactoring effort.
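For context, a guard of the kind the review suggests might look like the sketch below. The class name `ChunkingMoERunner` comes from the review itself, but the constructor, `chunk_size` attribute, and `_run_chunk` helper are illustrative assumptions, not the PR's actual code:

```python
import torch

# Minimal sketch of the empty-input guard the review suggests.
# Internals (chunk_size, _run_chunk) are hypothetical, not vLLM's code.
class ChunkingMoERunner:
    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        num_tokens = hidden_states.shape[0]
        if num_tokens == 0:
            # Without this guard, the loop below produces zero chunks and
            # torch.cat([]) raises, crashing on empty inputs.
            return torch.empty_like(hidden_states)
        num_chunks = (num_tokens + self.chunk_size - 1) // self.chunk_size
        outputs = []
        for i in range(num_chunks):
            chunk = hidden_states[i * self.chunk_size:(i + 1) * self.chunk_size]
            outputs.append(self._run_chunk(chunk))
        return torch.cat(outputs, dim=0)

    def _run_chunk(self, chunk: torch.Tensor) -> torch.Tensor:
        # Placeholder for the fused-MoE kernel invocation on one chunk.
        return chunk
```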
There may be more activity in sglang... so...
Signed-off-by: Bill Nell <bnell@redhat.com>
Merged commit 19ec9a0 into vllm-project:main
…roject#35549) Signed-off-by: Bill Nell <bnell@redhat.com> Signed-off-by: zengxian <xiangdong.zeng@intel.com>
…stream regressions in HPU worker, MoE router, and offloading tests (#1354)

Fix three upstream regressions that break HPU unit tests.

## Changes

1. **`vllm_gaudi/v1/worker/hpu_worker.py`** — `compile_or_warm_up_model()` now returns a `CompilationTimes` NamedTuple instead of a plain `float`, matching the new upstream contract introduced in vllm-project/vllm#39240.
2. **`vllm_gaudi/ops/hpu_fused_moe.py`** — Add `zero_expert_type` and `num_logical_experts` parameters to the HPU override of `create_fused_moe_router()`, plus `ZeroExpertRouter` dispatch, matching the refactor in vllm-project/vllm#35549.
3. **`tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py`** — Remove `block_size` from `OffloadingEvent` constructor calls and update the assertion, matching the field removal in vllm-project/vllm#36644.

## Fixed tests

- `tests/unit_tests/lora/test_llama_tp.py::test_llama_lora`
- `tests/unit_tests/lora/test_llm_with_multi_loras.py::test_multiple_lora_requests`
- `tests/unit_tests/test_embedding.py::test_embeddings[intfloat/e5-mistral-7b-instruct]`
- `tests/unit_tests/ops/test_hpu_fused_moe.py::test_unquantized_fused_moe_method`
- `tests/unit_tests/ops/test_hpu_compressed_tensors.py::test_compressed_tensors_wna16_moe_method`
- `tests/unit_tests/ops/test_hpu_compressed_tensors.py::test_compressed_tensors_w8a8fp8_block_moe_method`
- `tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py::test_offloading_connector[True]`
- `tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py::test_offloading_connector[False]`

---

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
Signed-off-by: bmyrcha <bartosz.myrcha@intel.com>
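For intuition, the `ZeroExpertRouter` dispatch described in item 2 above could look roughly like the following. The factory name and the `zero_expert_type` / `num_logical_experts` parameters come from the commit message; the stub router classes and the dispatch body are assumptions, not `vllm_gaudi`'s actual code:

```python
from dataclasses import dataclass

@dataclass
class FusedMoERouter:
    # Minimal stub standing in for the base router; illustrative only.
    top_k: int
    num_experts: int

@dataclass
class ZeroExpertRouter(FusedMoERouter):
    # Zero experts are logical experts with no weights of their own.
    zero_expert_type: str | None = None
    num_logical_experts: int | None = None

def create_fused_moe_router(top_k: int, num_experts: int,
                            zero_expert_type: str | None = None,
                            num_logical_experts: int | None = None):
    # Dispatch to ZeroExpertRouter when zero experts are configured,
    # mirroring the change the commit message describes.
    if zero_expert_type is not None:
        return ZeroExpertRouter(top_k, num_experts,
                                zero_expert_type, num_logical_experts)
    return FusedMoERouter(top_k, num_experts)
```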
Purpose
Remove the `ZeroExpertFusedMoE` class and move its functionality into `FusedMoE`, `MoERunnerBase`, and the new `ZeroExpertRouter` classes. Based on #35326.
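For a rough sense of what the zero-expert path computes, here is a minimal sketch assuming an "identity"-style zero expert, where a token routed to an expert id at or beyond the physical expert count contributes its own hidden state scaled by the routing weight instead of passing through an expert MLP. The helper name and tensor layout are illustrative, not this PR's code:

```python
import torch

def apply_zero_experts(hidden_states: torch.Tensor,
                       topk_ids: torch.Tensor,
                       topk_weights: torch.Tensor,
                       num_physical_experts: int) -> torch.Tensor:
    """Illustrative: accumulate identity contributions for zero experts.

    Expert ids >= num_physical_experts denote zero experts, which have no
    weights; a token routed to one contributes routing_weight * token.
    """
    # (num_tokens, top_k) mask — True where the chosen expert is a zero expert.
    zero_mask = topk_ids >= num_physical_experts
    # Sum the routing weights of zero experts per token, then scale the token.
    zero_weight = (topk_weights * zero_mask).sum(dim=-1, keepdim=True)
    return hidden_states * zero_weight
```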
cc @baonudesifeizhai, @OftenDream, @yzong-rh
Test Plan
Added new tests for zero experts.
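A test along these lines might compare a naive per-token reference against the vectorized identity contribution — a hedged sketch under the same assumptions as the snippet above, not the actual tests added in this PR:

```python
import torch

def reference_zero_expert_output(hidden_states, topk_ids, topk_weights,
                                 num_physical_experts):
    # Naive per-token loop, used only as a reference for the vectorized path.
    out = torch.zeros_like(hidden_states)
    for t in range(hidden_states.shape[0]):
        for k in range(topk_ids.shape[1]):
            if topk_ids[t, k] >= num_physical_experts:
                out[t] += topk_weights[t, k] * hidden_states[t]
    return out

def test_zero_expert_identity_contribution():
    torch.manual_seed(0)
    num_tokens, hidden, top_k = 16, 32, 4
    num_physical, num_logical = 8, 12
    x = torch.randn(num_tokens, hidden)
    ids = torch.randint(0, num_logical, (num_tokens, top_k))
    w = torch.softmax(torch.randn(num_tokens, top_k), dim=-1)
    expected = reference_zero_expert_output(x, ids, w, num_physical)
    # Vectorized equivalent of the reference loop.
    zero_mask = ids >= num_physical
    actual = x * (w * zero_mask).sum(dim=-1, keepdim=True)
    torch.testing.assert_close(actual, expected)
```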
Test Result
cc @yzong-rh