[MoE Refactor] Split of DefaultMoERunner class#35326
robertgshaw2-redhat merged 54 commits into vllm-project:main
Conversation
Code Review
This pull request is a significant and well-executed refactoring of the MoE runner logic. It splits the monolithic DefaultMoERunner into a MoERunnerBase, a ChunkingMoERunner for DP chunking, and a DefaultMoERunner for the non-chunked path. Additionally, it introduces a SharedExperts class to encapsulate the logic for shared experts, which greatly improves code organization and clarity. The overall changes make the MoE execution path more modular and easier to understand.
I've found one critical issue in the new ChunkingMoERunner related to incorrect index clamping when handling chunks, which could lead to runtime errors. I've provided a suggestion to fix this. Apart from that, the refactoring is excellent.
chunk_start = min(chunk_start, num_tokens - 1)
chunk_end = min(chunk_end, num_tokens)
The clamping logic for chunk_start is incorrect when num_tokens is 0. num_tokens - 1 becomes -1, which can lead to incorrect slicing and potential errors. For example, if num_tokens is 0, chunk_start becomes -1, and slice_size in _slice_and_copy_input becomes 1, while the sliced orig_slice is empty, causing a shape mismatch error in copy_.
The clamping should be against num_tokens instead of num_tokens - 1 to correctly handle empty chunks.
- chunk_start = min(chunk_start, num_tokens - 1)
- chunk_end = min(chunk_end, num_tokens)
+ chunk_start = min(chunk_start, num_tokens)
+ chunk_end = min(chunk_end, num_tokens)
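For reference, a minimal standalone sketch of the failure mode described in the comment above (the helper and tensor shapes here are illustrative, not the actual `_slice_and_copy_input` implementation):

```python
import torch

def slice_and_copy(buf: torch.Tensor, orig: torch.Tensor, chunk_start: int, chunk_end: int):
    num_tokens = orig.shape[0]
    # Clamping against num_tokens - 1 yields chunk_start == -1 when num_tokens == 0,
    # so the destination slice has 1 row while the source slice is empty.
    chunk_start = min(chunk_start, num_tokens - 1)
    chunk_end = min(chunk_end, num_tokens)
    slice_size = chunk_end - chunk_start             # 1 instead of 0 for an empty input
    out_slice = buf[:slice_size]                     # shape (1, hidden)
    orig_slice = orig[chunk_start:chunk_end]         # shape (0, hidden)
    out_slice.copy_(orig_slice, non_blocking=True)   # shape mismatch -> RuntimeError
    return out_slice

buf = torch.empty(8, 4)
slice_and_copy(buf, torch.empty(0, 4), 0, 8)  # fails; clamping chunk_start to num_tokens avoids it
```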
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Hi @bnellnm, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Bill Nell <bnell@redhat.com>
assert shared_experts_input is not None
self._shared_experts.apply(shared_experts_input, order)

def _apply_quant_method(
Note: in a follow-up, we should change this name. It's not really applying the quant method; if anything, it's running the whole layer.
# Needed for string -> FusedMoE layer lookup in custom ops.
self.layer_name = layer_name

self.forward_entry = self._select_forward()
In a follow-up, changing this name to "custom_op_entry" or something like that would make things a lot clearer.
out_slice.copy_(orig_slice, non_blocking=True)
return out_slice

def _forward_impl(
Is it really needed to chunk the shared expert? I actually don't quite remember why this was required in the first place. I think we could simplify things a lot if we only chunked the grouped expert.
Shouldn't be done in this PR, but something to consider for the follow-up.
It makes things easier if it is chunked, since the modular kernel (MK) naturally chunks it if it is the one executing it.
logits_shape = (moe.max_num_tokens, self.moe_config.num_logical_experts)

# Does this need some kind of profiling run check like modular_kernel.py?
return current_workspace_manager().get_simultaneous(
Why is this current_workspace_manager introduced?
I didn't see it in the previous implementation.
Previously, it was:

device = torch.accelerator.current_device_index()
self.batched_hidden_states = torch.zeros(
    states_shape,
    dtype=moe.in_dtype,
    device=device,
)
self.batched_router_logits = torch.zeros(
    logits_shape,
    dtype=moe.router_logits_dtype,
    device=device,
)
Yeah, the workspace is now shared among all layers. Previously there were separate buffers for each layer.
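A rough sketch of the idea (the manager class below is hypothetical; only the `get_simultaneous` call appears in the diff above): a single backing buffer is grown to the largest request seen and sliced into per-shape views, instead of each layer zero-allocating its own `batched_hidden_states` / `batched_router_logits`.

```python
import torch

class SharedWorkspace:
    """Hypothetical sketch of a process-wide workspace shared by all MoE layers."""

    def __init__(self) -> None:
        self._buf: torch.Tensor | None = None

    def get_simultaneous(self, *shapes, dtype=torch.bfloat16, device="cpu"):
        # Total elements needed for all requested tensors at once.
        numel = sum(int(torch.Size(s).numel()) for s in shapes)
        # Grow (never shrink) the single backing buffer.
        if self._buf is None or self._buf.numel() < numel or self._buf.dtype != dtype:
            self._buf = torch.empty(numel, dtype=dtype, device=device)
        # Hand out non-overlapping views instead of fresh per-layer allocations.
        views, offset = [], 0
        for s in shapes:
            n = int(torch.Size(s).numel())
            views.append(self._buf[offset:offset + n].view(s))
            offset += n
        return views

_workspace = SharedWorkspace()

def current_workspace_manager() -> SharedWorkspace:
    return _workspace

# Every layer shares the same memory:
hidden, logits = current_workspace_manager().get_simultaneous((64, 128), (64, 8))
```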
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Bill Nell <bnell@redhat.com>
robertgshaw2-redhat merged commit 93bada4 into vllm-project:main
Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
…xtral, MoE and Granite regressions (#1311)

## Summary

This PR fixes a set of regressions introduced by recent upstream changes and observed in vLLM-Gaudi hourly validation. The branch now includes:

- Pixtral HPUAttention projection path fix,
- MoE dispatch and method override alignment updates for fused MoE and compressed tensors,
- unit test updates to match the new MoE runner API usage,
- fix hybrid model page size alignment for Granite 4.0-H.

## Related upstream PRs that introduced the regressions

- vllm-project/vllm#37234
- vllm-project/vllm#35153
- vllm-project/vllm#36963
- vllm-project/vllm#38960
- vllm-project/vllm#35326
- vllm-project/vllm#37467

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
if self.gate is not None:
    router_logits, _ = self.gate(hidden_states)
Will this cause the gate to be called twice? E.g., qwen3_moe calculates the gate in the modeling file: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/qwen3_moe.py#L235-L237
It does look like it might be getting called twice in qwen3_moe.py. If the gate is passed to FusedMoE, then the model should check whether or not to run it, e.g. see the `if self.experts.is_internal_router` check in deepseek_v2.py. This behavior isn't new, so the models need to decide whether they want FusedMoE to handle it or not.
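A simplified sketch of the pattern being described (not the actual model code; the attribute name follows the `is_internal_router` check mentioned above):

```python
import torch

def moe_block_forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    # If FusedMoE owns the gate (internal router), skip the model-side gate call
    # so the router logits are not computed twice.
    if self.experts.is_internal_router:
        router_logits = None
    else:
        router_logits, _ = self.gate(hidden_states)
    return self.experts(hidden_states=hidden_states, router_logits=router_logits)
```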
We should fix this urgently.
Purpose
Split `DefaultMoERunner` into two more classes:

- `MoERunnerBase`, which contains common code for all runners.
- `ChunkingMoERunner`, a runner that handles DP chunking. Inherits from `MoERunnerBase`. Acts as a wrapper around another `MoERunner` instance in order to be usable with any runner type.
- `DefaultMoERunner`, which inherits from `MoERunnerBase` and only handles the non-chunked/naive execution path.

In `ChunkingMoERunner`, allocate the backing buffers for chunking using the `workspace` class so there won't be one set of buffers per layer.

Based off #35153
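A structural sketch of the split described above (method names and signatures are illustrative, not the actual vLLM API):

```python
import torch

class MoERunnerBase:
    """Common state and helpers shared by all runners."""

    def run(self, hidden_states: torch.Tensor, router_logits: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class DefaultMoERunner(MoERunnerBase):
    """Non-chunked / naive execution path."""

    def __init__(self, quant_method):
        self.quant_method = quant_method

    def run(self, hidden_states, router_logits):
        return self.quant_method.apply(hidden_states, router_logits)


class ChunkingMoERunner(MoERunnerBase):
    """Handles DP chunking by wrapping any other runner and feeding it
    fixed-size slices, so it composes with every runner type."""

    def __init__(self, inner: MoERunnerBase, max_chunk_tokens: int):
        self.inner = inner
        self.max_chunk_tokens = max_chunk_tokens

    def run(self, hidden_states, router_logits):
        num_tokens = hidden_states.shape[0]
        out = torch.empty_like(hidden_states)
        for start in range(0, num_tokens, self.max_chunk_tokens):
            end = min(start + self.max_chunk_tokens, num_tokens)
            out[start:end] = self.inner.run(hidden_states[start:end], router_logits[start:end])
        return out
```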
Test Plan
tests/kernels/moe
tests/lora/{test_gptoss_tp.py, test_olmoe.py}
Hand testing of Deepseek and other models w/EP+DP.
MoE refactor CI tests
Test Result
cc @robertgshaw2-redhat , @yzong-rh