[FIX] Add NO_MUL activation support for modular kernel path #31528
mgoin merged 12 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for "no_mul" activation functions (specifically silu_no_mul, gelu_no_mul, and relu2_no_mul) within the fused Mixture-of-Experts (MoE) layers. The changes include centralizing the definition of these activation names, adjusting workspace and cache dimensions from N // 2 to N to accommodate their different output size requirements (as they lack a gate/up split), and adding specific handling for relu2_no_mul in the modular_kernel.py's activation method, along with new assertions for input/output tensor sizes. The reviewer pointed out that the activation method in modular_kernel.py is missing implementations for silu_no_mul and gelu_no_mul, which would lead to a ValueError, and provided code to add torch.silu and torch.gelu calls for these activations.
💡 Codex Review
https://github.com/vllm-project/vllm/blob/a8c791bdd13fe6c85ac327c3e60ddc61299bae02/model_executor/layers/fused_moe/modular_kernel.py#L602-L605
Add missing silu_no_mul/gelu_no_mul activation handling
The new _no_mul sizing logic allows activations like silu_no_mul/gelu_no_mul to pass the shape checks, but the dispatch here only implements relu2_no_mul and otherwise raises ValueError. If a model selects silu_no_mul or gelu_no_mul on the modular kernel path (both are now exported in utils and supported in the non‑modular path), it will still crash at runtime. Consider adding explicit cases for those activations (e.g., F.silu/F.gelu) or restricting _no_mul handling to only relu2_no_mul to avoid the false promise of support.
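To make the suggested fix concrete, here is a pure-Python sketch of the dispatch the review asks for. This is illustrative only: in vLLM the element-wise ops would be `torch.nn.functional.silu` / `torch.nn.functional.gelu` plus a squared ReLU, and the function below stands in for the `activation` method rather than reproducing its real signature.

```python
import math

# Illustrative stand-in for the activation dispatch in modular_kernel.py.
# The string constants mirror the names this PR exports from utils; the
# element-wise helpers approximate the corresponding torch ops.

SILU_NO_MUL = "silu_no_mul"
GELU_NO_MUL = "gelu_no_mul"
RELU2_NO_MUL = "relu2_no_mul"

def _silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)

def _gelu(x: float) -> float:
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def _relu2(x: float) -> float:
    return max(x, 0.0) ** 2  # squared ReLU

def apply_no_mul_activation(activation: str, xs: list[float]) -> list[float]:
    # No gate/up split: the output has the same width as the input.
    if activation == SILU_NO_MUL:
        return [_silu(x) for x in xs]
    if activation == GELU_NO_MUL:
        return [_gelu(x) for x in xs]
    if activation == RELU2_NO_MUL:
        return [_relu2(x) for x in xs]
    raise ValueError(f"Unsupported no_mul activation: {activation}")
```

With explicit cases for all three names, the `ValueError` path only fires for genuinely unknown activations, rather than for ones the sizing logic already admitted.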
mgoin
left a comment
Looks good to me! Sorry for all the back and forth, appreciate the careful work @danielafrimi
This pull request has merge conflicts that must be resolved before it can be merged.
global_num_experts: int,
local_num_experts: int,
expert_tokens_meta: mk.ExpertTokensMetadata | None,
activation: str,
NaiveBatchedExperts ignores activation parameter causing incorrect dimension
High Severity
The NaiveBatchedExperts.workspace_shapes() method accepts the activation parameter but doesn't use it for workspace sizing. More critically, the apply() method hardcodes N = w1.size(1) // 2 which assumes gated activations where w1 has shape (E, 2*N, K). For *_no_mul activations where w1 has shape (E, N, K), this incorrectly computes N as half the expected value, causing buffer size mismatches and incorrect matrix operations when the activation is called on incorrectly-sized tensors.
Additional Locations (1)
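The fix for this class of bug is to derive `N` from the activation rather than hardcoding the halving. The PR names an `adjust_N_for_activation(N, activation)` helper; the sketch below is one plausible shape of it (names of the set and the second helper are assumptions, not the actual vLLM code):

```python
# Hypothetical sketch of activation-aware sizing; not the vLLM implementation.

NO_MUL_ACTIVATIONS = {"silu_no_mul", "gelu_no_mul", "relu2_no_mul"}

def adjust_N_for_activation(n: int, activation: str) -> int:
    # Gated activations consume a (gate, up) pair, so the activation output
    # is half the GEMM output width; *_no_mul activations keep the full width.
    return n if activation in NO_MUL_ACTIVATIONS else n // 2

def intermediate_N(w1_dim1: int, activation: str) -> int:
    # Replacement for the hardcoded `N = w1.size(1) // 2` flagged above:
    # for *_no_mul activations, w1 has shape (E, N, K) rather than (E, 2*N, K).
    return adjust_N_for_activation(w1_dim1, activation)
```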
FWIW, it did not work for ROCm after this PR. I had to make changes as in #32244 to get it working.
(vllm-project#31528) Signed-off-by: dafrimi <dafrimi@nvidia.com> Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local> Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: root <root@pool0-01777.cm.cluster> Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
This PR adds support for `*_no_mul` activations (e.g., `relu2_no_mul`) in the modular kernel MoE path (`TritonExperts`).

Problem

The modular kernel path assumed all activations use gate/up multiplication (like SiLU, GELU), where the output size is `N/2`. For `*_no_mul` activations, which apply the activation directly without gating, the output size should equal the input size (`N`). This caused assertion failures and buffer size mismatches.
Additional Changes

- `modular_kernel.py`: Updated the abstract `workspace_shapes()` signature and `_allocate_buffers()` to accept and pass the activation parameter
- All `FusedMoEPermuteExpertsUnpermute` implementations: Updated to accept the activation parameter and use activation-aware workspace sizing where applicable
Note

Introduces activation-aware handling for non-gated `*_no_mul` activations across modular-kernel MoE implementations.

- `workspace_shapes(..., activation)` added and plumbed through all MoE backends (Triton, CUTLASS, DeepGemm, FlashInfer, Marlin, batched/fallback/OSS)
- Replaces `N // 2` with `adjust_N_for_activation(N, activation)`; all intermediate/cache buffers allocate using activation-aware dims
- `utils.apply_moe_activation` unifies gated and no_mul activations; exposes `SILU_NO_MUL`, `GELU_NO_MUL`, `RELU2_NO_MUL`
- Adds `tests/kernels/moe/test_triton_moe_no_act_mul.py` validating shapes, execution, and `adjust_N_for_activation` behavior

Written by Cursor Bugbot for commit e5c5630.
*_no_mulactivations across modular-kernel MoE implementations.workspace_shapes(..., activation)added and plumbed through all MoE backends (Triton, CUTLASS, DeepGemm, FlashInfer, Marlin, batched/fallback/OSS)N // 2withadjust_N_for_activation(N, activation); all intermediate/cache buffers allocate using activation-aware dimsutils.apply_moe_activationunifies gated and no_mul activations; exposeSILU_NO_MUL,GELU_NO_MUL,RELU2_NO_MULtests/kernels/moe/test_triton_moe_no_act_mul.pyvalidating shapes, execution, andadjust_N_for_activationbehaviorWritten by Cursor Bugbot for commit e5c5630. This will update automatically on new commits. Configure here.