[FIX] Add NO_MUL activation support for modular kernel path #31528
mgoin merged 12 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for "no_mul" activation functions (specifically silu_no_mul, gelu_no_mul, and relu2_no_mul) within the fused Mixture-of-Experts (MoE) layers. The changes include centralizing the definition of these activation names, adjusting workspace and cache dimensions from N // 2 to N to accommodate their different output size requirements (as they lack a gate/up split), and adding specific handling for relu2_no_mul in the modular_kernel.py's activation method, along with new assertions for input/output tensor sizes. The reviewer pointed out that the activation method in modular_kernel.py is missing implementations for silu_no_mul and gelu_no_mul, which would lead to a ValueError, and provided code to add torch.silu and torch.gelu calls for these activations.
💡 Codex Review
https://github.com/vllm-project/vllm/blob/a8c791bdd13fe6c85ac327c3e60ddc61299bae02/model_executor/layers/fused_moe/modular_kernel.py#L602-L605
Add missing silu_no_mul/gelu_no_mul activation handling
The new _no_mul sizing logic allows activations like silu_no_mul/gelu_no_mul to pass the shape checks, but the dispatch here only implements relu2_no_mul and otherwise raises ValueError. If a model selects silu_no_mul or gelu_no_mul on the modular kernel path (both are now exported in utils and supported in the non‑modular path), it will still crash at runtime. Consider adding explicit cases for those activations (e.g., F.silu/F.gelu) or restricting _no_mul handling to only relu2_no_mul to avoid the false promise of support.
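To make the suggested fix concrete, here is a pure-Python sketch of the dispatch the review asks for. This is illustrative only: in vLLM the element-wise ops would be `torch.nn.functional.silu` / `torch.nn.functional.gelu` plus a squared ReLU, and the function below stands in for the `activation` method rather than reproducing its real signature.

```python
import math

# Illustrative stand-in for the activation dispatch in modular_kernel.py.
# The string constants mirror the names this PR exports from utils; the
# element-wise helpers approximate the corresponding torch ops.

SILU_NO_MUL = "silu_no_mul"
GELU_NO_MUL = "gelu_no_mul"
RELU2_NO_MUL = "relu2_no_mul"

def _silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)

def _gelu(x: float) -> float:
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def _relu2(x: float) -> float:
    return max(x, 0.0) ** 2  # squared ReLU

def apply_no_mul_activation(activation: str, xs: list[float]) -> list[float]:
    # No gate/up split: the output has the same width as the input.
    if activation == SILU_NO_MUL:
        return [_silu(x) for x in xs]
    if activation == GELU_NO_MUL:
        return [_gelu(x) for x in xs]
    if activation == RELU2_NO_MUL:
        return [_relu2(x) for x in xs]
    raise ValueError(f"Unsupported no_mul activation: {activation}")
```

With explicit cases for all three names, the `ValueError` path only fires for genuinely unknown activations, rather than for ones the sizing logic already admitted.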
mgoin
left a comment
Looks good to me! Sorry for all the back and forth, appreciate the careful work @danielafrimi
This pull request has merge conflicts that must be resolved before it can be merged.
global_num_experts: int,
local_num_experts: int,
expert_tokens_meta: mk.ExpertTokensMetadata | None,
activation: str,
NaiveBatchedExperts ignores activation parameter causing incorrect dimension
High Severity
The NaiveBatchedExperts.workspace_shapes() method accepts the activation parameter but doesn't use it for workspace sizing. More critically, the apply() method hardcodes N = w1.size(1) // 2 which assumes gated activations where w1 has shape (E, 2*N, K). For *_no_mul activations where w1 has shape (E, N, K), this incorrectly computes N as half the expected value, causing buffer size mismatches and incorrect matrix operations when the activation is called on incorrectly-sized tensors.
Additional Locations (1)
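The fix for this class of bug is to derive `N` from the activation rather than hardcoding the halving. The PR names an `adjust_N_for_activation(N, activation)` helper; the sketch below is one plausible shape of it (names of the set and the second helper are assumptions, not the actual vLLM code):

```python
# Hypothetical sketch of activation-aware sizing; not the vLLM implementation.

NO_MUL_ACTIVATIONS = {"silu_no_mul", "gelu_no_mul", "relu2_no_mul"}

def adjust_N_for_activation(n: int, activation: str) -> int:
    # Gated activations consume a (gate, up) pair, so the activation output
    # is half the GEMM output width; *_no_mul activations keep the full width.
    return n if activation in NO_MUL_ACTIVATIONS else n // 2

def intermediate_N(w1_dim1: int, activation: str) -> int:
    # Replacement for the hardcoded `N = w1.size(1) // 2` flagged above:
    # for *_no_mul activations, w1 has shape (E, N, K) rather than (E, 2*N, K).
    return adjust_N_for_activation(w1_dim1, activation)
```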
FWIW, it did not work for ROCm after this PR. I had to make changes as in #32244 to get it working.
(vllm-project#31528) Signed-off-by: dafrimi <dafrimi@nvidia.com> Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local> Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: root <root@pool0-01777.cm.cluster> Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
This PR adds support for `*_no_mul` activations (e.g., `relu2_no_mul`) in the modular kernel MoE path (`TritonExperts`).

Problem

The modular kernel path assumed all activations use gate/up multiplication (like SiLU, GELU), where the output size is `N/2`. For `*_no_mul` activations, which apply the activation directly without gating, the output size should equal the input size (`N`). This caused assertion failures and buffer size mismatches.
Additional Changes

- `modular_kernel.py`: Updated the abstract `workspace_shapes()` signature and `_allocate_buffers()` to accept and pass the activation parameter
- All `FusedMoEPermuteExpertsUnpermute` implementations: Updated to accept the activation parameter and use activation-aware workspace sizing where applicable
Note

Introduces activation-aware handling for non-gated `*_no_mul` activations across modular-kernel MoE implementations.

- `workspace_shapes(..., activation)` added and plumbed through all MoE backends (Triton, CUTLASS, DeepGemm, FlashInfer, Marlin, batched/fallback/OSS)
- Replaces `N // 2` with `adjust_N_for_activation(N, activation)`; all intermediate/cache buffers allocate using activation-aware dims
- `utils.apply_moe_activation` unifies gated and no_mul activations; exposes `SILU_NO_MUL`, `GELU_NO_MUL`, `RELU2_NO_MUL`
- Adds `tests/kernels/moe/test_triton_moe_no_act_mul.py` validating shapes, execution, and `adjust_N_for_activation` behavior

Written by Cursor Bugbot for commit e5c5630.
*_no_mulactivations across modular-kernel MoE implementations.workspace_shapes(..., activation)added and plumbed through all MoE backends (Triton, CUTLASS, DeepGemm, FlashInfer, Marlin, batched/fallback/OSS)N // 2withadjust_N_for_activation(N, activation); all intermediate/cache buffers allocate using activation-aware dimsutils.apply_moe_activationunifies gated and no_mul activations; exposeSILU_NO_MUL,GELU_NO_MUL,RELU2_NO_MULtests/kernels/moe/test_triton_moe_no_act_mul.pyvalidating shapes, execution, andadjust_N_for_activationbehaviorWritten by Cursor Bugbot for commit e5c5630. This will update automatically on new commits. Configure here.