
[FIX] Add NO_MUL activation support for modular kernel path #31528

Merged
mgoin merged 12 commits into vllm-project:main from danielafrimi:fix_path_moe
Jan 12, 2026

Conversation

@danielafrimi
Contributor

@danielafrimi danielafrimi commented Dec 30, 2025

This PR adds support for *_no_mul activations (e.g., relu2_no_mul) in the modular
kernel MoE path (TritonExperts).

Problem

The modular kernel path assumed all activations use gate/up multiplication (like SiLU, GELU),
where the activation output size is N/2. For *_no_mul activations, which apply the activation
directly without gating, the output size must equal the input size (N). This caused assertion
failures and buffer size mismatches.
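The width difference between the two activation families can be illustrated with a minimal sketch (plain Python lists rather than tensors; the helper names here are made up for illustration):

```python
import math

def silu_and_mul(x):
    """Gated: split the row into [gate | up] halves and multiply,
    so the output has half the input width (N -> N/2)."""
    half = len(x) // 2
    return [g / (1.0 + math.exp(-g)) * u
            for g, u in zip(x[:half], x[half:])]

def relu2_no_mul(x):
    """Non-gated: applied elementwise, so the width is preserved (N -> N)."""
    return [max(v, 0.0) ** 2 for v in x]

assert len(silu_and_mul([1.0, 2.0, 3.0, 4.0])) == 2   # 4 -> 2
assert len(relu2_no_mul([1.0, -2.0, 3.0, 4.0])) == 4  # 4 -> 4
```

A buffer sized for N/2 under the gated assumption is therefore half as large as a *_no_mul activation actually needs.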

Additional Changes:
modular_kernel.py: updated the abstract workspace_shapes() signature and _allocate_buffers() to accept and pass an activation parameter
All FusedMoEPermuteExpertsUnpermute implementations: updated to accept the activation parameter and use activation-aware workspace sizing where applicable
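The shape of that plumbing change can be sketched as follows (a toy implementation with the signature reduced to the shape arguments that matter here; the real workspace_shapes() takes more parameters, and the class name is illustrative):

```python
class TritonExpertsSketch:
    def workspace_shapes(self, M: int, N: int, K: int, activation: str):
        # N is the per-token width of the w1 matmul output.
        # Gated activations halve it; *_no_mul activations keep it.
        act_N = N if activation.endswith("_no_mul") else N // 2
        workspace1 = (M, N)      # holds the w1 output
        workspace2 = (M, act_N)  # holds the activation result
        output = (M, K)          # final w2 output
        return workspace1, workspace2, output

sketch = TritonExpertsSketch()
assert sketch.workspace_shapes(8, 4096, 1024, "silu")[1] == (8, 2048)
assert sketch.workspace_shapes(8, 4096, 1024, "relu2_no_mul")[1] == (8, 4096)
```

Passing the activation name down lets each implementation size its activation result buffer without the hardcoded N // 2 assumption.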






Note

Introduces activation-aware handling for non-gated *_no_mul activations across modular-kernel MoE implementations.

  • API: workspace_shapes(..., activation) added and plumbed through all MoE backends (Triton, CUTLASS, DeepGemm, FlashInfer, Marlin, batched/fallback/OSS)
  • Sizing: replace hardcoded N // 2 with adjust_N_for_activation(N, activation); all intermediate/cache buffers allocate using activation-aware dims
  • Activation: new utils.apply_moe_activation unifies gated and no_mul activations; expose SILU_NO_MUL, GELU_NO_MUL, RELU2_NO_MUL
  • Implementation updates: buffer views and quant paths adjusted to use activation_out_dim; minor refactors (e.g., local quant_config var) where needed
  • Tests: add tests/kernels/moe/test_triton_moe_no_act_mul.py validating shapes, execution, and adjust_N_for_activation behavior

Written by Cursor Bugbot for commit e5c5630. This will update automatically on new commits.

Signed-off-by: dafrimi <dafrimi@nvidia.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for "no_mul" activation functions (specifically silu_no_mul, gelu_no_mul, and relu2_no_mul) within the fused Mixture-of-Experts (MoE) layers. The changes include centralizing the definition of these activation names, adjusting workspace and cache dimensions from N // 2 to N to accommodate their different output size requirements (as they lack a gate/up split), and adding specific handling for relu2_no_mul in modular_kernel.py's activation method, along with new assertions for input/output tensor sizes.

The reviewer pointed out that the activation method in modular_kernel.py was missing implementations for silu_no_mul and gelu_no_mul, which would lead to a ValueError, and provided code to add torch.silu and torch.gelu calls for these activations.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

https://github.com/vllm-project/vllm/blob/a8c791bdd13fe6c85ac327c3e60ddc61299bae02/model_executor/layers/fused_moe/modular_kernel.py#L602-L605
P2: Add missing silu_no_mul/gelu_no_mul activation handling

The new _no_mul sizing logic allows activations like silu_no_mul/gelu_no_mul to pass the shape checks, but the dispatch here only implements relu2_no_mul and otherwise raises ValueError. If a model selects silu_no_mul or gelu_no_mul on the modular kernel path (both are now exported in utils and supported in the non‑modular path), it will still crash at runtime. Consider adding explicit cases for those activations (e.g., F.silu/F.gelu) or restricting _no_mul handling to only relu2_no_mul to avoid the false promise of support.
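The dispatch the reviewer asks for could look roughly like this (plain-float stand-ins rather than tensors; the real code would call F.silu / F.gelu on tensors, and the function name here is hypothetical):

```python
import math

def activation_no_mul(activation: str, x: float) -> float:
    """Explicit branches for each non-gated activation, instead of
    handling only relu2_no_mul and raising ValueError otherwise."""
    if activation == "relu2_no_mul":
        return max(x, 0.0) ** 2
    if activation == "silu_no_mul":
        # silu(x) = x * sigmoid(x); stands in for F.silu
        return x / (1.0 + math.exp(-x))
    if activation == "gelu_no_mul":
        # erf form of GELU; stands in for F.gelu
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    raise ValueError(f"Unsupported activation: {activation}")

assert activation_no_mul("relu2_no_mul", -3.0) == 0.0
assert activation_no_mul("relu2_no_mul", 2.0) == 4.0
```

The alternative the review mentions is equally valid: restrict _no_mul sizing to relu2_no_mul so unsupported activations fail at shape-check time rather than deep in the compute path.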


Signed-off-by: dafrimi <dafrimi@nvidia.com>
Signed-off-by: dafrimi <dafrimi@nvidia.com>
Signed-off-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local>

@danielafrimi danielafrimi requested a review from tjtanaa as a code owner January 5, 2026 09:36
@mergify mergify bot added gpt-oss Related to GPT-OSS models nvidia rocm Related to AMD ROCm labels Jan 5, 2026
root added 2 commits January 7, 2026 01:09
Signed-off-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local>

fix
Signed-off-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local>

Member

@mgoin mgoin left a comment


Looks good to me! Sorry for all the back and forth, appreciate the careful work @danielafrimi

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 7, 2026
@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Jan 7, 2026
@mgoin mgoin enabled auto-merge (squash) January 7, 2026 20:26
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 7, 2026
@mergify

mergify bot commented Jan 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @danielafrimi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 8, 2026
root added 2 commits January 11, 2026 04:17
Signed-off-by: root <root@pool0-01777.cm.cluster>

Signed-off-by: root <root@pool0-01777.cm.cluster>

global_num_experts: int,
local_num_experts: int,
expert_tokens_meta: mk.ExpertTokensMetadata | None,
activation: str,


NaiveBatchedExperts ignores activation parameter causing incorrect dimension

High Severity

The NaiveBatchedExperts.workspace_shapes() method accepts the activation parameter but doesn't use it for workspace sizing. More critically, the apply() method hardcodes N = w1.size(1) // 2 which assumes gated activations where w1 has shape (E, 2*N, K). For *_no_mul activations where w1 has shape (E, N, K), this incorrectly computes N as half the expected value, causing buffer size mismatches and incorrect matrix operations when the activation is called on incorrectly-sized tensors.
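The failure mode can be reproduced in a minimal sketch (function names are illustrative, not the real vLLM helpers):

```python
def infer_N_hardcoded(w1_dim1: int) -> int:
    """The buggy inference: assumes w1 is (E, 2*N, K), i.e. gated."""
    return w1_dim1 // 2

def infer_N_activation_aware(w1_dim1: int, activation: str) -> int:
    """Activation-aware inference of the intermediate width N."""
    if activation.endswith("_no_mul"):
        return w1_dim1       # w1 is (E, N, K): no gate half to strip
    return w1_dim1 // 2      # w1 is (E, 2*N, K)

# For relu2_no_mul with w1 shaped (E, 2048, K), the hardcoded path
# computes N = 1024 and allocates half-sized buffers.
assert infer_N_hardcoded(2048) == 1024
assert infer_N_activation_aware(2048, "relu2_no_mul") == 2048
```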

Additional Locations (1)


@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Jan 12, 2026
@github-project-automation github-project-automation bot moved this from Done to Ready in NVIDIA Jan 12, 2026
@mgoin mgoin merged commit 3f72639 into vllm-project:main Jan 12, 2026
57 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 12, 2026
@rabi
Contributor

rabi commented Jan 13, 2026

FWIW, it did not work for ROCm after this PR. I had to do changes as in #32244 to make it work.

TomerBN-Nvidia pushed a commit to TomerBN-Nvidia/vllm that referenced this pull request Jan 13, 2026
…ject#31528)

Signed-off-by: dafrimi <dafrimi@nvidia.com>
Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: root <root@pool0-01777.cm.cluster>
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
sammysun0711 pushed a commit to sammysun0711/vllm that referenced this pull request Jan 16, 2026
…ject#31528)

Signed-off-by: dafrimi <dafrimi@nvidia.com>
Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: root <root@pool0-01777.cm.cluster>
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
…ject#31528)

Signed-off-by: dafrimi <dafrimi@nvidia.com>
Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: root <root@pool0-01777.cm.cluster>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…ject#31528)

Signed-off-by: dafrimi <dafrimi@nvidia.com>
Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: root <root@pool0-01777.cm.cluster>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…ject#31528)

Signed-off-by: dafrimi <dafrimi@nvidia.com>
Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: root <root@pool0-01777.cm.cluster>

Labels

gpt-oss Related to GPT-OSS models nvidia ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

Status: Done


4 participants