[MoE] Deprecate act_and_mul_triton; fold filter_expert into JIT silu/gelu_and_mul #23707

Merged

ch-wan merged 4 commits into main from deprecate-act-and-mul-triton on Apr 26, 2026
Conversation

@ch-wan (Collaborator) commented Apr 25, 2026

Motivation

act_and_mul_triton (in fused_moe_triton_kernels.py) duplicates silu_and_mul / gelu_and_mul. The only difference is that it skips rows whose routed expert id is -1 (the filter_expert=True MoE path used under EP). The JIT CUDA silu_and_mul / gelu_and_mul kernels already exist and are faster — the consolidation removes ~100 lines of Triton and a redundant kernel.

Modifications

JIT activation kernel (CUDA)

  • python/sglang/jit_kernel/csrc/elementwise/activation.cuh: added expert_ids (const int32_t*) and expert_step (uint32_t) to ActivationParams; added a compile-time kFilterExpert template bool to act_and_mul_kernel (zero overhead when off — if constexpr skips the load); exposed run_activation_filtered host method.
  • python/sglang/jit_kernel/activation.py: silu_and_mul / gelu_and_mul / gelu_tanh_and_mul / run_activation now accept optional expert_ids and expert_step kwargs. Existing call sites are unchanged.
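
For orientation, a minimal sketch of how the extended wrapper can be called. The keyword names come from this PR; the positional (input, output) call shape and the shapes below are assumptions for illustration:

```python
import torch
from sglang.jit_kernel.activation import silu_and_mul

num_tokens, inter_dim = 64, 4096
x = torch.randn(num_tokens, 2 * inter_dim, device="cuda", dtype=torch.bfloat16)
out = torch.empty(num_tokens, inter_dim, device="cuda", dtype=torch.bfloat16)

# One routed expert id per row (expert_step=1); -1 marks a row to skip.
expert_ids = torch.randint(-1, 8, (num_tokens,), device="cuda", dtype=torch.int32)

silu_and_mul(x, out)                                        # unfiltered, unchanged behaviour
silu_and_mul(x, out, expert_ids=expert_ids, expert_step=1)  # skips rows whose expert id is -1
```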

MoE call sites (triton_utils/fused_moe.py)

  • Replaced both act_and_mul_triton(...) calls with silu_and_mul / gelu_and_mul passing expert_ids and expert_step.
  • CUDA: filter_expert path → JIT kernel with expert_ids (skips filtered rows).
  • HIP / XPU: filter_expert path → AOT sgl_kernel.silu_and_mul/gelu_and_mul (unfiltered). The downstream fused MoE down kernel writes zeros for filtered experts and returns before reading the activation input (fused_moe_triton_kernels.py:192–208), so computing real output for those rows is wasted but harmless. This keeps the PR from being the first time the JIT activation kernel runs on AMD; see the dispatch sketch below.
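
Schematically, the resulting call-site dispatch looks like the sketch below; the function and variable names are invented for illustration, and only the branching logic mirrors the PR:

```python
def activate_gateup(gateup_out, act_out, topk_ids, expert_step,
                    filter_expert, is_cuda,
                    jit_silu_and_mul, aot_silu_and_mul):
    # Illustrative only; the real logic lives in triton_utils/fused_moe.py.
    if filter_expert and is_cuda:
        # JIT kernel skips rows whose routed expert id is -1.
        jit_silu_and_mul(gateup_out, act_out, expert_ids=topk_ids, expert_step=expert_step)
    elif filter_expert:
        # HIP / XPU: AOT kernel computes every row. The downstream down-projection
        # kernel zero-fills filtered experts before reading this buffer, so the
        # extra rows are wasted work but never observed.
        aot_silu_and_mul(gateup_out, act_out)
    else:
        jit_silu_and_mul(gateup_out, act_out)
```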

Removed

  • act_and_mul_kernel (Triton) and act_and_mul_triton wrapper, plus the now-unused _apply_activation and tanh helpers in fused_moe_triton_kernels.py.

Tests (python/sglang/jit_kernel/tests/test_activation.py)

  • Added test_activation_filter_expert parametrized over op × dtype × shape × expert_step ∈ {1, 16} (per-token and sorted/TMA routing).
  • Added test_activation_filter_expert_all_skipped and test_activation_filter_expert_none_skipped edge cases (the latter asserts bit-exact equality with the unfiltered path).
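
The core property the new tests check can be sketched as a reference function like the one below; the names, the NaN sentinel, and the expert_step layout (one id per block of expert_step consecutive rows) are assumptions based on the PR description, not the test file verbatim:

```python
import torch
import torch.nn.functional as F

def reference_filtered_silu_and_mul(x, expert_ids, expert_step):
    """Reference: skipped rows (expert id -1) keep a NaN sentinel instead of real output."""
    d = x.shape[-1] // 2
    ref = F.silu(x[..., :d].float()) * x[..., d:].float()
    out = torch.full_like(ref, float("nan"))
    # Each entry of expert_ids covers expert_step consecutive rows (sorted/TMA layout).
    keep = (expert_ids.repeat_interleave(expert_step) != -1)[: x.shape[0]]
    out[keep] = ref[keep]
    return out.to(x.dtype)
```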

Benchmark (python/sglang/jit_kernel/benchmark/bench_activation.py)

  • Added benchmark_filter comparing the filtered JIT path vs the unfiltered baseline across batch × dim × skip ratio.
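
One plausible way the skip ratio is turned into an expert_ids vector for such a benchmark (a hypothetical helper; the actual benchmark file may construct it differently):

```python
import torch

def make_expert_ids(num_rows: int, skip_ratio: float, device: str = "cuda") -> torch.Tensor:
    """Mark roughly skip_ratio of the rows with -1 so the filtered kernel skips them."""
    ids = torch.zeros(num_rows, dtype=torch.int32, device=device)
    ids[torch.rand(num_rows, device=device) < skip_ratio] = -1
    return ids
```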

Accuracy Tests

pytest python/sglang/jit_kernel/tests/test_activation.py: 348 passed in 23s (135 pre-existing + 213 new filter_expert).

The none_skipped test asserts bit-exact equality between the filtered kernel (with all expert_ids = 0) and the unfiltered kernel, so the filter machinery does not perturb the math.

Speed Tests and Profiling

The new filter benchmark (bench_activation.py::benchmark_filter, bf16):

| op | dim | bs | skip | Unfiltered (μs) | Filtered (μs) |
|------|------|-------|------|-----------------|---------------|
| silu | 8192 | 1024  | 0.00 | 30.26           | 29.85         |
| silu | 8192 | 1024  | 0.25 | 30.27           | 23.87         |
| silu | 8192 | 1024  | 0.50 | 30.26           | 16.13         |
| silu | 4096 | 16384 | 0.00 | 213.89          | 213.63        |
| silu | 4096 | 16384 | 0.25 | 213.86          | 162.16        |
| silu | 4096 | 16384 | 0.50 | 213.83          | 110.20        |
| gelu | 8192 | 1024  | 0.50 | 29.96           | 16.63         |
| gelu | 4096 | 16384 | 0.50 | 214.03          | 116.92        |
  • At skip_ratio=0: the filter machinery adds no measurable cost; the filtered timings stay within ~0.3μs of the unfiltered baseline, and are in fact marginally faster in these runs (e.g. 213.89 → 213.63μs at the largest shape).
  • At skip_ratio>0: work scales linearly with the skipped rows, as expected. At skip=0.50 the 4096×16384 silu case drops from 213.83μs to 110.20μs, close to the ~107μs a pure halving of the rows would predict.

Compared to the deleted Triton act_and_mul_triton kernel (measured against a local copy before removal), the JIT path is ~2–3× faster in the launch-bound regime (small/medium batches, the regime MoE expert tiles hit during decode) and within ±5% at HBM-bandwidth-bound shapes.

Notes for reviewers

  • AMD/HIP coverage: The JIT activation kernel still has no AMD CI suite (the pr-test-jit-kernel.yml workflow runs only --hw cuda). To avoid making this PR a "first run on AMD" gamble, the HIP filter_expert path falls back to the unfiltered AOT sgl_kernel kernel and accepts the small wasted compute on filtered rows. A follow-up PR registering test_activation.py on AMD CI would let us route HIP filter_expert through the JIT kernel too.
  • The PR deletes _apply_activation and tanh from fused_moe_triton_kernels.py — these had no other callers (verified via grep across the repo).

🤖 Generated with Claude Code

…gelu_and_mul

Replace the Triton act_and_mul_triton kernel with an extension to the JIT
CUDA silu_and_mul / gelu_and_mul kernels: they now accept optional
expert_ids / expert_step kwargs and skip rows whose routed expert is -1.

CUDA filter_expert paths in fused_moe.py route through the new JIT path.
HIP keeps using the AOT sgl_kernel silu_and_mul / gelu_and_mul for both
filtered and unfiltered cases — the downstream fused_moe down kernel
writes zeros for filtered experts before reading their input rows
(fused_moe_triton_kernels.py:192-208), so writing a real activation to
those rows is harmless. This avoids exercising the JIT activation kernel
on AMD for the first time in this PR.

Tests: extend test_activation.py with filter_expert coverage (per-token
and sorted/TMA layouts, all-skipped, none-skipped); 348 tests pass.

Benchmark: bench_activation.py adds an unfiltered-vs-filtered comparison
that confirms the expert_ids skip path costs <0.3μs and scales work
linearly with skip ratio.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces expert-based filtering to the JIT activation kernels, allowing computation to be skipped for tokens based on expert IDs. The implementation replaces the previous Triton-based filtered activation with a unified CUDA kernel and updates the MoE runner to utilize this new path. Feedback was provided to include safety checks for the expert_ids tensor, specifically verifying its device compatibility and dimensionality to prevent potential runtime crashes or illegal memory access.

Comment on lines +171 to +174
using namespace host;
RuntimeCheck(is_type<int32_t>(expert_ids.dtype()), "expert_ids must have dtype int32");
RuntimeCheck(expert_step >= 1, "expert_step must be positive");
launch(input, out, type, static_cast<const int32_t*>(expert_ids.data_ptr()), static_cast<uint32_t>(expert_step));
Severity: medium

The expert_ids tensor should be verified to be on the same device as the input and out tensors. Accessing a CPU tensor's data pointer from a CUDA kernel will lead to a segmentation fault or illegal memory access. Additionally, verifying that expert_ids is a 1D tensor ensures the indexing logic in the kernel remains valid.

    using namespace host;
    RuntimeCheck(is_type<int32_t>(expert_ids.dtype()), "expert_ids must have dtype int32");
    RuntimeCheck(expert_ids.device().device_type == input.device().device_type &&
                 expert_ids.device().device_id == input.device().device_id,
                 "expert_ids must be on the same device as input");
    RuntimeCheck(expert_ids.ndim() == 1, "expert_ids must be a 1D tensor");
    RuntimeCheck(expert_step >= 1, "expert_step must be positive");
    launch(input, out, type, static_cast<const int32_t*>(expert_ids.data_ptr()), static_cast<uint32_t>(expert_step));

@ch-wan (Collaborator, Author) commented Apr 25, 2026

/tag-and-rerun-ci

ch-wan and others added 3 commits April 25, 2026 13:58
Remove section banners and trailing comments that restate the next line
of code. Keep load-bearing WHY: the HIP/XPU fall-through note in
fused_moe.py (downstream zero-write makes it safe), test docstrings,
and the sentinel-NaN rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ch-wan ch-wan merged commit c7878db into main Apr 26, 2026
21 of 47 checks passed
@ch-wan ch-wan deleted the deprecate-act-and-mul-triton branch April 26, 2026 08:41
@hnyls2002 hnyls2002 mentioned this pull request Apr 29, 2026
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
…gelu_and_mul (sgl-project#23707)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
