[MoE] Deprecate act_and_mul_triton; fold filter_expert into JIT silu/gelu_and_mul#23707
Conversation
…gelu_and_mul

Replace the Triton `act_and_mul_triton` kernel with an extension to the JIT CUDA `silu_and_mul` / `gelu_and_mul` kernels: they now accept optional `expert_ids` / `expert_step` kwargs and skip rows whose routed expert is -1. CUDA filter_expert paths in `fused_moe.py` route through the new JIT path.

HIP keeps using the AOT `sgl_kernel` `silu_and_mul` / `gelu_and_mul` for both filtered and unfiltered cases — the downstream fused_moe down kernel writes zeros for filtered experts before reading their input rows (`fused_moe_triton_kernels.py:192-208`), so writing a real activation to those rows is harmless. This avoids exercising the JIT activation kernel on AMD for the first time in this PR.

Tests: extend `test_activation.py` with filter_expert coverage (per-token and sorted/TMA layouts, all-skipped, none-skipped); 348 tests pass.

Benchmark: `bench_activation.py` adds an unfiltered-vs-filtered comparison that confirms the `expert_ids` skip path costs <0.3μs and scales work linearly with skip ratio.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request introduces expert-based filtering to the JIT activation kernels, allowing computation to be skipped for tokens based on expert IDs. The implementation replaces the previous Triton-based filtered activation with a unified CUDA kernel and updates the MoE runner to utilize this new path. Feedback was provided to include safety checks for the expert_ids tensor, specifically verifying its device compatibility and dimensionality to prevent potential runtime crashes or illegal memory access.
using namespace host;
RuntimeCheck(is_type<int32_t>(expert_ids.dtype()), "expert_ids must have dtype int32");
RuntimeCheck(expert_step >= 1, "expert_step must be positive");
launch(input, out, type, static_cast<const int32_t*>(expert_ids.data_ptr()), static_cast<uint32_t>(expert_step));
The expert_ids tensor should be verified to be on the same device as the input and out tensors. Accessing a CPU tensor's data pointer from a CUDA kernel will lead to a segmentation fault or illegal memory access. Additionally, verifying that expert_ids is a 1D tensor ensures the indexing logic in the kernel remains valid.
using namespace host;
RuntimeCheck(is_type<int32_t>(expert_ids.dtype()), "expert_ids must have dtype int32");
RuntimeCheck(expert_ids.device().device_type == input.device().device_type &&
expert_ids.device().device_id == input.device().device_id,
"expert_ids must be on the same device as input");
RuntimeCheck(expert_ids.ndim() == 1, "expert_ids must be a 1D tensor");
RuntimeCheck(expert_step >= 1, "expert_step must be positive");
launch(input, out, type, static_cast<const int32_t*>(expert_ids.data_ptr()), static_cast<uint32_t>(expert_step));
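On the Python side, the same safety conditions can be expressed before dispatching to the kernel. Below is a minimal sketch over a lightweight metadata stand-in; the `TensorMeta` type and `check_expert_ids` helper are hypothetical illustrations of the reviewer's suggested checks, not code from this PR:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorMeta:
    """Minimal stand-in for the tensor attributes the RuntimeChecks inspect."""
    dtype: str
    device_type: str
    device_id: int
    ndim: int


def check_expert_ids(expert_ids: TensorMeta, input_: TensorMeta, expert_step: int) -> None:
    """Raise ValueError if expert_ids would be unsafe to hand to the CUDA kernel."""
    if expert_ids.dtype != "int32":
        raise ValueError("expert_ids must have dtype int32")
    # A CPU-resident expert_ids dereferenced from a CUDA kernel is an illegal access.
    if (expert_ids.device_type, expert_ids.device_id) != (input_.device_type, input_.device_id):
        raise ValueError("expert_ids must be on the same device as input")
    if expert_ids.ndim != 1:
        raise ValueError("expert_ids must be a 1D tensor")
    if expert_step < 1:
        raise ValueError("expert_step must be positive")
```

Each condition mirrors one `RuntimeCheck` in the suggested C++ above; catching them at the host boundary turns a crash into a clear error message.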
/tag-and-rerun-ci
Remove section banners and trailing comments that restate the next line of code. Keep load-bearing WHY: the HIP/XPU fall-through note in fused_moe.py (downstream zero-write makes it safe), test docstrings, and the sentinel-NaN rationale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gelu_and_mul (sgl-project#23707) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Motivation
`act_and_mul_triton` (in `fused_moe_triton_kernels.py`) duplicates `silu_and_mul` / `gelu_and_mul`. The only difference is that it skips rows whose routed expert id is `-1` (the `filter_expert=True` MoE path used under EP). The JIT CUDA `silu_and_mul` / `gelu_and_mul` kernels already exist and are faster — the consolidation removes ~100 lines of Triton and a redundant kernel.

Modifications
JIT activation kernel (CUDA)
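As a reference for the kernel's filtering semantics (activation of the gate half times the up half, skipping rows whose routed expert is -1), here is an illustrative NumPy sketch; it is not the CUDA implementation:

```python
import numpy as np


def silu_and_mul_filtered(x, expert_ids=None, expert_step=1):
    """Reference semantics: out[i] = silu(x[i, :d]) * x[i, d:], except rows whose
    routed expert id is -1, which are skipped (the real kernel never writes them)."""
    n, two_d = x.shape
    d = two_d // 2
    out = np.zeros((n, d), dtype=x.dtype)
    for i in range(n):
        # expert_step maps activation rows to entries of expert_ids
        # (1 for per-token routing, >1 for sorted/TMA tile layouts)
        if expert_ids is not None and expert_ids[i // expert_step] == -1:
            continue  # filtered expert: skip this row
        gate, up = x[i, :d], x[i, d:]
        out[i] = gate / (1.0 + np.exp(-gate)) * up  # silu(gate) * up
    return out
```

With `expert_ids=None` this matches the unfiltered `silu_and_mul`; each `-1` entry leaves its `expert_step` rows untouched, which is why the compile-time `kFilterExpert` switch can be zero-cost when filtering is off.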
- `python/sglang/jit_kernel/csrc/elementwise/activation.cuh`: added `expert_ids` (`const int32_t*`) and `expert_step` (`uint32_t`) to `ActivationParams`; added a compile-time `kFilterExpert` template bool to `act_and_mul_kernel` (zero overhead when off — `if constexpr` skips the load); exposed a `run_activation_filtered` host method.
- `python/sglang/jit_kernel/activation.py`: `silu_and_mul` / `gelu_and_mul` / `gelu_tanh_and_mul` / `run_activation` now accept optional `expert_ids` and `expert_step` kwargs. Existing call sites are unchanged.

MoE call sites (`triton_utils/fused_moe.py`)

- Replaced `act_and_mul_triton(...)` calls with `silu_and_mul` / `gelu_and_mul` passing `expert_ids` and `expert_step`.
- CUDA: the filtered path goes through the JIT kernel with `expert_ids` (skips filtered rows).
- HIP/XPU: `sgl_kernel.silu_and_mul` / `gelu_and_mul` (unfiltered). The downstream fused MoE down kernel writes zeros for filtered experts and returns before reading the activation input (`fused_moe_triton_kernels.py:192–208`), so writing real output to those rows is harmless. This avoids exposing the JIT activation kernel to AMD users for the first time in this PR.

Removed

- `act_and_mul_kernel` (Triton) and the `act_and_mul_triton` wrapper, plus the now-unused `_apply_activation` and `tanh` helpers in `fused_moe_triton_kernels.py`.

Tests (`python/sglang/jit_kernel/tests/test_activation.py`)

- `test_activation_filter_expert` parametrized over op × dtype × shape × `expert_step ∈ {1, 16}` (per-token and sorted/TMA routing).
- `test_activation_filter_expert_all_skipped` and `test_activation_filter_expert_none_skipped` edge cases (the latter asserts bit-exact equality with the unfiltered path).

Benchmark (`python/sglang/jit_kernel/benchmark/bench_activation.py`)

- `benchmark_filter` comparing the filtered JIT path vs the unfiltered baseline across batch × dim × skip ratio.

Accuracy Tests
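The all-skipped and none-skipped edge cases can be illustrated with a small NumPy check; this is a hypothetical sketch of what the tests assert, not the actual pytest code:

```python
import numpy as np


def silu_and_mul(x):
    # unfiltered baseline: silu(gate) * up over the two halves of the last dim
    d = x.shape[1] // 2
    gate, up = x[:, :d], x[:, d:]
    return gate / (1.0 + np.exp(-gate)) * up


def silu_and_mul_filtered(x, expert_ids, expert_step=1):
    # keep row i iff its routed expert (expert_ids[i // expert_step]) is not -1
    keep = expert_ids[np.arange(x.shape[0]) // expert_step] != -1
    out = np.zeros((x.shape[0], x.shape[1] // 2), dtype=x.dtype)
    out[keep] = silu_and_mul(x[keep])
    return out


x = np.random.default_rng(0).standard_normal((16, 8))
# none skipped: the filtered path must match the unfiltered one bit-exactly
assert (silu_and_mul_filtered(x, np.zeros(16, dtype=np.int32)) == silu_and_mul(x)).all()
# all skipped: no row is written
assert (silu_and_mul_filtered(x, np.full(16, -1, dtype=np.int32)) == 0.0).all()
```

The none-skipped case is the important one: equality must be exact, not approximate, to show the filter machinery does not perturb the math.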
`pytest python/sglang/jit_kernel/tests/test_activation.py` — 348 passed in 23s (135 pre-existing + 213 new filter_expert).

The `none_skipped` test asserts bit-exact equality between the filtered kernel (with all `expert_ids = 0`) and the unfiltered kernel, so the filter machinery does not perturb the math.

Speed Tests and Profiling
The new filter benchmark (`bench_activation.py::benchmark_filter`, bf16):

- `skip_ratio=0`: filter overhead is bounded (≤ ~0.3μs absolute, negligible at large shapes — e.g. 213.89 → 213.63μs).
- `skip_ratio>0`: work scales linearly with skipped rows, exactly as expected.

Compared to the deleted Triton `act_and_mul_triton` kernel (measured against a local copy before removal), the JIT path is ~2–3× faster in the launch-bound regime (small/medium batches, the regime MoE expert tiles hit during decode) and within ±5% at HBM-bandwidth-bound shapes.

Checklist
Notes for reviewers
- AMD CI does not exercise the JIT kernel tests (the `pr-test-jit-kernel.yml` workflow runs only `--hw cuda`). To avoid making this PR a "first run on AMD" gamble, the HIP filter_expert path falls back to the unfiltered AOT `sgl_kernel` kernel and accepts the small wasted compute on filtered rows. A follow-up PR registering `test_activation.py` on AMD CI would let us route HIP filter_expert through the JIT kernel too.
- Removed `_apply_activation` and `tanh` from `fused_moe_triton_kernels.py` — these had no other callers (verified via grep across the repo).

🤖 Generated with Claude Code