
Revert "feat: Support Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron" #2451

Merged

yzh119 merged 1 commit into main from revert-2304-fused-moe-non-gated-fp8 on Feb 1, 2026

Conversation

@nv-yunzheq
Collaborator

@nv-yunzheq nv-yunzheq commented Jan 31, 2026

Reverts #2304

As it introduces a regression in the unit tests and no longer allows the trtllm deepseek routing kernel to run with fewer than 22 experts.

Summary by CodeRabbit

Release Notes

  • Refactor

    • Consolidated gated activation type handling across MoE implementations with simplified parameter names and enum naming.
    • Unified intermediate size calculations to consistently use 2x configuration.
    • Streamlined routing logic for improved clarity and maintainability.
  • Breaking Changes

    • CLI argument --activation-type renamed to --gated-act with values "swiglu" or "geglu".
    • API parameter names updated from activation_type to gated_act_type across public interfaces.
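To make the rename concrete, here is a minimal sketch of the new enum and a migrated call site. The SwiGlu = 0 / GeGlu = 1 mapping comes from this PR's walkthrough; the function below is purely illustrative, not the actual flashinfer API.

```python
from enum import IntEnum


class GatedActType(IntEnum):
    """Mirrors the GatedActType mapping in this PR (SwiGlu = 0, GeGlu = 1)."""
    SwiGlu = 0
    GeGlu = 1


# Illustrative call site: the keyword was renamed from activation_type
# to gated_act_type across public interfaces.
def run_moe_sketch(*, gated_act_type: int = GatedActType.SwiGlu.value) -> str:
    return GatedActType(gated_act_type).name


print(run_moe_sketch(gated_act_type=GatedActType.GeGlu.value))  # prints "GeGlu"
```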


@coderabbitai
Contributor

coderabbitai bot commented Jan 31, 2026

📝 Walkthrough

Walkthrough

The pull request systematically replaces the generic ActivationType enum with a specialized GatedActType enum (SwiGlu = 0, GeGlu = 1) throughout the flashinfer codebase. This involves updating function signatures, kernel launchers, public API exports, benchmarking utilities, and test implementations to use gated_act_type instead of activation_type, while removing non-gated activation handling code paths.

Changes

  • Public API Exports (flashinfer/__init__.py, flashinfer/fused_moe/__init__.py): Removed ActivationType from public imports and added GatedActType to module exports.
  • Core MoE Implementation (flashinfer/fused_moe/core.py): Introduced the GatedActType enum, replaced activation_type parameters with gated_act_type across constructor signatures, removed activation-type branching logic, and updated all trtllm_moe operation invocations to use the new parameter.
  • Header Declarations (include/flashinfer/trtllm/batched_gemm/KernelRunner.h, include/flashinfer/trtllm/fused_moe/runner.h): Removed the EltwiseActType enum and field, renamed useShuffledMatrix to useShuffledMatrixA, replaced ActivationType with GatedActType in constructor parameters, and removed activation serialization helpers.
  • Routing & Kernel Headers (include/flashinfer/trtllm/fused_moe/DevKernel.h, include/flashinfer/trtllm/fused_moe/RoutingKernel.h): Removed the numTopExperts parameter from routing macros and eliminated the MaxNumTopExperts_ template parameter from KernelParams.
  • CUDA Kernel Launchers (csrc/trtllm_fused_moe_kernel_launcher.cu): Replaced the ActivationType activation_type member with GatedActType gated_act_type, updated getValidConfigs signatures to match, and adjusted all runner instantiations and validation checks to use the new gated activation type.
  • MoE Runner Implementation (csrc/trtllm_fused_moe_runner.cu): Replaced ActivationType-based gating logic with GatedActType, tightened the DeepSeek topK constraint from 22 to 8, unified intermediate size handling to a constant 2x, removed activation-type conditional branching, and updated constructor signatures to use gated_act_type and useShuffledMatrixA.
  • Batched GEMM & Routing Kernels (csrc/trtllm_batched_gemm_runner.cu, csrc/trtllm_fused_moe_routing_deepseek.cu): Simplified the shuffled-matrix configuration check in the GEMM runner, removed eltwise activation type validation, eliminated EltwiseActType printing, and refactored DeepSeek routing constants (consolidated MaxNumTopExperts, removed Nemotron-specific branches).
  • Benchmark Utilities (benchmarks/routines/flashinfer_benchmark_utils.py, benchmarks/routines/moe.py, benchmarks/bench_trtllm_gen_fused_moe_autotuner.py): Removed the enum_type argparse helper, replaced the --activation-type CLI argument with --gated_act (choices: swiglu/geglu), removed the activation_type parameter from benchmark functions, and added internal gated_act_type normalization (swiglu → 0, geglu → 1).
  • MoE Test Suite (tests/moe/test_dpsk_fused_moe_fp8.py, tests/moe/test_trtllm_gen_fused_moe.py, tests/moe/test_trtllm_gen_routed_fused_moe.py, tests/moe/utils.py): Replaced ActivationType imports with GatedActType, updated test calls to use the gated_act_type parameter, removed the is_gated_activation() helper and non-gated activation checks, removed the quant_mode property from the Moe base class, and refactored skip_checks logic to use GatedActType values directly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested labels

run-ci

Suggested reviewers

  • joker-eph
  • aleozlx
  • djmmoss
  • cyx-6
  • yzh119
  • nvmbreughe

Poem

🐰 The types have shifted, swift and clean,
From ActivationType to GatedActType's sheen!
SwiGlu hops through every lane,
Simpler kernels, no branching pain!
A hoppy refactor, if I may say! 🌟

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 32.71%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title clearly indicates this is a revert of a specific feature commit, providing enough context for developers scanning history.
  • Description check (✅ Passed): The description explains what is being reverted and the specific reasons (regression and expert count constraint), fulfilling the main purpose despite minimal template adherence.



@gemini-code-assist
Contributor

Summary of Changes

Hello @nv-yunzheq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request performs a full revert of a prior feature that aimed to introduce non-gated Relu2 activation for Fused MoE operations in NVFP4 and FP8, alongside Nemotron support. The decision to revert was made due to the feature causing unit test regressions and imposing an undesirable restriction on the minimum number of experts for the trtllm deepseek routing kernel. The changes effectively remove the problematic additions, simplify the activation type definitions to focus solely on gated activations, and adjust routing parameters to ensure stability and compatibility.

Highlights

  • Feature Revert: Reverted the previously introduced support for Fused MoE non-gated Relu2 activation in NVFP4 & FP8, and Nemotron-specific configurations.
  • Regression Fix: Addressed regressions on unit tests and a limitation that prevented the trtllm deepseek routing kernel from running with a low number of experts.
  • Type System Refinement: Replaced the generic ActivationType enum with a more specific GatedActType enum, streamlining activation handling to only gated types (SwiGlu, GeGlu).
  • Routing Constraints Adjustment: Reduced the maximum supported topK for DeepSeek routing from 22 to 8, aligning with current kernel capabilities.
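A hypothetical guard illustrating the tightened constraint (the constant and function names here are invented for illustration; the real check lives in the CUDA routing kernel sources):

```python
MAX_NUM_TOP_EXPERTS = 8  # tightened from 22 by this revert


def check_deepseek_top_k(top_k: int) -> None:
    # Hypothetical validation mirroring the new DeepSeek routing limit.
    if not 1 <= top_k <= MAX_NUM_TOP_EXPERTS:
        raise ValueError(
            f"DeepSeek routing supports 1 <= top_k <= {MAX_NUM_TOP_EXPERTS}, "
            f"got {top_k}"
        )


check_deepseek_top_k(8)  # accepted; 9 would raise ValueError
```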



@gemini-code-assist bot left a comment

Code Review

This pull request reverts the feature supporting Fused MoE non-gated Relu2 and Nemotron. The changes are extensive, touching Python benchmarks, tests, and C++/CUDA source code. The revert appears to be thorough and consistent across the codebase, correctly removing the added functionalities and adjusting related logic. I've identified one potential bug in the CUDA kernel logic that could lead to out-of-bounds memory access, which I've detailed in a specific comment.

```diff
 __syncthreads();
 if (warpIdx == 0) {
-  int constexpr NumInterTopKPerThread = (NumInterTopK - 1) / WarpSize + 1;
+  int constexpr NumInterTopKPerThread = (NumInterTopK * NumExpertWarps - 1) / WarpSize + 1;
```
Severity: high

This calculation for NumInterTopKPerThread seems incorrect and could lead to an out-of-bounds access in the following loop.

NumInterTopK is defined as NumExpertWarps * MaxNumTopExperts. The shared memory arrays smemInterTopScores and smemInterTopExperts are of size NumInterTopK. The loop starting at line 195 iterates up to NumInterTopKPerThread * WarpSize, which approximates to NumInterTopK * NumExpertWarps. If NumExpertWarps > 1, this will cause out-of-bounds access to the shared memory arrays.

The previous implementation (NumInterTopK - 1) / WarpSize + 1 seems correct for calculating the number of elements per thread for the reduction. I suggest reverting to that.

```cpp
        int constexpr NumInterTopKPerThread = (NumInterTopK - 1) / WarpSize + 1;
```
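The arithmetic behind that suggestion can be checked in isolation: `(n - 1) / w + 1` is ceiling division, and multiplying by NumExpertWarps again inflates the loop bound past the shared-memory extent. The sizes below are illustrative, not taken from the kernel:

```python
WARP_SIZE = 32


def ceil_div(n: int, d: int) -> int:
    # (n - 1) // d + 1 == ceil(n / d) for positive n and d
    return (n - 1) // d + 1


# Illustrative sizes: NumInterTopK already folds in NumExpertWarps.
num_expert_warps = 4
max_num_top_experts = 8
num_inter_top_k = num_expert_warps * max_num_top_experts  # 32, smem extent

per_thread_ok = ceil_div(num_inter_top_k, WARP_SIZE)                      # 1
per_thread_bad = ceil_div(num_inter_top_k * num_expert_warps, WARP_SIZE)  # 4

# The loop visits per_thread * WARP_SIZE indices; that must not exceed
# the shared-memory array size (num_inter_top_k).
print(per_thread_ok * WARP_SIZE <= num_inter_top_k)   # True: in bounds
print(per_thread_bad * WARP_SIZE <= num_inter_top_k)  # False: out of bounds
```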

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (1)

1939-1947: ⚠️ Potential issue | 🟡 Minor

Silence unused gated_act_type in the fake op.

Static analysis flags this as unused; add a no-op reference to keep the signature aligned with the real op.

🧹 Suggested fix
```diff
 def _fake_trtllm_fp4_block_scale_moe(
     routing_logits: torch.Tensor,
@@
     gated_act_type: int,
     output: Optional[torch.Tensor],
     tune_max_num_tokens: int,
 ):
+    _ = gated_act_type  # keep signature in sync with real op
     seq_len = hidden_states.shape[0]
     hidden_size = hidden_states.shape[1] if output is None else output.shape[1]
```
tests/moe/test_trtllm_gen_fused_moe.py (1)

2094-2108: ⚠️ Potential issue | 🟡 Minor

Reference paths hardcode SwiGlu; GeGlu cases may miscompare.

If GeGlu is supported for these modes, pass through args.gated_act_type. If not, make sure skip_checks explicitly skips GeGlu for these impls.

🛠️ Suggested fix (pass through gated_act_type)
```diff
-        GatedActType.SwiGlu.value,  # gated_act_type
+        args.gated_act_type,
-        GatedActType.SwiGlu.value,  # gated_act_type
+        args.gated_act_type,
-        GatedActType.SwiGlu.value,  # gated_act_type
+        args.gated_act_type,
```

Also applies to: 2131-2145, 2162-2176

🧹 Nitpick comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)

120-192: Reduce per‑thread scratch size in the inter‑warp top‑K merge.

NumInterTopK already includes NumExpertWarps, so multiplying by NumExpertWarps again inflates NumInterTopKPerThread and per‑thread arrays. Consider using a simple ceil division to avoid extra register pressure.

♻️ Suggested adjustment
```diff
-        int constexpr NumInterTopKPerThread = (NumInterTopK * NumExpertWarps - 1) / WarpSize + 1;
+        int constexpr NumInterTopKPerThread = (NumInterTopK - 1) / WarpSize + 1;
```

@yongwww yongwww added the run-ci label Jan 31, 2026
@yongwww
Member

yongwww commented Jan 31, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !284 has been created, and the CI pipeline #42953047 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #42953047: 10/20 passed

@yzh119 yzh119 merged commit 87a45d1 into main Feb 1, 2026
43 checks passed
@yzh119 yzh119 deleted the revert-2304-fused-moe-non-gated-fp8 branch February 1, 2026 04:59
raayandhar pushed a commit to raayandhar/flashinfer that referenced this pull request Feb 5, 2026