
Revert "feat: Support Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron" #2451

Merged

yzh119 merged 1 commit into main from revert-2304-fused-moe-non-gated-fp8 on Feb 1, 2026

Conversation

@nv-yunzheq
Collaborator

@nv-yunzheq nv-yunzheq commented Jan 31, 2026

Reverts #2304

As it introduces a regression in the unit tests and no longer allows the trtllm deepseek routing kernel to run with fewer than 22 experts.

Summary by CodeRabbit

Release Notes

  • Refactor

    • Consolidated gated activation type handling across MoE implementations with simplified parameter names and enum naming.
    • Unified intermediate size calculations to consistently use 2x configuration.
    • Streamlined routing logic for improved clarity and maintainability.
  • Breaking Changes

    • CLI argument --activation-type renamed to --gated-act with values "swiglu" or "geglu".
    • API parameter names updated from activation_type to gated_act_type across public interfaces.
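To make the rename concrete, here is a minimal sketch of the new enum and a migrated call site. The SwiGlu = 0 / GeGlu = 1 mapping comes from this PR's walkthrough; the function below is purely illustrative, not the actual flashinfer API.

```python
from enum import IntEnum


class GatedActType(IntEnum):
    """Mirrors the GatedActType mapping in this PR (SwiGlu = 0, GeGlu = 1)."""
    SwiGlu = 0
    GeGlu = 1


# Illustrative call site: the keyword was renamed from activation_type
# to gated_act_type across public interfaces.
def run_moe_sketch(*, gated_act_type: int = GatedActType.SwiGlu.value) -> str:
    return GatedActType(gated_act_type).name


print(run_moe_sketch(gated_act_type=GatedActType.GeGlu.value))  # prints "GeGlu"
```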


@coderabbitai
Contributor

coderabbitai bot commented Jan 31, 2026

📝 Walkthrough

Walkthrough

The pull request systematically replaces the generic ActivationType enum with a specialized GatedActType enum (SwiGlu = 0, GeGlu = 1) throughout the flashinfer codebase. This involves updating function signatures, kernel launchers, public API exports, benchmarking utilities, and test implementations to use gated_act_type instead of activation_type, while removing non-gated activation handling code paths.

Changes

  • Public API Exports (flashinfer/__init__.py, flashinfer/fused_moe/__init__.py): Removed ActivationType from public imports and added GatedActType to module exports.
  • Core MoE Implementation (flashinfer/fused_moe/core.py): Introduced the GatedActType enum, replaced activation_type parameters with gated_act_type across constructor signatures, removed activation-type branching logic, and updated all trtllm_moe operation invocations to use the new parameter.
  • Header Declarations (include/flashinfer/trtllm/batched_gemm/KernelRunner.h, include/flashinfer/trtllm/fused_moe/runner.h): Removed the EltwiseActType enum and field, renamed useShuffledMatrix to useShuffledMatrixA, replaced ActivationType with GatedActType in constructor parameters, and removed activation serialization helpers.
  • Routing & Kernel Headers (include/flashinfer/trtllm/fused_moe/DevKernel.h, include/flashinfer/trtllm/fused_moe/RoutingKernel.h): Removed the numTopExperts parameter from routing macros and eliminated the MaxNumTopExperts_ template parameter from KernelParams.
  • CUDA Kernel Launchers (csrc/trtllm_fused_moe_kernel_launcher.cu): Replaced the ActivationType activation_type member with GatedActType gated_act_type, updated getValidConfigs signatures to match, and adjusted all runner instantiations and validation checks to use the new gated activation type.
  • MoE Runner Implementation (csrc/trtllm_fused_moe_runner.cu): Replaced ActivationType-based gating logic with GatedActType, tightened the DeepSeek topK constraint from 22 to 8, unified intermediate size handling to a constant 2x, removed activation-type conditional branching, and updated constructor signatures to use gated_act_type and useShuffledMatrixA.
  • Batched GEMM & Routing Kernels (csrc/trtllm_batched_gemm_runner.cu, csrc/trtllm_fused_moe_routing_deepseek.cu): Simplified the shuffled-matrix configuration check in the GEMM runner, removed eltwise activation type validation, eliminated EltwiseActType printing, and refactored DeepSeek routing constants (consolidated MaxNumTopExperts, removed Nemotron-specific branches).
  • Benchmark Utilities (benchmarks/routines/flashinfer_benchmark_utils.py, benchmarks/routines/moe.py, benchmarks/bench_trtllm_gen_fused_moe_autotuner.py): Removed the enum_type argparse helper, replaced the --activation-type CLI argument with --gated_act (choices: swiglu/geglu), removed the activation_type parameter from benchmark functions, and added internal gated_act_type normalization (swiglu → 0, geglu → 1).
  • MoE Test Suite (tests/moe/test_dpsk_fused_moe_fp8.py, tests/moe/test_trtllm_gen_fused_moe.py, tests/moe/test_trtllm_gen_routed_fused_moe.py, tests/moe/utils.py): Replaced ActivationType imports with GatedActType, updated test calls to use the gated_act_type parameter, removed the is_gated_activation() helper and non-gated activation checks, removed the quant_mode property from the Moe base class, and refactored skip_checks logic to use GatedActType values directly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested labels

run-ci

Suggested reviewers

  • joker-eph
  • aleozlx
  • djmmoss
  • cyx-6
  • yzh119
  • nvmbreughe

Poem

🐰 The types have shifted, swift and clean,
From ActivationType to GatedActType's sheen!
SwiGlu hops through every lane,
Simpler kernels, no branching pain!
A hoppy refactor, if I may say! 🌟

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 32.71%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title clearly indicates this is a revert of a specific feature commit, providing enough context for developers scanning history.
  • Description check (✅ Passed): The description explains what is being reverted and the specific reasons (regression and expert count constraint), fulfilling the main purpose despite minimal template adherence.



@gemini-code-assist
Contributor

Summary of Changes

Hello @nv-yunzheq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request performs a full revert of a prior feature that aimed to introduce non-gated Relu2 activation for Fused MoE operations in NVFP4 and FP8, alongside Nemotron support. The decision to revert was made due to the feature causing unit test regressions and imposing an undesirable restriction on the minimum number of experts for the trtllm deepseek routing kernel. The changes effectively remove the problematic additions, simplify the activation type definitions to focus solely on gated activations, and adjust routing parameters to ensure stability and compatibility.

Highlights

  • Feature Revert: Reverted the previously introduced support for Fused MoE non-gated Relu2 activation in NVFP4 & FP8, and Nemotron-specific configurations.
  • Regression Fix: Addressed regressions on unit tests and a limitation that prevented the trtllm deepseek routing kernel from running with a low number of experts.
  • Type System Refinement: Replaced the generic ActivationType enum with a more specific GatedActType enum, streamlining activation handling to only gated types (SwiGlu, GeGlu).
  • Routing Constraints Adjustment: Reduced the maximum supported topK for DeepSeek routing from 22 to 8, aligning with current kernel capabilities.
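A hypothetical guard illustrating the tightened constraint (the constant and function names here are invented for illustration; the real check lives in the CUDA routing kernel sources):

```python
MAX_NUM_TOP_EXPERTS = 8  # tightened from 22 by this revert


def check_deepseek_top_k(top_k: int) -> None:
    # Hypothetical validation mirroring the new DeepSeek routing limit.
    if not 1 <= top_k <= MAX_NUM_TOP_EXPERTS:
        raise ValueError(
            f"DeepSeek routing supports 1 <= top_k <= {MAX_NUM_TOP_EXPERTS}, "
            f"got {top_k}"
        )


check_deepseek_top_k(8)  # accepted; 9 would raise ValueError
```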



@gemini-code-assist bot left a comment

Code Review

This pull request reverts the feature supporting Fused MoE non-gated Relu2 and Nemotron. The changes are extensive, touching Python benchmarks, tests, and C++/CUDA source code. The revert appears to be thorough and consistent across the codebase, correctly removing the added functionalities and adjusting related logic. I've identified one potential bug in the CUDA kernel logic that could lead to out-of-bounds memory access, which I've detailed in a specific comment.

```diff
 __syncthreads();
 if (warpIdx == 0) {
-  int constexpr NumInterTopKPerThread = (NumInterTopK - 1) / WarpSize + 1;
+  int constexpr NumInterTopKPerThread = (NumInterTopK * NumExpertWarps - 1) / WarpSize + 1;
```
Severity: high

This calculation for NumInterTopKPerThread seems incorrect and could lead to an out-of-bounds access in the following loop.

NumInterTopK is defined as NumExpertWarps * MaxNumTopExperts. The shared memory arrays smemInterTopScores and smemInterTopExperts are of size NumInterTopK. The loop starting at line 195 iterates up to NumInterTopKPerThread * WarpSize, which approximates to NumInterTopK * NumExpertWarps. If NumExpertWarps > 1, this will cause out-of-bounds access to the shared memory arrays.

The previous implementation (NumInterTopK - 1) / WarpSize + 1 seems correct for calculating the number of elements per thread for the reduction. I suggest reverting to that.

```cpp
        int constexpr NumInterTopKPerThread = (NumInterTopK - 1) / WarpSize + 1;
```
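The arithmetic behind that suggestion can be checked in isolation: `(n - 1) / w + 1` is ceiling division, and multiplying by NumExpertWarps again inflates the loop bound past the shared-memory extent. The sizes below are illustrative, not taken from the kernel:

```python
WARP_SIZE = 32


def ceil_div(n: int, d: int) -> int:
    # (n - 1) // d + 1 == ceil(n / d) for positive n and d
    return (n - 1) // d + 1


# Illustrative sizes: NumInterTopK already folds in NumExpertWarps.
num_expert_warps = 4
max_num_top_experts = 8
num_inter_top_k = num_expert_warps * max_num_top_experts  # 32, smem extent

per_thread_ok = ceil_div(num_inter_top_k, WARP_SIZE)                      # 1
per_thread_bad = ceil_div(num_inter_top_k * num_expert_warps, WARP_SIZE)  # 4

# The loop visits per_thread * WARP_SIZE indices; that must not exceed
# the shared-memory array size (num_inter_top_k).
print(per_thread_ok * WARP_SIZE <= num_inter_top_k)   # True: in bounds
print(per_thread_bad * WARP_SIZE <= num_inter_top_k)  # False: out of bounds
```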

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (1)

1939-1947: ⚠️ Potential issue | 🟡 Minor

Silence unused gated_act_type in the fake op.

Static analysis flags this as unused; add a no-op reference to keep the signature aligned with the real op.

🧹 Suggested fix
```diff
 def _fake_trtllm_fp4_block_scale_moe(
     routing_logits: torch.Tensor,
@@
     gated_act_type: int,
     output: Optional[torch.Tensor],
     tune_max_num_tokens: int,
 ):
+    _ = gated_act_type  # keep signature in sync with real op
     seq_len = hidden_states.shape[0]
     hidden_size = hidden_states.shape[1] if output is None else output.shape[1]
```
tests/moe/test_trtllm_gen_fused_moe.py (1)

2094-2108: ⚠️ Potential issue | 🟡 Minor

Reference paths hardcode SwiGlu; GeGlu cases may miscompare.

If GeGlu is supported for these modes, pass through args.gated_act_type. If not, make sure skip_checks explicitly skips GeGlu for these impls.

🛠️ Suggested fix (pass through gated_act_type)
```diff
-        GatedActType.SwiGlu.value,  # gated_act_type
+        args.gated_act_type,
-        GatedActType.SwiGlu.value,  # gated_act_type
+        args.gated_act_type,
-        GatedActType.SwiGlu.value,  # gated_act_type
+        args.gated_act_type,
```

Also applies to: 2131-2145, 2162-2176

🧹 Nitpick comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)

120-192: Reduce per‑thread scratch size in the inter‑warp top‑K merge.

NumInterTopK already includes NumExpertWarps, so multiplying by NumExpertWarps again inflates NumInterTopKPerThread and per‑thread arrays. Consider using a simple ceil division to avoid extra register pressure.

♻️ Suggested adjustment
```diff
-        int constexpr NumInterTopKPerThread = (NumInterTopK * NumExpertWarps - 1) / WarpSize + 1;
+        int constexpr NumInterTopKPerThread = (NumInterTopK - 1) / WarpSize + 1;
```

@yongwww yongwww added the run-ci label Jan 31, 2026
@yongwww
Member

yongwww commented Jan 31, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !284 has been created, and the CI pipeline #42953047 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #42953047: 10/20 passed

@yzh119 yzh119 merged commit 87a45d1 into main Feb 1, 2026
43 checks passed
@yzh119 yzh119 deleted the revert-2304-fused-moe-non-gated-fp8 branch February 1, 2026 04:59
raayandhar pushed a commit to raayandhar/flashinfer that referenced this pull request Feb 5, 2026