
feat: Support Fused MoE non-gated Relu2 NVFP4 & FP8 and support Nemotron, fixed #2462

Merged
yzh119 merged 31 commits into flashinfer-ai:main from amitz-nv:fused-moe-non-gated-fp8
Feb 4, 2026
Conversation

@amitz-nv
Contributor

@amitz-nv amitz-nv commented Feb 2, 2026

📌 Description

NOTE: This is the fixed version of #2304 that was merged and reverted.

  • Replaced the problematic condition in DeepSeek routing that required NumExperts >= MaxSupportedTopExperts with the check topK <= numExperts.
    • DeepSeek R1 works with the new condition (tested with vLLM).
  • Removed irrelevant test cases.
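A minimal sketch of the revised check described above (the function and parameter names are hypothetical; the real condition lives in the CUDA routing launcher):

```python
def validate_deepseek_routing(top_k: int, num_experts: int,
                              max_top_experts: int = 22) -> None:
    # Old check (removed): num_experts >= max_top_experts, which rejected
    # small models even when top_k was well within limits.
    # New check: never request more experts than the model actually has.
    if top_k > num_experts:
        raise ValueError(f"topK ({top_k}) must be <= numExperts ({num_experts})")
    if top_k > max_top_experts:
        raise ValueError(f"topK ({top_k}) exceeds supported max ({max_top_experts})")
```

Under the old condition a model with, say, 8 experts would have been rejected outright; the new condition only constrains the relationship between top-K and the actual expert count.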

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Refactor

    • Replaced old gated-activation API with a unified ActivationType enum (many activation kinds supported).
    • Propagated activation_type across MoE workflows and kernels.
  • New Features

    • Added CLI option --activation-type to select activation kind for MoE benchmarks.
  • Bug Fixes

    • Enforced activation compatibility and validation for FP8/FP4 paths.
  • Tests

    • Updated and expanded tests to cover new activation types and compatibility scenarios.
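The unified enum the summary describes could look roughly like this in Python (member names are taken from the PR discussion; the numeric values and the exact member set are assumptions):

```python
from enum import Enum

class ActivationType(Enum):
    # Gated activations: GEMM1 output is split into a gate half and a value half.
    Swiglu = 0
    Geglu = 1
    # Element-wise (non-gated) activations: applied directly to the GEMM1 output.
    Relu2 = 2      # squared ReLU, i.e. relu(x) ** 2
    Identity = 3

def is_gated_activation(act: ActivationType) -> bool:
    # Gated kinds consume 2x the intermediate size (gate + value).
    return act in (ActivationType.Swiglu, ActivationType.Geglu)
```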

@coderabbitai

coderabbitai bot commented Feb 2, 2026

📝 Walkthrough

Walkthrough

Replaces GatedActType with a broader ActivationType across Python benchmarks/CLI, test utilities, C++ runners/launchers, headers, and CUDA kernels; threads activation_type through benchmark/autotuner/runner/kernel call paths; adds eltwise activation support and extends DeepSeek top-experts/top-K sizing and launch macros.

Changes

Cohort / File(s) — Summary

  • Public API exports (flashinfer/__init__.py, flashinfer/fused_moe/__init__.py)
    Removed GatedActType from and added ActivationType to public imports/exports.
  • Python MoE & benchmarks (benchmarks/bench_trtllm_gen_fused_moe_autotuner.py, benchmarks/routines/flashinfer_benchmark_utils.py, benchmarks/routines/moe.py, flashinfer/fused_moe/core.py)
    Added an activation_type parameter and a --activation-type CLI option (using the new enum_type argparse helper); threaded activation_type through the FP8/FP4 autotuner and benchmark flows; replaced gated_act references with ActivationType propagation.
  • Tests & test utilities (tests/moe/*, tests/moe/utils.py, tests/moe/test_dpsk_fused_moe_fp8.py, tests/moe/test_trtllm_gen_routed_fused_moe.py)
    Replaced gated_act_type with activation_type; added is_gated_activation and NON_GATED_ACTIVATION_SUPPORTED_QUANT_MODES; updated skip_checks, test parametrizations, and expected routing/quant compatibility checks.
  • C++ MoE runner / permute/GEMM (include/flashinfer/trtllm/fused_moe/runner.h, csrc/trtllm_fused_moe_runner.cu, csrc/trtllm_batched_gemm_runner.cu)
    Introduced the ActivationType enum and mapping helpers (activation → gated/eltwise); added mActType and updated constructors/getOptions/signatures to accept ActivationType; adjusted workspace/intermediateSizeFactor and enforced eltwiseActType compatibility in the batched GEMM runner.
  • C++ launchers & kernel launch (csrc/trtllm_fused_moe_kernel_launcher.cu)
    Replaced GatedActType with ActivationType across launcher implementations; updated init/getValidConfigs signatures and members to use ActivationType; unified SwiGlu → Swiglu naming.
  • Routing / DeepSeek kernel macros & sizing (include/flashinfer/trtllm/fused_moe/DevKernel.h, include/flashinfer/trtllm/fused_moe/RoutingKernel.h, csrc/trtllm_fused_moe_routing_deepseek.cu)
    Added a MaxNumTopExperts template parameter and constants; expanded top-K buffer sizing and sentinel handling; updated the LAUNCH_ROUTING_DEEPSEEK macros to accept/pass numTopExperts; adjusted DeepSeek top-K limits and group/top-K checks.
  • BatchedGemm options / eltwise (include/flashinfer/trtllm/batched_gemm/KernelRunner.h)
    Added an EltwiseActType enum and eltwiseActType field to TrtllmGenBatchedGemmRunnerOptions; renamed useShuffledMatrixA → useShuffledMatrix.
  • Misc / core changes (flashinfer/fused_moe/core.py, headers/constructors across the repo)
    Removed the GatedActType enum, migrated constructors/signatures to activation_type, added serialize/isGated helpers, and forwarded activation_type into C++/CUDA backends and tactic-selection flows.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/Bench
    participant Py as Python MoE Routines
    participant Runner as C++ MoE Runner
    participant Kernel as CUDA Kernel

    CLI->>Py: parse --activation-type (enum)
    CLI->>Py: invoke benchmark/autotuner with activation_type
    Py->>Runner: call MoE entry (activation_type.value, numTopExperts, eltwiseActType)
    Runner->>Runner: map activation_type → gated/eltwise, compute intermediateSizeFactor
    Runner->>Kernel: launch kernel with activation_type and top-K params
    Kernel-->>Runner: return timings/results
    Runner-->>Py: return results (include activation_type.name)
    Py-->>CLI: print/store results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

v0.6.2

Suggested reviewers

  • joker-eph
  • aleozlx
  • cyx-6
  • djmmoss
  • IwakuraRein
  • yzh119

Poem

🐇 I hopped from SwiGlu to ActivationType so spry,
I threaded my value from CLI up to the sky,
Top-K grew wider, kernels learned my name,
A rabbit rewired launchers — what a game! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — docstring coverage is 30.09%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check ✅ Passed — the title accurately describes the main changes: adding non-gated Relu2 activation support in NVFP4/FP8, introducing the ActivationType enum, adding Nemotron support, and indicating this is a fixed version.
  • Description check ✅ Passed — the description covers all main points: element-wise activation support, the ActivationType replacement, Nemotron support, the naming change, and fixes from the previous reverted PR. Pre-commit completed and test status addressed.






@gemini-code-assist

Summary of Changes

Hello @amitz-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly expands the capabilities of fused Mixture-of-Experts (MoE) operations by introducing a more flexible activation function system. It adds support for the Relu2 activation, crucial for certain model architectures, and integrates compatibility for Nemotron models within the DeepSeek routing mechanism. The changes also include important fixes to DeepSeek routing logic and general code cleanup, ensuring broader applicability and robustness of the MoE implementation.

Highlights

  • Activation Type Refactoring: The GatedActType enum has been replaced with a more comprehensive ActivationType enum across both Python and C++ codebases. This new enum supports a wider range of activation functions, including Relu2 and Identity, alongside existing Swiglu and Geglu.
  • Relu2 Activation Support: Introduced support for the Relu2 (squared ReLU) element-wise activation function within fused Mixture-of-Experts (MoE) operations. This is specifically enabled for NVFP4 and FP8PerTensor quantization modes.
  • Nemotron Model Support: DeepSeek routing has been enhanced to support Nemotron models. This includes increasing the maximum supported top_k value from 8 to 22 and adjusting the expert count limits to accommodate models with up to 512 experts.
  • DeepSeek Routing Fixes: A problematic condition (NumExperts >= MaxSupportedTopExperts) that previously affected DeepSeek R1 routing has been removed, improving compatibility, especially with VLLM. Irrelevant test cases related to this condition were also removed.
  • Codebase Modernization: The useShuffledMatrixA parameter has been renamed to useShuffledMatrix for improved clarity and consistency throughout the code.





@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant feature enhancements and fixes. It adds support for the non-gated Relu2 activation in Fused MoE for NVFP4 and FP8, and support for Nemotron models in DeepSeek routing. A key refactoring is the replacement of GatedActType with a more generic ActivationType enum, which has been consistently applied across both Python and C++ codebases. The changes also include important fixes, such as removing a restrictive check on the number of experts and correcting a bug in the routing kernel. The test suite has been updated to cover these new features, ensuring robustness. Overall, the changes are well-implemented and improve the capabilities and correctness of the library.

Comment on lines +226 to +227
// For simplicity pass set scaleAct to scaleGateC
gemmData.mInputBuffers.mPtrScaleAct = scaleGateC;

medium

The comment "For simplicity pass set scaleAct to scaleGateC" suggests this might be a temporary solution. While this might work for the current set of activation functions (e.g., if Relu2 doesn't use mPtrScaleAct), it could lead to latent bugs if new element-wise activations are added that require a specific scaleAct value different from scaleGateC.

To improve clarity and prevent future issues, consider passing scaleAct as a separate parameter to the run function and setting mPtrScaleAct accordingly. If scaleGateC is indeed the correct value for all cases, a more detailed comment explaining why would be beneficial.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
csrc/trtllm_fused_moe_routing_deepseek.cu (2)

196-216: ⚠️ Potential issue | 🟡 Minor

Inconsistent sentinel values used for invalid expert indices.

Line 200 uses MaxSupportedExpertCount - 1 (511) as the sentinel for invalid entries, while Line 215 uses KernelParams::MaxNumExperts - 1 for the same purpose. This inconsistency could cause subtle issues during the final top-K reduction if the sentinel values have different sort ordering relative to valid indices.

Consider using a consistent sentinel value across both code paths.

🔧 Suggested fix for consistency
          } else {
            intermidiateScore[ii] = invalidScoreFloat;
-           intermidiateExpert[ii] = KernelParams::MaxNumExperts - 1;
+           intermidiateExpert[ii] = MaxSupportedExpertCount - 1;
          }

518-541: ⚠️ Potential issue | 🔴 Critical

Add validation to enforce topK ≤ 8 for non-Nemotron expert counts, or extend conditional dispatch to all branches.

The LAUNCH_ROUTING_DEEPSEEK macro instantiates non-Nemotron expert configurations (MaxNumExpertsUnit, Deepseek, KimiK2) with MaxNumTopExperts=8, but the global validation at line 560 only enforces mTopK <= 22 without checking expert count. This creates a buffer overflow risk: if a non-Nemotron model passes mTopK > 8 at runtime, the kernel's topScores[MaxNumTopExperts] and topExperts[MaxNumTopExperts] arrays (line 125-126) would overflow when the kernel attempts to write results at indices beyond 7, while kernel logic like lines 191-196 assumes MaxNumTopExperts >= mTopK.

Either:

  1. Add a runtime check preventing non-Nemotron models from requesting mTopK > 8, or
  2. Extend the conditional topK dispatch (lines 531-537) to other expert branches as well.
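Option 1 can be sketched as a runtime guard; every name here is hypothetical, and the expert-count boundary separating the two compiled branches is purely an assumption for illustration:

```python
def check_deepseek_top_k(num_experts: int, top_k: int) -> None:
    # Sketch: branches compiled with MaxNumTopExperts = 8 must reject
    # top_k > 8 before launch, instead of relying only on the global
    # top_k <= 22 check. NEMOTRON_EXPERT_THRESHOLD is an assumed boundary.
    NEMOTRON_EXPERT_THRESHOLD = 384
    max_top_k = 22 if num_experts > NEMOTRON_EXPERT_THRESHOLD else 8
    if top_k > max_top_k:
        raise ValueError(
            f"top_k={top_k} exceeds MaxNumTopExperts={max_top_k} "
            f"for num_experts={num_experts}"
        )
```

The point is only that the guard must be branch-aware: a single global `top_k <= 22` check cannot protect the `topScores`/`topExperts` arrays in branches compiled with the smaller bound.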
benchmarks/routines/moe.py (1)

1230-1265: ⚠️ Potential issue | 🟡 Minor

Add validation that activation_type is Swiglu for the FP8 block-scale MoE benchmark.

The run_fp8_block_moe function doesn't validate the activation type, unlike the autotuner, which explicitly rejects non-Swiglu activations. To prevent silent errors and maintain consistency with bench_trtllm_gen_fused_moe_autotuner.py, add a check that raises an error if args.activation_type != ActivationType.Swiglu.
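A hedged sketch of the suggested early validation (assuming a Python-level ActivationType enum; the helper name and the minimal enum here are invented for illustration):

```python
from enum import Enum

class ActivationType(Enum):
    Swiglu = 0
    Relu2 = 2

def check_fp8_block_scale_activation(activation_type: ActivationType) -> None:
    # Fail fast at the benchmark entry point, mirroring the autotuner's
    # explicit rejection of non-Swiglu activations for this quant mode,
    # instead of failing deep inside the kernel path.
    if activation_type != ActivationType.Swiglu:
        raise ValueError(
            f"FP8 block-scale MoE supports only Swiglu, "
            f"got {activation_type.name}"
        )
```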

tests/moe/test_trtllm_gen_fused_moe.py (1)

2163-2170: ⚠️ Potential issue | 🟠 Major

Inconsistent activation_type passing: using .value in some places but not others.

At line 2167, args.activation_type.value is passed to moe_args_dequant, but at line 2286 in run_moe_reference_mxint4, args.activation_type is passed directly (without .value). Similarly, lines 2204 and 2235 use .value.

Looking at the moe_args_dequant constructor and run_moe_dequant function, the activation_type is used in a dictionary lookup at line 1956-1960:

activation_type_to_func = {
    ActivationType.Swiglu: F.silu,
    ActivationType.Geglu: F.gelu,
    ActivationType.Relu2: lambda x: F.relu(x) ** 2,
}
activation_func = activation_type_to_func[activation_type]

This expects ActivationType enum values, not integers. Using .value will cause a KeyError since the dict keys are enum members, not integers.
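The failure mode can be demonstrated in isolation. This assumes ActivationType is a plain enum.Enum; note that with IntEnum the integer key would coincidentally hash and compare equal to the member, and the lookup would succeed. The stand-in lambdas replace the torch functions from the real table:

```python
from enum import Enum

class ActivationType(Enum):
    Swiglu = 0
    Relu2 = 2

# Stand-in callables; the real table maps to F.silu / F.gelu / relu(x)**2.
activation_type_to_func = {
    ActivationType.Swiglu: lambda x: x,
    ActivationType.Relu2: lambda x: max(x, 0.0) ** 2,
}

# Enum-member keys work as expected:
assert activation_type_to_func[ActivationType.Relu2](3.0) == 9.0

# Passing .value (the raw int) does not: plain Enum members neither hash
# nor compare equal to their values, so the dict lookup raises KeyError.
try:
    activation_type_to_func[ActivationType.Relu2.value]
except KeyError:
    pass  # 2 is not a dict key; only the enum members are
```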

🐛 Proposed fix: Remove `.value` from activation_type arguments
--- a/tests/moe/test_trtllm_gen_fused_moe.py
+++ b/tests/moe/test_trtllm_gen_fused_moe.py
@@ -2164,7 +2164,7 @@ def run_moe_reference_dsfp8(args):
         gemm2_weights_dequant,
         args.permute_info,
         args.use_routing_scales_on_input,
-        args.activation_type.value,
+        args.activation_type,
     )

     return run_moe_dequant(args_dequant, QuantMode.FP8_BLOCK_SCALE), args_dequant
@@ -2201,7 +2201,7 @@ def run_moe_reference_per_tensor_scale_fp8(args):
         gemm2_weights_dequant,
         args.permute_info,
         args.use_routing_scales_on_input,
-        args.activation_type.value,
+        args.activation_type,
     )

     return run_moe_dequant(args_dequant, QuantMode.FP8_PER_TENSOR), args_dequant
@@ -2232,7 +2232,7 @@ def run_moe_reference_bf16(args):
         gemm2_weights_dequant,
         args.permute_info,
         args.use_routing_scales_on_input,
-        args.activation_type.value,
+        args.activation_type,
     )

     return run_moe_dequant(args_dequant, QuantMode.BF16), args_dequant
🤖 Fix all issues with AI agents
In `@csrc/trtllm_fused_moe_runner.cu`:
- Around line 196-222: The function activationTypeToGatedActType is missing a
case for ActivationType::SwigluBias which causes the default-check to fire at
runtime; update activationTypeToGatedActType to handle
ActivationType::SwigluBias (returning the appropriate gated enum, e.g.,
ActType::SwiGlu) alongside ActivationType::Swiglu so that
isGatedActivation-consistent callers (like getOptions) no longer hit the
FLASHINFER_CHECK default branch.

In `@tests/moe/utils.py`:
- Around line 90-94: Update the three routing configuration dicts named kimi_k2,
DSv3, and DSLite in test_dpsk_fused_moe_fp8.py to include the required key
"compatible_activation_types" (a list of activation names). Add a
compatible_activation_types entry that includes the activation(s) used by the
tests (for example ["gelu", "relu", "swish"] or the specific activation(s) the
test suite iterates over) so pytest.skip is not triggered by an empty default.
🧹 Nitpick comments (5)
csrc/trtllm_batched_gemm_runner.cu (1)

226-227: Clarify the scaleAct assignment.

The comment indicates this is a simplification. Consider adding a more detailed comment explaining when scaleAct should differ from scaleGateC, or whether this is the correct semantic for all element-wise activation types.

include/flashinfer/trtllm/fused_moe/runner.h (1)

171-172: Consider throwing an error for invalid activation types.

The TODO comment indicates this should throw an error. Currently, returning "InvalidActivationType" silently accepts invalid values.

Proposed fix
     default:
-      return "InvalidActivationType";  // TODO throw error
+      FLASHINFER_CHECK(false, "Invalid activation type: ", static_cast<int64_t>(activationType));
+      return "InvalidActivationType";  // Unreachable
csrc/trtllm_fused_moe_kernel_launcher.cu (1)

419-424: Consider exposing activation_type in BF16 MoE API.

The comment "not exposed in api for now" suggests this is intentional. If non-gated activations (like Relu2) should eventually be supported for BF16 MoE, this would need to be parameterized.

tests/moe/utils.py (1)

105-108: Threshold change from >= 512 to > 512.

This change allows 512 experts with intermediate_size > 512 to run, where previously they were skipped. Verify this is intentional and that these configurations are supported.

tests/moe/test_dpsk_fused_moe_fp8.py (1)

590-591: Redundant CUDA availability check.

Per repository conventions, tests assume CUDA is available and do not require torch.cuda.is_available() guards. This check is redundant but not harmful.

Comment on lines +196 to +222
static inline ActType activationTypeToGatedActType(ActivationType actType) {
switch (actType) {
case ActivationType::Swiglu:
return ActType::SwiGlu;
case ActivationType::Geglu:
return ActType::GeGlu;
default:
FLASHINFER_CHECK(false, "Unsupported gated activation type ",
serializeActivationType(actType), " of enum ",
static_cast<int64_t>(actType));
}
return ActType::SwiGlu;
}

static inline EltwiseActType activationTypeToEltwiseActType(ActivationType actType) {
switch (actType) {
case ActivationType::Relu2:
return EltwiseActType::Relu2;
case ActivationType::Identity:
return EltwiseActType::None;
default:
FLASHINFER_CHECK(false, "Unsupported eltwise activation type ",
serializeActivationType(actType), " of enum ",
static_cast<int64_t>(actType));
}
return EltwiseActType::None;
}

⚠️ Potential issue | 🟡 Minor


Add SwigluBias case to activationTypeToGatedActType function.

The isGatedActivation helper correctly identifies SwigluBias as a gated activation, but activationTypeToGatedActType lacks a case for it, causing a runtime check failure when SwigluBias is used in line 234 of the getOptions function.

Proposed fix
 static inline ActType activationTypeToGatedActType(ActivationType actType) {
   switch (actType) {
     case ActivationType::Swiglu:
       return ActType::SwiGlu;
     case ActivationType::Geglu:
       return ActType::GeGlu;
+    case ActivationType::SwigluBias:
+      return ActType::SwiGlu;
     default:
       FLASHINFER_CHECK(false, "Unsupported gated activation type ",
                        serializeActivationType(actType), " of enum ",
                        static_cast<int64_t>(actType));
   }
   return ActType::SwiGlu;
 }

Comment on lines +90 to +94
compatible_activation_types = routing_config.get("compatible_activation_types", [])
if activation_type not in compatible_activation_types:
pytest.skip(
f"Incompatible: activation_type={activation_type} not in compatible_activation_types ({compatible_activation_types})"
)

⚠️ Potential issue | 🔴 Critical


test_dpsk_fused_moe_fp8.py routing configs require compatible_activation_types field

The routing configuration dictionaries in test_dpsk_fused_moe_fp8.py (lines 508–545) are missing the compatible_activation_types field. Without this field, all tests using these configurations will be unexpectedly skipped when compatible_activation_types defaults to an empty list. Update the three routing config definitions (kimi_k2, DSv3, DSLite) to include this required field.

🤖 Prompt for AI Agents
In `@tests/moe/utils.py` around lines 90 - 94: update the three routing
configuration dicts named kimi_k2, DSv3, and DSLite in
test_dpsk_fused_moe_fp8.py to include the required key
"compatible_activation_types" (a list of activation names). Add a
compatible_activation_types entry listing the activation(s) the tests
exercise (for example ["gelu", "relu", "swish"], or the specific
activation(s) the test suite iterates over) so pytest.skip is not
triggered by the empty default.
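As an illustration, here is a hedged sketch of what such a routing config entry and the corresponding skip check might look like. The field names and values are hypothetical, not copied from test_dpsk_fused_moe_fp8.py; only the `compatible_activation_types` key comes from the review comment above.

```python
# Hypothetical routing config entry carrying the missing key. All other
# fields are illustrative placeholders, not the real DSLite values.
DSLITE_ROUTING_CONFIG = {
    "num_experts": 64,
    "top_k": 6,
    # Without this list, a default of [] would make the skip helper
    # skip every test case that uses this config.
    "compatible_activation_types": ["Swiglu", "Relu2"],
}


def is_activation_compatible(routing_config: dict, activation: str) -> bool:
    # Mirrors the skip condition described above: a test runs only when
    # the requested activation appears in the config's compatibility list.
    return activation in routing_config.get("compatible_activation_types", [])
```

With this shape, `is_activation_compatible(DSLITE_ROUTING_CONFIG, "Relu2")` is true, while a config missing the key is incompatible with everything, which is exactly the unintended mass-skip the review flags.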

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
…ant launched max value

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@yzh119
Collaborator

yzh119 commented Feb 2, 2026

@flashinfer-bot run

@yzh119
Collaborator

yzh119 commented Feb 2, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !288 has been created, and the CI pipeline #43117758 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELING] Pipeline #43117758: canceled

@yzh119
Collaborator

yzh119 commented Feb 3, 2026

@flashinfer-bot run

@yzh119
Collaborator

yzh119 commented Feb 3, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !288 has been updated with latest changes, and the CI pipeline #43168329 is currently running. I'll report back once the pipeline job completes.

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@amitz-nv amitz-nv force-pushed the fused-moe-non-gated-fp8 branch from fa81b33 to df1ae03 Compare February 3, 2026 12:13
@flashinfer-bot
Collaborator

[FAILED] Pipeline #43168329: 10/20 passed

@nv-yunzheq
Collaborator

LGTM.

@yongwww
Member

yongwww commented Feb 4, 2026

Collaborator

@aleozlx aleozlx left a comment


lgtm

@yzh119 yzh119 merged commit e284274 into flashinfer-ai:main Feb 4, 2026
28 of 34 checks passed
raayandhar pushed a commit to raayandhar/flashinfer that referenced this pull request Feb 5, 2026
…ron, fixed (flashinfer-ai#2462)


## 📌 Description

- Support element-wise activation (relu^2) in fused MoE for NVFP4 and
FP8PerTensor.
- Use new ActivationType enum class instead of GatedActType.
- Support Nemotron in deepseek routing as in
NVIDIA/TensorRT-LLM#9792
- Remove 'A' suffix from UseShuffledMatrixA.

NOTE: This is the fixed version of
flashinfer-ai#2304 that was merged
and reverted.
- Replaced the problematic condition in deepseek routing that required
`NumExperts >= MaxSupportedTopExperts` with `topK<=numExperts`
  - DeepSeek R1 works with it (tested with VLLM).
- Removed irrelevant test cases.



Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
- 4: Geglu
- 5: SwigluBias
- 6: Relu2
- 7: Identity
Collaborator

@IwakuraRein IwakuraRein Feb 5, 2026


In csrc/trtllm_fused_moe_runner.cu, the functions activationTypeToGatedActType and activationTypeToEltwiseActType restrict the activation functions to [Swiglu, Geglu, Relu2, Identity]. The comment needs to be updated accordingly.

Additionally, I doubt that Geglu is actually supported for per-tensor FP8; it seems the corresponding cubins are not generated.

Contributor Author


You're right, the docstring here should probably reflect what's supported instead of detailing the entire enum.
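For illustration only, the restriction described in this thread can be sketched as an enum plus an allow-set. This is a Python sketch under assumptions: the real code is C++ in csrc/trtllm_fused_moe_runner.cu, only values 4-7 appear in the quoted excerpt, and the numeric value assumed for Swiglu is a guess.

```python
from enum import Enum


class ActivationType(Enum):
    # Only entries 4-7 are quoted in the excerpt above; Swiglu's value
    # is an assumed placeholder and may not match the real enum.
    Swiglu = 3
    Geglu = 4
    SwigluBias = 5
    Relu2 = 6
    Identity = 7


# Per the review comment, the gated/element-wise conversion helpers
# accept only these four activation kinds.
_SUPPORTED = {
    ActivationType.Swiglu,
    ActivationType.Geglu,
    ActivationType.Relu2,
    ActivationType.Identity,
}


def check_activation(act: ActivationType) -> None:
    """Raise if the fused-MoE runner cannot convert this activation."""
    if act not in _SUPPORTED:
        raise ValueError(f"unsupported activation for fused MoE: {act.name}")
```

Under this model, SwigluBias would be rejected even though it exists in the enum, which matches the point that the docstring should list what the helpers actually accept rather than the full enum.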
