
Feature: Support non-gated activation in cutlass fused MoE nvfp4#2011

Merged
yzh119 merged 2 commits into flashinfer-ai:main from omera-nv:feat/relu2_moe_fp4 on Oct 31, 2025

Conversation

@omera-nv
Contributor

@omera-nv omera-nv commented Oct 30, 2025

📌 Description

This PR removes an assertion in the CUTLASS fused MoE bindings to enable non-gated activations in the nvfp4 path.
It also adds a test for this path with the relu2 activation.
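For background, relu2 (squared ReLU) is a non-gated activation, whereas swiglu is gated: a gated fc1 fuses the gate and up projections and therefore produces twice the intermediate size. A minimal NumPy sketch of the two patterns (illustrative only, not the PR's code):

```python
import numpy as np

def swiglu(fc1_out: np.ndarray) -> np.ndarray:
    # Gated: fc1 output holds [gate | up] halves, so fc1 maps hidden -> 2 * inter.
    gate, up = np.split(fc1_out, 2, axis=-1)
    return (gate / (1.0 + np.exp(-gate))) * up  # silu(gate) * up

def relu2(fc1_out: np.ndarray) -> np.ndarray:
    # Non-gated: squared ReLU over a single projection, fc1 maps hidden -> inter.
    return np.maximum(fc1_out, 0.0) ** 2

# Both yield an `inter`-sized activation, but from different fc1 output widths.
inter = 8
assert swiglu(np.ones((2, 2 * inter), dtype=np.float32)).shape == (2, inter)
assert relu2(np.ones((2, inter), dtype=np.float32)).shape == (2, inter)
```

The removed assertion hard-coded the gated (2× intermediate) shape, which is why relu2 previously failed shape validation in the nvfp4 path.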

🔍 Related Issues

N/A

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • [v] I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • [v] I have installed the hooks with pre-commit install.
  • [v] I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • [v] Tests have been added or updated as needed.
  • [v] All tests are passing (unittest, etc.).

Reviewer Notes

N/A

Summary by CodeRabbit

  • New Features
    • Enhanced quantized Mixture of Experts models to support configurable activation types (Swiglu and ReLU2) in the NVFP4 quantization path.
    • Improved parameter handling to correctly adapt weight shapes and quantization settings based on the selected activation type.

Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @omera-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the fused Mixture-of-Experts (MoE) implementation by introducing support for non-gated activation functions, specifically Relu2, within the NVFP4 quantization scheme. The core change involves relaxing a shape assertion in the CUTLASS backend to accommodate the different input dimensions required by non-gated activations, alongside comprehensive testing to ensure correctness and compatibility.

Highlights

  • Non-Gated Activation Support: Removed an assertion in the CUTLASS fused MoE bindings to enable support for non-gated activation functions, such as Relu2, in NVFP4 quantization.
  • Dynamic Shape Assertions: The shape assertion for fc1_weight_block in the getQuantParams function is now conditional, adapting its expected dimensions based on whether the activation type is gated or non-gated.
  • Relu2 Activation Test: Added a new dedicated test case (test_relu2_moe_fp4.py) to validate the functionality of cutlass_fused_moe with Relu2 activation and FP8 quantization.
  • Expanded Test Coverage: The existing test_moe_nvfp4 in test_trtllm_cutlass_fused_moe.py has been updated to parameterize activation_type, allowing it to test both Swiglu (gated) and Relu2 (non-gated) activations.

@coderabbitai
Contributor

coderabbitai bot commented Oct 30, 2025

Walkthrough

Added activation type parameter (Swiglu/Relu2) to MoE quantization paths. Modified getQuantParams() to accept base_activation_type and conditionally validate weight shapes based on gated vs non-gated activation logic. Updated test suite to parametrize and propagate activation_type through reference and fused MoE implementations.

Changes

Cohort / File(s) Summary
Cutlass binding activation type threading
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu
Extended getQuantParams() method signature with base_activation_type parameter (default: ActivationType::Swiglu). Conditionalized weight shape validation for NVFP4 quantization paths based on isGatedActivation() check, distinguishing gated (Swiglu) from non-gated (Relu2) activation constraints. Updated all call sites to pass base_activation_type argument.
Test suite activation type parameterization
tests/moe/test_trtllm_cutlass_fused_moe.py
Extended torch_moe_nvfp4() and test functions to accept activation_type parameter. Implemented per-expert activation callbacks: Swiglu path (w1/w3 split with silu) and Relu2 path (squared ReLU). Parametrized test cases across ActivationType.Swiglu and ActivationType.Relu2. Adjusted weight tensor construction (w1_n) and shape handling conditionally based on activation type. Propagated activation_type through reference and fused MoE implementations.
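The per-expert activation callbacks described above could be sketched roughly as follows (hypothetical helper names, and a plain string in place of the real `ActivationType` enum):

```python
import numpy as np

def expert_ffn(a, w1, w2, activation_type):
    """One expert: fc1 -> activation -> fc2, branching on activation type.

    For "swiglu", w1 stacks gate and up rows: shape [2 * n, k].
    For "relu2", w1 is a single projection: shape [n, k].
    """
    h = a @ w1.T
    if activation_type == "swiglu":
        gate, up = np.split(h, 2, axis=-1)
        h = (gate / (1.0 + np.exp(-gate))) * up  # silu(gate) * up
    elif activation_type == "relu2":
        h = np.maximum(h, 0.0) ** 2              # squared ReLU
    else:
        raise ValueError(f"unsupported activation: {activation_type}")
    return h @ w2.T

k, n = 16, 8
rng = np.random.default_rng(0)
a = rng.standard_normal((2, k), dtype=np.float32)
w2 = rng.standard_normal((k, n), dtype=np.float32)
gated = expert_ffn(a, rng.standard_normal((2 * n, k), dtype=np.float32), w2, "swiglu")
non_gated = expert_ffn(a, rng.standard_normal((n, k), dtype=np.float32), w2, "relu2")
assert gated.shape == non_gated.shape == (2, k)
```

Both branches produce the same output shape; only the fc1 weight width differs, which is what the test's conditional `w1_n` construction accounts for.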

Sequence Diagram

sequenceDiagram
    participant Test as Test Suite
    participant RefImpl as Reference Impl
    participant CutlassBinding as Cutlass Binding
    participant QuantPath as Quant Path (NVFP4)

    Test->>Test: Parameterize activation_type<br/>(Swiglu / Relu2)
    Test->>RefImpl: Call torch_moe_nvfp4<br/>with activation_type
    RefImpl->>RefImpl: Select per-expert act():<br/>Swiglu or Relu2
    RefImpl->>RefImpl: Compute MoE output
    Test->>CutlassBinding: Call fused_moe<br/>with activation_type
    CutlassBinding->>CutlassBinding: getQuantParams()<br/>receives base_activation_type
    CutlassBinding->>QuantPath: Route to conditional<br/>weight validation
    alt isGatedActivation(base_activation_type)
        QuantPath->>QuantPath: Gated constraint:<br/>fc1_weight_block.size(2)<br/>aligned to 2n
    else Non-gated
        QuantPath->>QuantPath: Non-gated constraint:<br/>fc1_weight_block.size(2)<br/>aligned to n
    end
    QuantPath-->>CutlassBinding: Return QuantParams
    CutlassBinding-->>Test: MoE result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Activation type conditionals: Weight shape validation logic branches based on isGatedActivation() require careful verification to ensure both paths (Swiglu/Relu2) are correctly constrained.
  • Parameter threading: Trace base_activation_type propagation from C++ binding through to test implementations to confirm consistency across call sites.
  • Test parameterization: Verify that w1_n construction (2n vs n) and tensor shapes align correctly for each activation type in the test suite.

Possibly related PRs

Suggested reviewers

  • cyx-6
  • yongwww
  • yzh119
  • wenscarl

Poem

🐰 Activation types now bloom in dual array,
Swiglu and Relu2 dance their separate way,
Gated vs non-gated shapes align just right,
MoE bindings flow with quantized light!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 11.11%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check ✅ Passed: The PR title "Feature: Support non-gated activation in cutlass fused MoE nvfp4" directly and clearly describes the main objective of the changeset: removing an assertion in the cutlass fused MoE bindings to enable non-gated activations (like relu2) in nvfp4. The title is concise, specific, and free of noise or vague terminology.
  • Description Check ✅ Passed: The PR description follows the repository's template. The "Description" section explains the assertion removal and the new relu2 test, the "Related Issues" section is appropriately marked N/A, the checklist is complete with pre-commit and test items checked, and the optional "Reviewer Notes" section is included (marked N/A).

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9287c9 and e19f167.

📒 Files selected for processing (2)
  • csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu (4 hunks)
  • tests/moe/test_trtllm_cutlass_fused_moe.py (11 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/moe/test_trtllm_cutlass_fused_moe.py (1)
flashinfer/fused_moe/core.py (1)
  • ActivationType (76-85)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1)
  • isGatedActivation (253-256)
🪛 Ruff (0.14.2)
tests/moe/test_trtllm_cutlass_fused_moe.py

165-165: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (8)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu (3)

239-239: LGTM! Function signature updates are well-structured.

The addition of base_activation_type parameter with a default value of ActivationType::Swiglu maintains backward compatibility while enabling the new functionality. Parameter threading through runMoe, runMoeMinLantency, and getQuantParams is consistent and correct.

Also applies to: 364-365, 418-418, 545-546, 812-815


293-301: LGTM! Weight size validation correctly handles gated and non-gated activations.

The conditional validation properly accounts for:

  • Gated activations (Swiglu, Geglu, SwigluBias): 2× intermediate size for gate and up projections
  • Non-gated activations (Relu2, etc.): 1× intermediate size

Error messages are clear and descriptive.

Also applies to: 468-476
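In effect, the expected fc1 weight extent reduces to a single conditional (a Python sketch of the logic, not the C++ source):

```python
def expected_fc1_out_dim(inter_size: int, is_gated: bool) -> int:
    # Gated activations (Swiglu, Geglu, SwigluBias) fuse gate + up projections,
    # so fc1 spans 2 * inter_size; non-gated ones (Relu2, ...) span inter_size.
    return 2 * inter_size if is_gated else inter_size

assert expected_fc1_out_dim(128, is_gated=True) == 256
assert expected_fc1_out_dim(128, is_gated=False) == 128
```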


1017-1043: Excellent! Core change enables non-gated activations for nvfp4.

This change correctly removes the previous unconditional assertion and replaces it with conditional validation based on isGatedActivation(base_activation_type):

  • Gated path: Validates inter_size * 2 alignment (for gate + up projections)
  • Non-gated path: Validates inter_size alignment

The alignment requirements, block scale calculations, and error messages are correctly adjusted for each case. This directly addresses the PR objective of enabling non-gated activations like Relu2 for nvfp4.
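The alignment distinction can be pictured like this (a sketch; the scaling-block granularity of 16 is an assumed value for illustration, not taken from this PR):

```python
def fc1_dim_is_aligned(inter_size: int, is_gated: bool, block: int = 16) -> bool:
    # Gated path validates inter_size * 2 against the block granularity;
    # the non-gated path validates inter_size directly.
    dim = inter_size * 2 if is_gated else inter_size
    return dim % block == 0

assert fc1_dim_is_aligned(128, is_gated=True)
assert fc1_dim_is_aligned(8, is_gated=True)       # 16 % 16 == 0
assert not fc1_dim_is_aligned(8, is_gated=False)  # 8 % 16 != 0
```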

tests/moe/test_trtllm_cutlass_fused_moe.py (5)

20-20: LGTM! Import correctly added.

The import of ActivationType is necessary for parameterizing tests with different activation types.


141-185: LGTM! Reference implementation correctly handles both activation types.

The function properly implements:

  • Swiglu: Splits weights into gate (w1) and up (w3) projections, applies F.silu(a @ w1.t()) * (a @ w3.t())
  • Relu2: Applies squared ReLU activation F.relu(a @ weight.t()) ** 2

The conditional logic cleanly handles the different activation patterns, and error handling for unsupported types is appropriate.

Note: Static analysis flagged line 165 (TRY003), but the error message is appropriately concise for this context.


380-384: LGTM! Test parameterization properly covers both activation types.

The parameterization ensures that both Swiglu and Relu2 activation paths are tested, with clear test IDs for easy identification. This aligns with the PR objective of adding test coverage for the relu2 activation path with nvfp4.

Also applies to: 398-398


414-433: LGTM! Weight tensor shapes correctly adjusted for activation type.

The weight tensor construction properly accounts for the different requirements:

  • Swiglu: w1_n = 2 * n to accommodate gate and up projections
  • Relu2: w1_n = n for a single projection

All related tensors (w1, w1_q, w1_q_cutlass, w1_d, w1_blockscale) consistently use w1_n, ensuring proper alignment throughout the test. This correctly mirrors the validation logic in the C++ implementation.

Also applies to: 507-507
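The weight construction the review refers to amounts to the following (illustrative sketch; dimensions are hypothetical, not the test's actual sizes):

```python
import numpy as np

e, n, k = 4, 64, 128  # experts, intermediate size, hidden size (hypothetical)

for gated in (True, False):
    w1_n = 2 * n if gated else n  # fused gate+up vs. a single projection
    w1 = np.zeros((e, w1_n, k), dtype=np.float32)  # fc1 weights per expert
    w2 = np.zeros((e, k, n), dtype=np.float32)     # fc2 weights per expert
    assert w1.shape == (e, w1_n, k) and w2.shape == (e, k, n)
```

All fc1-derived tensors (quantized weights, block scales) then inherit `w1_n`, keeping the test consistent with the C++ validation.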


492-492: LGTM! Activation type correctly threaded through both implementations.

The activation_type parameter is properly passed to both:

  • The fused CUTLASS implementation (line 492)
  • The reference implementation (line 535)

This ensures both paths use the same activation function for accurate validation of the fused implementation against the reference.

Also applies to: 528-536



Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
@omera-nv changed the title from "Feat/relu2 moe fp4" to "Feature: Support non-gated activation in cutlass fused MoE nvfp4" on Oct 30, 2025
@gemini-code-assist bot left a comment


Code Review

This pull request enables non-gated activations like ReLU2 for nvfp4 in the CUTLASS fused MoE kernels. This is achieved by adjusting the weight shape checks based on the activation type. The changes are accompanied by updates to the test suite, including parameterizing an existing test for different activation functions and adding a new test for the relu2 path.

My review focuses on improving code maintainability and test correctness. I've suggested refactoring a piece of duplicated code in the C++ bindings to make it more concise. More importantly, I've pointed out that the new test file lacks assertions to verify the correctness of the computation, which is a critical omission for a test.

@yzh119
Collaborator

yzh119 commented Oct 31, 2025

/bot run

@flashinfer-bot
Collaborator

GitLab MR !102 has been created, and the CI pipeline #37646195 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #37646195: 13/17 passed

@yzh119 yzh119 merged commit f9cd034 into flashinfer-ai:main Oct 31, 2025
4 checks passed
BingooYang pushed a commit to BingooYang/flashinfer that referenced this pull request Mar 13, 2026
…shinfer-ai#2011)

