[1/2][resubmit again] sgl-kernel: Fuse routed scaling factor into moe_fused_gate #9088
Conversation
Summary of Changes
Hello @trevor-m, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refactors the moe_fused_gate kernel within the sgl-kernel to optionally fuse the application of a routed_scaling_factor directly into its output calculation. This enhancement introduces a new control parameter to manage when this scaling is applied, aiming to optimize performance for models like DeepSeek by integrating the scaling operation more efficiently into the kernel execution.
Highlights
- New Parameter for Scaling Control: A new boolean parameter, `apply_routed_scaling_factor_on_output`, has been introduced to the `moe_fused_gate` function across its C++, CUDA, and Python interfaces. This parameter provides explicit control over the application of the scaling factor (see the sketch after this list).
- Fused Scaling Logic: The core CUDA kernel (`moe_fused_gate_impl`) now conditionally applies the `routed_scaling_factor` directly to its output, allowing the scaling operation to be fused and potentially improving performance.
- Python API and Documentation Updates: The Python wrapper and its documentation have been updated to reflect the new parameter and to clarify the intended behavior of `routed_scaling_factor`, especially its interaction with shared experts, to ensure correct scaling.
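For illustration, here is a minimal sketch of what the new flag looks like at the Python call site. The op name and the `apply_routed_scaling_factor_on_output`, `routed_scaling_factor`, `num_fused_shared_experts`, and `topk` names come from this PR; the remaining argument names and all values are assumptions about the surrounding signature, not a definitive rendering of it:

```python
import torch
from sgl_kernel import moe_fused_gate  # import path assumed

# Hypothetical shapes/values for illustration only.
scores = torch.randn(16, 256, device="cuda", dtype=torch.float32)
bias = torch.zeros(256, device="cuda", dtype=torch.float32)

topk_weights, topk_ids = moe_fused_gate(
    scores,
    bias,
    num_expert_group=8,      # assumed parameter name
    topk_group=4,            # assumed parameter name
    topk=8,
    num_fused_shared_experts=0,
    routed_scaling_factor=2.5,
    apply_routed_scaling_factor_on_output=True,  # new parameter from this PR
)
# With the flag set, topk_weights already include routed_scaling_factor,
# so the caller must not multiply by it again downstream.
```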
Summary of Changes
Hello @trevor-m, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new feature to the sgl-kernel's moe_fused_gate operation, allowing for conditional application of a routed scaling factor directly to the output. This enhancement provides finer control over the scaling behavior within Mixture-of-Experts (MoE) computations, which is particularly relevant for models like DeepSeek. It is part of a multi-part series of changes aimed at optimizing MoE operations.
Highlights
- Enhanced MoE Fused Gate Functionality: A new boolean parameter, `apply_routed_scaling_factor_on_output`, has been added to the `moe_fused_gate` function across its C++, CUDA, and Python interfaces.
- Conditional Output Scaling: The CUDA kernel for `moe_fused_gate` now conditionally applies the `routed_scaling_factor` to its output based on the value of the new `apply_routed_scaling_factor_on_output` flag.
- Improved Documentation: The Python interface's docstrings have been updated to explain the purpose of the new parameter and the interaction between `routed_scaling_factor` and `num_fused_shared_experts` for correct scaling of shared experts (a reference sketch follows this list).
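As a plain-PyTorch reference for what the fused path computes (mirroring the kernel diff reviewed below; the function and variable names here are illustrative, not from the PR):

```python
import torch

def reference_gate_scale(topk_scores: torch.Tensor,
                         routed_scaling_factor: float,
                         apply_routed_scaling_factor_on_output: bool) -> torch.Tensor:
    # Normalize the selected top-k scores per token, then optionally fold in
    # the routed scaling factor, as the updated kernel does in a single pass.
    weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    if apply_routed_scaling_factor_on_output:
        weights = weights * routed_scaling_factor
    return weights
```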
Code Review
This pull request introduces a mechanism to fuse the routed scaling factor into the moe_fused_gate kernel, controlled by a new apply_routed_scaling_factor_on_output flag. The changes are consistently applied across the C++ kernel, its Python wrapper, and the corresponding definitions. My review includes a suggestion to optimize the new logic inside the CUDA kernel for better performance by avoiding a division inside a loop.
```cpp
for (int ii = 0; ii < topk; ++ii) {
  int64_t const idx = topk * thread_row + ii;
  output_ptr[idx] = output_ptr[idx] / output_sum;
  if (apply_routed_scaling_factor_on_output) {
    output_ptr[idx] *= routed_scaling_factor;
  }
}
```
For better performance, you can hoist the division and conditional multiplication out of the loop. You can compute a scale factor once before the loop and then apply it with a single multiplication inside the loop. This avoids a potentially expensive division operation in every iteration and can improve kernel efficiency.
```cpp
float scale = 1.0f / output_sum;
if (apply_routed_scaling_factor_on_output) {
  scale *= routed_scaling_factor;
}
for (int ii = 0; ii < topk; ++ii) {
  int64_t const idx = topk * thread_row + ii;
  output_ptr[idx] *= scale;
}
```
Code Review
This pull request introduces an apply_routed_scaling_factor_on_output flag to conditionally apply a scaling factor in the moe_fused_gate kernel. The changes are consistently applied across the C++ kernel, its Python wrapper, and the Torch library definition. My review includes a suggestion to optimize the scaling logic within the CUDA kernel for better performance by avoiding division inside a loop.
```cpp
if (apply_routed_scaling_factor_on_output) {
  output_ptr[idx] *= routed_scaling_factor;
}
```
While this logic is correct, combining it with the division on the previous line inside a loop is inefficient. For better performance, consider calculating a single scaling factor before the loop and applying it with a single multiplication inside the loop.
For example:

```cpp
// Before the loop
float scale = 1.0f / output_sum;
if (apply_routed_scaling_factor_on_output) {
  scale *= static_cast<float>(routed_scaling_factor);
}

// Inside the loop, replacing lines 251-254
output_ptr[idx] *= scale;
```

This would replace the division and conditional multiplication inside the loop with a single, more efficient multiplication.
Motivation
Second resubmit of #8364; see that PR for performance numbers.
This PR contains the sgl-kernel changes that fuse the routed scaling multiply into select_experts. #8690 will enable this fusion for DeepSeek.
The unit test has been removed for now because it would fail until sgl-kernel is updated; it will be re-enabled in #8690.