
enable flashinfer fp8 gemm if deepgemm disabled#13608

Closed
b8zhong wants to merge 2 commits into main from
brayden/auto-enable-flashinfer-fp8-gemm

Conversation

@b8zhong
Collaborator

@b8zhong b8zhong commented Nov 19, 2025

If DeepGEMM is disabled and we are on SM100, don't fall back to Triton: FlashInfer FP8 GEMM is available and significantly more performant.

The reasoning: personally, I use this when the batch size is small, where the speedup over DeepGEMM is quite noticeable.

@gemini-code-assist
Contributor

Summary of Changes

Hello @b8zhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a performance optimization for FP8 General Matrix Multiply (GEMM) operations. It modifies the activation conditions for Flashinfer's FP8 GEMM implementation, ensuring it is utilized on Blackwell architectures (SM100) when the DeepGEMM library is not enabled. This strategic enablement aims to bypass less performant Triton kernels, thereby enhancing computational efficiency.

Highlights

  • Flashinfer FP8 GEMM Activation: Flashinfer FP8 GEMM is now enabled when DeepGEMM is disabled and the system is on Blackwell (SM100), preventing a fallback to Triton.
  • Performance Improvement: This change is expected to significantly improve performance for FP8 GEMM operations under the specified conditions.
  • Conditional Logic Refinement: The logic for enabling Flashinfer FP8 GEMM has been updated to explicitly check for Blackwell support, Flashinfer availability, and either an environment variable flag or the disabling of DeepGEMM.
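
The refined condition in the last highlight can be sketched as a small predicate. This is an illustrative sketch only: the function name, parameter names, and flag semantics below are assumptions for exposition, not sglang's actual identifiers.

```python
# Hedged sketch of the enablement condition described above. All names here
# (should_enable_flashinfer_fp8_gemm, the parameters) are illustrative
# assumptions, not the exact identifiers used in sglang.

def should_enable_flashinfer_fp8_gemm(
    blackwell_supported: bool,   # hardware check (SM100 / Blackwell)
    flashinfer_available: bool,  # FlashInfer installed and importable
    env_flag_set: bool,          # explicit opt-in environment variable
    deepgemm_enabled: bool,      # DeepGEMM backend active
) -> bool:
    # Enabled when the hardware and library both support it, and either the
    # user opted in explicitly or DeepGEMM is disabled.
    return (
        blackwell_supported
        and flashinfer_available
        and (env_flag_set or not deepgemm_enabled)
    )
```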

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly enables Flashinfer FP8 GEMM when DeepGEMM is disabled on Blackwell GPUs. This change aligns with the goal of using more performant kernels, as Flashinfer is preferred over the Triton fallback in this scenario. The introduction of the FLASHINFER_FP8_GEMM_SUPPORTED variable improves code readability by clearly defining the conditions for Flashinfer support. The logic is sound and the implementation is clean.

@Fridge003
Collaborator

But will this block the usage of other fp8 kernels, like the cutlass one?

@b8zhong
Collaborator Author

b8zhong commented Nov 19, 2025

@Fridge003 No, it still works; you can still enable it manually through CUTLASS_BLOCK_FP8_SUPPORTED, as before.

@b8zhong b8zhong added the run-ci label Nov 19, 2025
@Fridge003
Collaborator

> @Fridge003 No, it still works; you can still enable it manually through CUTLASS_BLOCK_FP8_SUPPORTED, as before.

But when I only add the CUTLASS_BLOCK_FP8_SUPPORTED flag, the FlashInfer GEMM will also be enabled. Then the linear layer will pick the FlashInfer implementation, since it ranks first in dispatching.

@github-actions github-actions bot added the quant LLM Quantization label Nov 20, 2025
@b8zhong
Collaborator Author

b8zhong commented Nov 20, 2025

Err, good point. Here is the dispatch logic now:

  1. On supported hardware, DeepGEMM is on by default.

  2. If DeepGEMM is disabled, do the following:

  • If FlashInfer is enabled, or the CUTLASS FP8 env var is not set, use FlashInfer.
  • If the CUTLASS FP8 env var is set, use CUTLASS.
  • If neither of these two is set, use FlashInfer.
  • If not on a compatible platform, use the AITER or Triton implementation.
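
The dispatch order above can be sketched as a small selector function. This is a sketch under stated assumptions: the backend strings, parameter names, and the function itself are illustrative, not sglang's actual dispatch API.

```python
# Illustrative sketch of the dispatch priority described above. Backend names
# and parameters are assumptions for exposition, not sglang's actual API.

def pick_fp8_gemm_backend(
    deepgemm_enabled: bool,    # DeepGEMM on by default on supported hardware
    on_compatible_gpu: bool,   # e.g. SM100 with FlashInfer available
    flashinfer_enabled: bool,  # FlashInfer explicitly enabled
    cutlass_env_set: bool,     # CUTLASS FP8 env var set by the user
    is_hip: bool,              # ROCm platform
) -> str:
    if deepgemm_enabled:
        return "deepgemm"          # step 1: default on supported hardware
    if on_compatible_gpu:
        # step 2: FlashInfer wins unless CUTLASS was requested on its own.
        if flashinfer_enabled or not cutlass_env_set:
            return "flashinfer"
        return "cutlass"
    # Not on a compatible platform: AITER on ROCm, otherwise Triton.
    return "aiter" if is_hip else "triton"
```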

Contributor

Copilot AI left a comment


Pull Request Overview

This PR enables FlashInfer FP8 GEMM as an automatic fallback when DeepGEMM is disabled on SM100 (GB100) GPUs, providing better performance than Triton for small batch sizes.

Key changes:

  • Moves is_blackwell_supported from fp8_utils to utils module and switches to is_sm100_supported for FlashInfer support detection
  • Auto-enables FlashInfer FP8 GEMM when DeepGEMM is disabled and CUTLASS is not explicitly configured
  • Reorganizes dispatch priority to check DeepGEMM first before other backends

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
python/sglang/srt/layers/quantization/modelopt_quant.py Updates import to use is_blackwell_supported from utils instead of fp8_utils
python/sglang/srt/layers/quantization/fp8_utils.py Refactors FlashInfer FP8 GEMM enablement logic and dispatch priority, switches from Blackwell to SM100 detection


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



python/sglang/srt/layers/quantization/fp8_utils.py (diff context):

  from sglang.srt.layers.quantization.fp8_kernel import sglang_per_token_group_quant_fp8
  from sglang.srt.layers.quantization.mxfp4_tensor import MXFP4QuantizeUtil
- from sglang.srt.utils import ceil_div, is_blackwell_supported, offloader
+ from sglang.srt.utils import ceil_div, offloader

Copilot AI Nov 21, 2025


[nitpick] The imports from sglang.srt.utils are split across two separate import statements (lines 9 and 30-40). Consider consolidating them into a single import statement for better code organization and consistency with Python style guidelines.

@b8zhong
Collaborator Author

b8zhong commented Dec 4, 2025

Cleaner solution in #14379

@b8zhong b8zhong closed this Dec 4, 2025
@b8zhong b8zhong deleted the brayden/auto-enable-flashinfer-fp8-gemm branch December 4, 2025 04:59

Labels

quant LLM Quantization run-ci
