[depend on flashinfer 0.6.0) change mm_fp4 to auto for cu13 by vincentzed · Pull Request #16232 · sgl-project/sglang

vincentzed · 2025-12-31T17:12:10Z

Motivation

flashinfer-ai/flashinfer#1979 enables the auto backend for mm_fp4, so we should use it on cu13 (unless the user explicitly specifies a FlashInfer mm_fp4 backend).
If cudnn>=9.15, cuDNN should be used, since it is faster.
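
The threshold rule stated above can be sketched as follows. This is only an illustration of the rule as described in this PR, not FlashInfer's actual implementation; the helper name `pick_fp4_backend` and the tuple representation of the cuDNN version are assumptions made for the example.

```python
from typing import Optional, Tuple


def pick_fp4_backend(cudnn_version: Optional[Tuple[int, int]]) -> str:
    """Hypothetical helper: choose cuDNN when it is new enough, else CUTLASS.

    cudnn_version is a (major, minor) tuple, or None if cuDNN is unavailable.
    """
    # Compare as integer tuples: a string comparison would wrongly rank
    # "9.9" above "9.15".
    if cudnn_version is not None and cudnn_version >= (9, 15):
        return "cudnn"
    return "cutlass"
```

Comparing versions as integer tuples rather than strings keeps 9.9 correctly below 9.15.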

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
After green CI and required approvals, ask Merge Oncalls to merge.

Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>

gemini-code-assist · 2025-12-31T17:12:24Z

Summary of Changes

Hello @vincentzed, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing FP4 GEMM (General Matrix Multiply) operations by integrating the latest features from FlashInfer 0.6.0. The core change introduces an intelligent, automatic backend selection for FP4 GEMM, specifically targeting CUDA 13 environments. This enhancement aims to improve performance and efficiency by leveraging the most suitable underlying library (cuDNN or Cutlass) for matrix multiplications, without requiring manual configuration, thereby streamlining the execution of quantized models.

Highlights

FlashInfer Dependency Update: The pull request updates the FlashInfer dependency to version 0.6.0, which introduces enhanced capabilities for automatic backend selection in FP4 matrix multiplication operations.
Automatic FP4 GEMM Backend Selection for CUDA 13: A new mechanism has been implemented to automatically resolve the mm_fp4 GEMM backend. For CUDA 13 environments, if the FLASHINFER_FP4_GEMM_BACKEND environment variable is not explicitly set, the backend will default to 'auto', allowing FlashInfer to dynamically choose between cuDNN (if version >= 9.15) and Cutlass for optimal performance. For other CUDA versions, it defaults to 'cutlass'.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request updates the mm_fp4 backend selection to use auto for CUDA 13, which is a good improvement that leverages new features in flashinfer. The refactoring to centralize the backend selection logic into a new function, resolve_mm_fp4_gemm_backend, improves code clarity and maintainability. I have one suggestion to further simplify this new function to reduce code duplication. Overall, the changes are well-structured and align with the goals of the pull request.

gemini-code-assist · 2025-12-31T17:14:23Z

python/sglang/srt/layers/quantization/modelopt_quant.py

+    if _IS_CUDA_13:
+        # auto resolution: if cudnn < 9.15, use cutlass else use cudnn
+        # https://github.com/flashinfer-ai/flashinfer/pull/1979
+        backend = FLASHINFER_FP4_GEMM_BACKEND if FLASHINFER_FP4_GEMM_BACKEND else "auto"
+    else:
+        backend = (
+            FLASHINFER_FP4_GEMM_BACKEND if FLASHINFER_FP4_GEMM_BACKEND else "cutlass"
+        )
+    return backend


This function can be simplified to avoid repeating the check for FLASHINFER_FP4_GEMM_BACKEND. By checking for the user-specified backend first, you can reduce code duplication and make the logic more straightforward.

Suggested change

-    if _IS_CUDA_13:
-        # auto resolution: if cudnn < 9.15, use cutlass else use cudnn
-        # https://github.com/flashinfer-ai/flashinfer/pull/1979
-        backend = FLASHINFER_FP4_GEMM_BACKEND if FLASHINFER_FP4_GEMM_BACKEND else "auto"
-    else:
-        backend = (
-            FLASHINFER_FP4_GEMM_BACKEND if FLASHINFER_FP4_GEMM_BACKEND else "cutlass"
-        )
-    return backend
+    if FLASHINFER_FP4_GEMM_BACKEND:
+        return FLASHINFER_FP4_GEMM_BACKEND
+    if _IS_CUDA_13:
+        # auto resolution: if cudnn < 9.15, use cutlass else use cudnn
+        # https://github.com/flashinfer-ai/flashinfer/pull/1979
+        return "auto"
+    return "cutlass"
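
For readers who want to try the precedence rule in isolation, here is a minimal standalone sketch of the simplified logic. It is not the code in modelopt_quant.py: the function name `resolve_backend` and its two parameters are hypothetical, chosen so the rule can be exercised without importing sglang. How `_IS_CUDA_13` is computed is not shown in this thread, so it is passed in as a plain boolean.

```python
import os
from typing import Optional


def resolve_backend(user_backend: Optional[str], is_cuda_13: bool) -> str:
    """Hypothetical standalone variant of the suggested resolution logic."""
    # 1. An explicit user choice always wins (None and "" count as unset).
    if user_backend:
        return user_backend
    # 2. On CUDA 13, defer to FlashInfer's "auto" selection.
    if is_cuda_13:
        return "auto"
    # 3. Otherwise keep the previous default.
    return "cutlass"


if __name__ == "__main__":
    # Assumption: the override arrives via this environment variable,
    # as described in the summary above.
    override = os.environ.get("FLASHINFER_FP4_GEMM_BACKEND")
    print(resolve_backend(override, is_cuda_13=True))
```

Checking the user override first means each default appears exactly once, which is the duplication the review comment points out.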

vincentzed · 2026-01-17T01:31:20Z

Before:

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:09<00:00, 137.32it/s]
Accuracy: 0.699
Invalid: 0.002
Latency: 9.808 s
Output throughput: 13808.993 token/s

After:

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:06<00:00, 207.66it/s]
Accuracy: 0.699
Invalid: 0.002
Latency: 6.387 s
Output throughput: 21204.965 token/s
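
To put the two runs side by side, the relative change can be computed directly from the numbers reported above. This is a small arithmetic sketch over a single before/after pair; the variable names are illustrative only, and accuracy is unchanged at 0.699 in both runs.

```python
# Figures copied from the two gsm8k runs above.
before = {"latency_s": 9.808, "throughput_tok_s": 13808.993}
after = {"latency_s": 6.387, "throughput_tok_s": 21204.965}

# Ratio of output throughput (after / before).
speedup = after["throughput_tok_s"] / before["throughput_tok_s"]
# Fraction of end-to-end latency removed.
latency_cut = 1 - after["latency_s"] / before["latency_s"]

print(f"throughput speedup: {speedup:.2f}x")
print(f"latency reduction: {latency_cut:.1%}")
```

This works out to roughly a 1.54x throughput gain and about a 35% latency reduction for this one configuration.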

Fridge003 · 2026-01-17T01:48:28Z

This change will be covered in #16534
Will add you as co-author in that PR

more

3e55994

Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>

vincentzed requested review from AniZpZ, BBuf, Edwardf0t1, FlamingoPg and ch-wan as code owners December 31, 2025 17:12

github-actions bot added the quant LLM Quantization label Dec 31, 2025

gemini-code-assist bot reviewed Dec 31, 2025

Fridge003 mentioned this pull request Jan 1, 2026

Update flashinfer to 0.6.1 #15551

Merged

6 tasks

Fridge003 closed this Jan 17, 2026

b8zhong deleted the vz/improve-backend-selection-mm-fp4 branch January 18, 2026 17:11

b8zhong mentioned this pull request Jan 18, 2026

[Refactor] Set fp4-gemm-backend=auto on SM100 and rename fp4-gemm-backend with flashinfer_ prefix #17309

Merged

vincentzed wants to merge 1 commit into sgl-project:main from bzhng-development:vz/improve-backend-selection-mm-fp4
