
use fast math for per_token_group_quant_8bit. #9177

Merged
zhyncs merged 8 commits into sgl-project:main from strgrb:fast_math on Aug 15, 2025

Conversation

@strgrb
Collaborator

@strgrb strgrb commented Aug 14, 2025

Motivation

I found that sglang is faster with CUDA 12.4 than with CUDA 12.8 on Hopper, so I profiled the single kernel per_token_group_quant_8bit.

cuda12.4

  void per_token_group_quant_8bit_kernel<__nv_bfloat16, __nv_fp8_e4m3, 1, 0, float>(const T1 *, void *, T5 *, int, int, int, float, float, float, int, int) (14336, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 9.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         2.62
    SM Frequency                    Ghz         1.81
    Elapsed Cycles                cycle       97,524
    Memory Throughput                 %        56.19
    DRAM Throughput                   %        35.62
    Duration                         us        53.60
    L1/TEX Cache Throughput           %        58.55
    L2 Cache Throughput               %        52.34
    SM Active Cycles              cycle    93,136.92
    Compute (SM) Throughput           %        70.90
    ----------------------- ----------- ------------

cuda12.8

  void per_token_group_quant_8bit_kernel<__nv_bfloat16, __nv_fp8_e4m3, 1, 0, float>(const T1 *, void *, T5 *, int, int, int, float, float, float, int, int) (14336, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 9.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         2.62
    SM Frequency                    Ghz         1.82
    Elapsed Cycles                cycle      148,173
    Memory Throughput                 %        81.99
    DRAM Throughput                   %        23.46
    Duration                         us        81.02
    L1/TEX Cache Throughput           %        84.19
    L2 Cache Throughput               %        84.10
    SM Active Cycles              cycle   143,937.88
    Compute (SM) Throughput           %        68.67
    ----------------------- ----------- ------------

By checking the generated SASS code, I found that CUDA 12.4 applies -ftz or --use_fast_math by default, while CUDA 12.8 does not.
For example:

FMNMX.FTZ R14, R15, |R14|, !PT

vs

FMNMX R8, R8, |R17|, !PT

Modifications

I added --use_fast_math to the common_ops target so that all common ops use fast math.
I'll test e2e performance and accuracy for CUDA 12.8 with ds and paste the results here later.
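The change amounts to one extra nvcc flag on the common_ops target in sgl-kernel/CMakeLists.txt. A minimal sketch (the exact placement in the real file may differ; the generator expression restricts the flag to CUDA sources):

```cmake
# Sketch: enable fast math for the common_ops target only.
# For nvcc, --use_fast_math implies --ftz=true, --prec-div=false,
# --prec-sqrt=false, and --fmad=true.
target_compile_options(common_ops PRIVATE
  $<$<COMPILE_LANGUAGE:CUDA>:--use_fast_math>
)
```

The implied --ftz=true is what produces the FMNMX.FTZ instruction seen in the CUDA 12.4 SASS above.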

Accuracy Tests

Benchmarking and Profiling

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @strgrb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a performance optimization for sglang by enabling fast math operations for common CUDA kernels. It addresses an observed performance difference where CUDA 12.8 was slower than CUDA 12.4 for specific operations, a discrepancy traced back to default compiler flag differences. By explicitly adding the --use_fast_math flag to the common_ops compilation, this change aims to restore optimal performance on newer CUDA versions.

Highlights

  • Performance Enhancement: Enables fast math optimizations for sglang's common CUDA operations to improve runtime performance.
  • Compiler Configuration: Modifies sgl-kernel/CMakeLists.txt to include the --use_fast_math flag in the CUDA compile options for the common_ops target.
  • Addressing CUDA Version Discrepancy: Resolves a performance regression where CUDA 12.8 showed slower execution than CUDA 12.4 for certain kernels due to differing default fast math settings.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to improve performance by enabling fast math for common CUDA operations, which is a great initiative. The provided profiling data clearly shows the potential benefits. My review focuses on ensuring the CMake configuration is correct. I've found a potential issue in how the compiler flag is added and have suggested a fix. I've also noted a small opportunity for future code cleanup to make the CMake flags management more consistent.

@strgrb
Collaborator Author

strgrb commented Aug 14, 2025

If there are any problems with the fp4 kernels, I can enable this option for only a subset of the files.

@fzyzcjy
Collaborator

fzyzcjy commented Aug 14, 2025

btw, I already optimized that kernel (but it's not yet merged): #7601

@zhyncs
Collaborator

zhyncs commented Aug 14, 2025

I remember that enabling fast math on cu128 caused accuracy issues in some kernels.

@fzyzcjy
Collaborator

fzyzcjy commented Aug 14, 2025

IIRC, for kernels where fast math is acceptable, we may be able to manually call the NV intrinsics to speed them up.

@strgrb
Collaborator Author

strgrb commented Aug 14, 2025

IIRC, for kernels where fast math is acceptable, we may be able to manually call the NV intrinsics to speed them up.

Maybe it's better to use a source-file list in CMakeLists.txt to decide which files get fast math?
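A per-file opt-in could be sketched like this (the file names below are placeholders for illustration, not the repository's actual source list):

```cmake
# Hypothetical sketch: opt selected sources into fast math instead of
# enabling it target-wide, so sensitive kernels keep precise math.
set(FAST_MATH_SRCS
  csrc/quant/per_token_group_quant_8bit.cu  # placeholder path
)
set_source_files_properties(${FAST_MATH_SRCS}
  PROPERTIES COMPILE_OPTIONS "--use_fast_math"
)
```

Setting COMPILE_OPTIONS as a source property keeps --use_fast_math off the fp4 kernels while still covering the quant kernels that tolerate it.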

@strgrb changed the title from "use fast math for sgl common ops." to "use fast math for per_token_group_quant_8bit." Aug 14, 2025
@zhyncs zhyncs self-assigned this Aug 14, 2025
@zhyncs zhyncs merged commit 1f9d65f into sgl-project:main Aug 15, 2025
83 of 87 checks passed
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
zhyncs added a commit to FlamingoPg/sglang that referenced this pull request Aug 27, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>

3 participants