
use fast math for per_token_group_quant_8bit. #9177

Merged
zhyncs merged 8 commits into sgl-project:main from strgrb:fast_math on Aug 15, 2025

Conversation

@strgrb
Collaborator

@strgrb strgrb commented Aug 14, 2025

Motivation

I found that sglang is faster with CUDA 12.4 than with CUDA 12.8 on Hopper, so I profiled the single kernel per_token_group_quant_8bit.

cuda12.4

  void per_token_group_quant_8bit_kernel<__nv_bfloat16, __nv_fp8_e4m3, 1, 0, float>(const T1 *, void *, T5 *, int, int, int, float, float, float, int, int) (14336, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 9.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         2.62
    SM Frequency                    Ghz         1.81
    Elapsed Cycles                cycle       97,524
    Memory Throughput                 %        56.19
    DRAM Throughput                   %        35.62
    Duration                         us        53.60
    L1/TEX Cache Throughput           %        58.55
    L2 Cache Throughput               %        52.34
    SM Active Cycles              cycle    93,136.92
    Compute (SM) Throughput           %        70.90
    ----------------------- ----------- ------------

cuda12.8

  void per_token_group_quant_8bit_kernel<__nv_bfloat16, __nv_fp8_e4m3, 1, 0, float>(const T1 *, void *, T5 *, int, int, int, float, float, float, int, int) (14336, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 9.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         2.62
    SM Frequency                    Ghz         1.82
    Elapsed Cycles                cycle      148,173
    Memory Throughput                 %        81.99
    DRAM Throughput                   %        23.46
    Duration                         us        81.02
    L1/TEX Cache Throughput           %        84.19
    L2 Cache Throughput               %        84.10
    SM Active Cycles              cycle   143,937.88
    Compute (SM) Throughput           %        68.67
    ----------------------- ----------- ------------

By checking the generated SASS code, I found that CUDA 12.4 applies -ftz or --use_fast_math by default, while CUDA 12.8 does not.
For example:

FMNMX.FTZ R14, R15, |R14|, !PT

vs

FMNMX R8, R8, |R17|, !PT

Modifications

I added --use_fast_math to the common_ops target so that all common ops use fast math.
I'll test e2e performance and accuracy for CUDA 12.8 with ds and paste the results here later.
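The change amounts to one extra nvcc flag on the common_ops target in sgl-kernel/CMakeLists.txt. A minimal sketch (the exact placement in the real file may differ; the generator expression restricts the flag to CUDA sources):

```cmake
# Sketch: enable fast math for the common_ops target only.
# For nvcc, --use_fast_math implies --ftz=true, --prec-div=false,
# --prec-sqrt=false, and --fmad=true.
target_compile_options(common_ops PRIVATE
  $<$<COMPILE_LANGUAGE:CUDA>:--use_fast_math>
)
```

The implied --ftz=true is what produces the FMNMX.FTZ instruction seen in the CUDA 12.4 SASS above.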

Accuracy Tests

Benchmarking and Profiling

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @strgrb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a performance optimization for sglang by enabling fast math operations for common CUDA kernels. It addresses an observed performance difference where CUDA 12.8 was slower than CUDA 12.4 for specific operations, a discrepancy traced back to default compiler flag differences. By explicitly adding the --use_fast_math flag to the common_ops compilation, this change aims to restore optimal performance on newer CUDA versions.

Highlights

  • Performance Enhancement: Enables fast math optimizations for sglang's common CUDA operations to improve runtime performance.
  • Compiler Configuration: Modifies sgl-kernel/CMakeLists.txt to include the --use_fast_math flag in the CUDA compile options for the common_ops target.
  • Addressing CUDA Version Discrepancy: Resolves a performance regression where CUDA 12.8 showed slower execution than CUDA 12.4 for certain kernels due to differing default fast math settings.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to improve performance by enabling fast math for common CUDA operations, which is a great initiative. The provided profiling data clearly shows the potential benefits. My review focuses on ensuring the CMake configuration is correct. I've found a potential issue in how the compiler flag is added and have suggested a fix. I've also noted a small opportunity for future code cleanup to make the CMake flags management more consistent.

@strgrb
Collaborator Author

strgrb commented Aug 14, 2025

If there are any problems with the fp4 kernels, I can enable this option for only a subset of the files.

@fzyzcjy
Collaborator

fzyzcjy commented Aug 14, 2025

btw, I already optimized that kernel (but it's not yet merged): #7601

@zhyncs
Collaborator

zhyncs commented Aug 14, 2025

I remember that enabling fast math on cu128 caused accuracy issues in some kernels.

@fzyzcjy
Collaborator

fzyzcjy commented Aug 14, 2025

IIRC, for kernels where fast math is acceptable, we may be able to manually call the NV intrinsics to speed them up.

@strgrb
Collaborator Author

strgrb commented Aug 14, 2025

IIRC, for kernels where fast math is acceptable, we may be able to manually call the NV intrinsics to speed them up.

Maybe it's better to use a source-file list in CMakeLists.txt to decide which files get fast math?
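A per-file opt-in could be sketched like this (the file names below are placeholders for illustration, not the repository's actual source list):

```cmake
# Hypothetical sketch: opt selected sources into fast math instead of
# enabling it target-wide, so sensitive kernels keep precise math.
set(FAST_MATH_SRCS
  csrc/quant/per_token_group_quant_8bit.cu  # placeholder path
)
set_source_files_properties(${FAST_MATH_SRCS}
  PROPERTIES COMPILE_OPTIONS "--use_fast_math"
)
```

Setting COMPILE_OPTIONS as a source property keeps --use_fast_math off the fp4 kernels while still covering the quant kernels that tolerate it.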

@strgrb changed the title from "use fast math for sgl common ops." to "use fast math for per_token_group_quant_8bit." Aug 14, 2025
@zhyncs zhyncs self-assigned this Aug 14, 2025
@zhyncs zhyncs merged commit 1f9d65f into sgl-project:main Aug 15, 2025
83 of 87 checks passed
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
zhyncs added a commit to FlamingoPg/sglang that referenced this pull request Aug 27, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>

3 participants