use fast math for per_token_group_quant_8bit #9177
Conversation
Summary of Changes
Hello @strgrb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a performance optimization for sglang by enabling fast math operations for common CUDA kernels. It addresses an observed performance difference where CUDA 12.8 was slower than CUDA 12.4 for specific operations, a discrepancy traced back to default compiler flag differences. By explicitly adding the --use_fast_math flag to the common_ops compilation, this change aims to restore optimal performance on newer CUDA versions.
Highlights
- Performance Enhancement: Enables fast math optimizations for sglang's common CUDA operations to improve runtime performance.
- Compiler Configuration: Modifies sgl-kernel/CMakeLists.txt to include the --use_fast_math flag in the CUDA compile options for the common_ops target.
- Addressing CUDA Version Discrepancy: Resolves a performance regression where CUDA 12.8 showed slower execution than CUDA 12.4 for certain kernels due to differing default fast math settings.
Code Review
This pull request aims to improve performance by enabling fast math for common CUDA operations, which is a great initiative. The provided profiling data clearly shows the potential benefits. My review focuses on ensuring the CMake configuration is correct. I've found a potential issue in how the compiler flag is added and have suggested a fix. I've also noted a small opportunity for future code cleanup to make the CMake flags management more consistent.
If there is any problem with the fp4 kernels, I can make this option apply to only a subset of the files.
BTW, for that kernel I already have an optimization (not yet merged) in #7601.
I remember that enabling fast math on cu128 caused accuracy issues in some kernels.
IIRC, for kernels where fast math is acceptable, we may be able to call the NV intrinsics manually to speed them up.
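A sketch of that idea (the kernel below is hypothetical, not one of the sgl-kernel sources; `__fdividef` and `__expf` are the standard CUDA fast-math intrinsics): a kernel can opt into the fast hardware approximations explicitly, so the rest of the build can stay on precise math without `--use_fast_math`:

```cuda
// Hypothetical example: per-element scale + exp, calling the CUDA
// fast-math intrinsics directly instead of relying on --use_fast_math
// to rewrite / and expf for the whole compilation unit.
__global__ void scale_exp_fast(const float* __restrict__ in,
                               float* __restrict__ out,
                               float scale_max, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // __fdividef and __expf lower to the fast approximate instructions,
    // roughly what --use_fast_math would emit here anyway.
    float scaled = __fdividef(in[i], scale_max);
    out[i] = __expf(scaled);
  }
}
```

This keeps the precision trade-off visible and local to the kernels that can tolerate it.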
Maybe it's better to use a source list in CMakeLists.txt to decide which files get fast math?
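A minimal sketch of that approach (the file path is a placeholder, not the actual sgl-kernel layout): keep an explicit list of sources that are known-safe for fast math and attach the option per-file instead of target-wide:

```cmake
# Hypothetical sketch: opt individual .cu files into fast math.
set(SGL_FAST_MATH_SRCS
  csrc/per_token_group_quant_8bit.cu  # placeholder path
)
set_source_files_properties(${SGL_FAST_MATH_SRCS}
  PROPERTIES COMPILE_OPTIONS "--use_fast_math")
```

Kernels with known accuracy issues would simply stay off the list.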
This reverts commit c393f50.
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
…)" This reverts commit 1f9d65f.
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
Motivation
I found that sglang is faster with CUDA 12.4 than with CUDA 12.8 on Hopper, so I benchmarked the single kernel per_token_group_quant_8bit on both.

(profiling results for CUDA 12.4 and CUDA 12.8 omitted)

By checking the generated SASS code, I found that CUDA 12.4 uses `-ftz` or `--use_fast_math` by default, while CUDA 12.8 does not. For example:

(SASS comparison omitted)
Modifications
I added `--use_fast_math` to the target `common_ops` so that all common ops are built with fast math. I'll test e2e performance and accuracy for CUDA 12.8 with ds and paste the results here later.
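A minimal sketch of the change (the target name `common_ops` comes from the PR; the exact placement in sgl-kernel/CMakeLists.txt is assumed):

```cmake
# Build all common ops with fast math, matching the CUDA 12.4
# default behavior observed in the SASS.
target_compile_options(common_ops PRIVATE
  $<$<COMPILE_LANGUAGE:CUDA>:--use_fast_math>)
```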
Accuracy Tests
Benchmarking and Profiling
Checklist