[2/2] Optimizations and refactors about quant kernel #7601

fzyzcjy wants to merge 587 commits into sgl-project:main
Conversation
Summary of Changes
Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces optimizations and refactors to the quantization kernel, specifically focusing on per-token-group quantization for FP8 and INT8 data types. The changes include performance improvements, code unification, and the addition of a new scaling flag, enhancing the flexibility and efficiency of the quantization process.
Highlights
- Optimization: Optimized the per-token-group quantization kernel for both FP8 and INT8 data types.
- Refactor: Refactored the code to unify INT8 and FP8 quantization logic where possible, improving code maintainability.
- New Feature: Added support for a new `scale_ue8m0` flag to control the scaling behavior during quantization.
- Configuration: Introduced `PER_TOKEN_GROUP_QUANT_8BIT_VALID_FLAGS` to define valid configurations for quantization parameters.
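For reviewers who want a quick mental model of the operation being optimized, here is a minimal reference sketch of per-token-group 8-bit quantization. This is not the fused kernel from this PR; the helper name, the reshape-based formulation, and the `qmax` constant (448 for FP8 E4M3, 127 for INT8) are illustrative assumptions only.

```python
# Hypothetical reference (not the CUDA kernel in this PR): one scale per contiguous
# group of `group_size` elements along the last dimension of `x`.
from typing import Tuple

import torch


def per_token_group_quant_ref(
    x: torch.Tensor,
    group_size: int,
    eps: float = 1e-10,
    qmax: float = 448.0,  # assumed FP8 E4M3 max; 127.0 would be the INT8 analogue
) -> Tuple[torch.Tensor, torch.Tensor]:
    assert x.shape[-1] % group_size == 0
    groups = x.reshape(*x.shape[:-1], -1, group_size)      # [..., num_groups, group_size]
    amax = groups.abs().amax(dim=-1, keepdim=True).clamp(min=eps)
    scale = amax / qmax                                     # one scale per group
    q = (groups / scale).clamp(-qmax, qmax)                 # the real kernel casts to fp8/int8 here
    return q.reshape_as(x), scale.squeeze(-1)
```

The kernel touched by this PR fuses these steps on the GPU and additionally controls how the per-group scales are laid out and encoded via `column_major_scales`, `scale_tma_aligned`, and the new `scale_ue8m0` flag.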
def per_token_group_quant_8bit(
    x: torch.Tensor,
    group_size: int,
    dst_dtype: torch.dtype,
    eps: float = 1e-10,
    column_major_scales: bool = False,
    scale_tma_aligned: bool = False,
    scale_ue8m0: bool = False,
):
    assert (
        x.shape[-1] % group_size == 0
    ), "the last dimension of `x` must be divisible by `group_size`"
    assert x.is_contiguous(), "`x` is not contiguous"
) -> Tuple[torch.Tensor, torch.Tensor]:
    from sglang.srt.layers.quantization.int8_kernel import per_token_group_quant_int8

    if dst_dtype == torch.int8:
        assert not column_major_scales
        assert not scale_tma_aligned
        assert not scale_ue8m0
        return per_token_group_quant_int8(
            x=x,
            group_size=group_size,
            eps=eps,
            dtype=dst_dtype,
        )

    x_q = torch.empty_like(x, device=x.device, dtype=fp8_dtype)
    return per_token_group_quant_fp8(
        x=x,
        group_size=group_size,
        eps=eps,
        column_major_scales=column_major_scales,
The per_token_group_quant_8bit function is missing the scale_ue8m0 parameter in its signature, but it's called with scale_ue8m0 as a keyword argument in the tests. This will cause a TypeError at runtime. Additionally, scale_ue8m0 should be asserted False for int8 quantization, as it is not supported.
def per_token_group_quant_8bit(
    x: torch.Tensor,
    group_size: int,
    dst_dtype: torch.dtype,
    eps: float = 1e-10,
    column_major_scales: bool = False,
    scale_tma_aligned: bool = False,
    scale_ue8m0: bool = False,
) -> Tuple[torch.Tensor, torch.Tensor]:
    from sglang.srt.layers.quantization.int8_kernel import per_token_group_quant_int8

    if dst_dtype == torch.int8:
        assert not column_major_scales
        assert not scale_tma_aligned
        assert not scale_ue8m0, "scale_ue8m0 is not supported for int8 quantization"
        return per_token_group_quant_int8(
            x=x,
            group_size=group_size,
            eps=eps,
            dtype=dst_dtype,
        )
    return per_token_group_quant_fp8(
        x=x,
        group_size=group_size,
        eps=eps,
        column_major_scales=column_major_scales,
        scale_tma_aligned=scale_tma_aligned,
        scale_ue8m0=scale_ue8m0,
    )
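As a quick illustration of how the corrected signature would be called: the snippet below is a hedged sketch, and the dtype, group size, and flag combination are assumptions for illustration rather than values taken from this PR's tests (valid combinations are presumably what `PER_TOKEN_GROUP_QUANT_8BIT_VALID_FLAGS` encodes).

```python
# Hypothetical call sites; dtype/flag choices are illustrative assumptions.
import torch

x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)

# INT8 path: the scale-layout flags must remain at their False defaults (asserted above).
q_int8, s_int8 = per_token_group_quant_8bit(x, group_size=128, dst_dtype=torch.int8)

# FP8 path: scale-layout flags such as column_major_scales / scale_ue8m0 only apply here.
q_fp8, s_fp8 = per_token_group_quant_8bit(
    x,
    group_size=128,
    dst_dtype=torch.float8_e4m3fn,
    column_major_scales=True,
    scale_ue8m0=True,
)
```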
    scale_tma_aligned: bool,
    scale_ue8m0: bool,
):
    if scale_ue8m0:
        # ... (lines elided in the review excerpt)
            dtype=torch.int,
        ).transpose(0, 1)[:x_s_mn, :]
    elif column_major_scales:
        if scale_tma_aligned:
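For context on the new flag: UE8M0 is an exponent-only 8-bit format (8 exponent bits, no mantissa), so a scale stored this way can only be a power of two. The sketch below shows the round-up-to-power-of-two step that such an encoding implies, assuming the usual ceiling convention; the int32 packing and transposed scale layout visible in the excerpt above are intentionally not reproduced.

```python
# Hedged sketch: constrain per-group scales to powers of two, as an exponent-only
# (UE8M0-style) scale encoding requires. The exact bit packing used by the kernel
# in this PR is omitted; only the rounding idea is shown.
from typing import Tuple

import torch


def round_scale_up_to_pow2(scale: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    exponent = torch.ceil(torch.log2(scale))     # smallest e with 2**e >= scale
    pow2_scale = torch.exp2(exponent)            # the power-of-two scale actually applied
    return pow2_scale, exponent.to(torch.int32)  # the exponent is what gets stored

scales = torch.tensor([0.003, 0.02, 0.75])
pow2, exp = round_scale_up_to_pow2(scales)
# pow2 ≈ [0.0039, 0.0312, 1.0], exp = [-8, -5, 0]
```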
The branch was force-pushed from dd7eb08 to 531478f.
# Conflicts:
#	python/sglang/srt/bench_utils.py
#	python/sglang/srt/layers/quantization/fp8_kernel.py
#	sgl-kernel/benchmark/bench_per_token_group_quant_8bit.py
#	sgl-kernel/csrc/gemm/per_token_group_quant_8bit.cu
#	sgl-kernel/tests/test_per_token_group_quant_8bit.py
FYI: this currently depends on a new sgl-kernel release.
EDIT: kernel part is in #9534
EDIT: the code is ready, shows a large speedup, and is being reviewed. I marked the PR as a draft because I do not want frequent code pushes to trigger all CIs.
EDIT: 9344 -> 9945, a 6.4% end-to-end improvement.
Motivation
Modifications
Checklist