gpt-oss decode performance optimization by kkHuang-amd · Pull Request #20392 · sgl-project/sglang

kkHuang-amd · 2026-03-12T00:07:01Z

Motivation

Improve decode performance when running the gpt-oss model.

Modifications

This PR has three parts:

Linear operation optimization: replace the naive a16w16 GEMM with a Triton kernel.
Fuse the elementwise kernels that save the KV cache into a single Triton kernel.
Replace the Triton decode attention kernels with the unified_attention Triton kernel.

Accuracy Tests

Server command
SGLANG_USE_AITER=1 python3 -m sglang.launch_server --model openai/gpt-oss-120b/ --tp 8 --chunked-prefill-size 131072 --max-running-requests 128 --mem-fraction-static 0.85 --disable-radix-cache --page-size 64

Client command
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000

Result
Accuracy: 0.851
Invalid: 0.014
Latency: 47.111 s
Output throughput: 9039.830 token/s

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-03-12T00:08:28Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

HaiShaw

There is a large amount of duplicated code; I suggest refactoring it soon in a follow-up PR.
@kkHuang-amd

HaiShaw · 2026-03-18T06:56:26Z

python/sglang/srt/layers/attention/utils.py

        return concat_mla_absorb_q(q_nope, q_rope)
    else:
        return torch.cat([q_nope, q_rope], dim=-1)
+


Please add a kernel description and comments to the newly introduced functions.

Two fixes for aiter backend failures surfaced by PR #20392:

1. aiter_backend.py: Cap max_num_partitions by min(max_context_len, max_total_num_tokens). The workspace buffer was sized for the model's theoretical max context (e.g. 131K = 512 partitions = 16 GiB) when the KV cache only held 25K tokens (100 partitions = 3 GiB), causing OOM on memory-constrained CI GPUs.
2. unquant.py: Add aiter tgemm.mm fast path for unquantized linear ops, guarded by type(layer.weight.data) is torch.Tensor. Torchao-quantized weights (AffineQuantizedTensor) fail the strict type() check and fall through to F.linear, preventing NotImplementedError on gemm_a16w16.
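The two fixes described above can be illustrated with a minimal, dependency-free sketch. This is not the code from the PR. The partition size of 256 tokens is an assumption inferred from the quoted figures (131072 tokens giving 512 partitions), and `capped_num_partitions`, `pick_linear_path`, and the two stand-in classes are hypothetical names used only for illustration.

```python
import math

# Assumption: 256 tokens per partition, inferred from the figures quoted above
# (131072 tokens -> 512 partitions). The real constant may differ.
PARTITION_SIZE = 256


def capped_num_partitions(max_context_len: int,
                          max_total_num_tokens: int,
                          partition_size: int = PARTITION_SIZE) -> int:
    """Size the workspace by what the KV cache can hold, not the model's max context."""
    effective_len = min(max_context_len, max_total_num_tokens)
    return math.ceil(effective_len / partition_size)


class Tensor:
    """Hypothetical stand-in for torch.Tensor, so the sketch runs without torch."""


class AffineQuantizedTensor(Tensor):
    """Hypothetical stand-in for a torchao tensor subclass."""


def pick_linear_path(weight) -> str:
    """Return which path a weight would take under a strict type() guard."""
    # type(...) is Tensor rejects subclasses, whereas isinstance() would accept them.
    if type(weight) is Tensor:
        return "tgemm"
    return "F.linear"
```

Assuming "25K tokens" means 25600, `capped_num_partitions(131072, 25600)` gives 100 partitions instead of 512, matching the figures in the description. The guard sends a plain tensor to the tuned GEMM path and a subclass instance to the fallback, even though `isinstance` would treat both as tensors.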

Co-authored-by: wunhuang <wunhuang@amd.com>

wunhuang added 9 commits March 3, 2026 05:17

Support sliding window attention in aiter for decode stage (eager mode)

b3fb546

Add unified_attention to support sliding window decode attention

b09d385

Accuracy pass with cuda-graph enable

6617348

remove testing code

45ddd06

some minor changes

cb7b2cb

Fix the page size > 1 accuracy issue for unified_attention

ec5a604

fix max_kv_len issue

123f574

gemm opt

bfb406d

fused some elementwise kernel for kv cache store

f62773a

github-actions bot added the quant LLM Quantization label Mar 12, 2026

Merge branch 'main' into gpt-oss-aiter-decode

0e420f2

kkHuang-amd added amd run-ci labels Mar 12, 2026

root added 5 commits March 13, 2026 07:35

Fused kv_indices convert swa_kv_indices elementwise kernels

bf21168

Add reshape_and_cache_flash triton kernel into sglang attention utils.py

04799c8

Refactor kv cache code

fd30389

Code formating

9e3eba7

Refactor code

31fe2ce

kkHuang-amd marked this pull request as ready for review March 16, 2026 05:48

kkHuang-amd requested review from AniZpZ, BBuf, Edwardf0t1, FlamingoPg, Fridge003, HaiShaw, b8zhong, ch-wan, ispobock and merrymercy as code owners March 16, 2026 05:48

kkHuang-amd requested review from Qiaolin-Yu and hebiao064 as code owners March 16, 2026 05:48

Merge branch 'main' into gpt-oss-aiter-decode

15a3ac2

kkHuang-amd changed the title ~~Gpt oss aiter decode~~ Gpt oss decode performance optimization Mar 16, 2026

kkHuang-amd changed the title ~~Gpt oss decode performance optimization~~ gpt-oss decode performance optimization Mar 16, 2026

root and others added 3 commits March 16, 2026 06:55

Use tgemm to select the best gemm solution

04dcab2

fix AttributeError: 'AiterAttnBackend' object has no attribute 'use_triton_unified_attention'

583f01f

Merge branch 'main' into gpt-oss-aiter-decode

ea70299

HaiShaw reviewed Mar 18, 2026

Add kernel description and comment to new functions introduced.

1355ecb

HaiShaw approved these changes Mar 18, 2026

This was referenced Mar 18, 2026

fix(aiter): cap workspace buffer partitions by KV cache capacity to prevent OOM #20888

Closed

fix(aiter): use tuned GEMM for unquantized linear with torchao compatibility guard #20889

Closed

michaelzhang-ai mentioned this pull request Mar 18, 2026

[AMD] fix CI: workspace buffer OOM and tuned GEMM torchao compatibility #20890

Closed

1 task

Fix CI errors

a580515

HaiShaw merged commit 126cd5c into sgl-project:main Mar 19, 2026
86 of 103 checks passed

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

gpt-oss decode performance optimization (sgl-project#20392)

d626fcb

Co-authored-by: wunhuang <wunhuang@amd.com>

0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026

gpt-oss decode performance optimization (sgl-project#20392)

216dc42

Co-authored-by: wunhuang <wunhuang@amd.com>

dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026

gpt-oss decode performance optimization (sgl-project#20392)

6b66fcd

Co-authored-by: wunhuang <wunhuang@amd.com>

HaiShaw merged 21 commits into sgl-project:main from HaiShaw:gpt-oss-aiter-decode

2 participants