
gpt-oss decode performance optimization#20392

Merged
HaiShaw merged 21 commits into sgl-project:main from HaiShaw:gpt-oss-aiter-decode
Mar 19, 2026

Conversation

@kkHuang-amd
Collaborator

@kkHuang-amd kkHuang-amd commented Mar 12, 2026

Motivation

Improve decode performance for gpt-oss model runs.

Modifications

This PR has three parts:

  1. Optimize linear operations by replacing the naive a16w16 GEMM with a Triton kernel.
  2. Fuse the elementwise save-KV kernels into a single Triton kernel.
  3. Replace the Triton decode attention kernels with the unified_attention Triton kernel.
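
The save-KV fusion in item 2 can be modeled roughly as below. This is an illustrative pure-Python sketch of the idea, not the PR's actual Triton kernel, and the names (`fused_save_kv`, `kv_cache`, `slot_ids`) are assumptions for illustration:

```python
# Illustrative sketch only -- the PR implements this step as one Triton
# kernel. Names (fused_save_kv, kv_cache, slot_ids) are hypothetical.
def fused_save_kv(kv_cache, k, v, slot_ids):
    """Write new K and V vectors into their cache slots in a single pass,
    instead of launching a separate elementwise kernel for each store."""
    for i, slot in enumerate(slot_ids):
        kv_cache[slot][0] = list(k[i])  # K plane of this cache slot
        kv_cache[slot][1] = list(v[i])  # V plane of this cache slot
    return kv_cache
```

The point of the fusion is launch-overhead reduction: one kernel touches both K and V for all tokens in the batch, rather than several small elementwise launches per decode step.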

Accuracy Tests

Server command
SGLANG_USE_AITER=1 python3 -m sglang.launch_server --model openai/gpt-oss-120b/ --tp 8 --chunked-prefill-size 131072 --max-running-requests 128 --mem-fraction-static 0.85 --disable-radix-cache --page-size 64

Client command
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000

Results: Accuracy: 0.851, Invalid: 0.014, Latency: 47.111 s, Output throughput: 9039.830 token/s

Benchmarking and Profiling

[image: benchmark results]

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the quant LLM Quantization label Mar 12, 2026
@kkHuang-amd kkHuang-amd changed the title Gpt oss aiter decode Gpt oss decode performance optimization Mar 16, 2026
@kkHuang-amd kkHuang-amd changed the title Gpt oss decode performance optimization gpt-oss decode performance optimization Mar 16, 2026
Collaborator

@HaiShaw HaiShaw left a comment


There is a large amount of duplicated code; I suggest refactoring it soon in a follow-up PR.
@kkHuang-amd

    return concat_mla_absorb_q(q_nope, q_rope)
else:
    return torch.cat([q_nope, q_rope], dim=-1)


Please add kernel descriptions and comments to the newly introduced functions.

michaelzhang-ai added a commit that referenced this pull request Mar 18, 2026
Two fixes for aiter backend failures surfaced by PR #20392:

1. aiter_backend.py: Cap max_num_partitions by min(max_context_len,
   max_total_num_tokens). The workspace buffer was sized for the model's
   theoretical max context (e.g. 131K = 512 partitions = 16 GiB) when
   the KV cache only held 25K tokens (100 partitions = 3 GiB), causing
   OOM on memory-constrained CI GPUs.

2. unquant.py: Add aiter tgemm.mm fast path for unquantized linear ops,
   guarded by type(layer.weight.data) is torch.Tensor. Torchao-quantized
   weights (AffineQuantizedTensor) fail the strict type() check and fall
   through to F.linear, preventing NotImplementedError on gemm_a16w16.
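
A hedged sketch of the two fixes described in that commit follows. This is not the exact SGLang code; the constant and function names (`PARTITION_SIZE`, `capped_max_num_partitions`, `takes_aiter_fast_path`) are assumptions, and the stand-in classes below only model the `type()` vs `isinstance()` distinction:

```python
import math

# Assumption: 256 tokens per attention partition, consistent with the
# numbers in the commit message (131072 / 256 = 512 partitions).
PARTITION_SIZE = 256

def capped_max_num_partitions(max_context_len, max_total_num_tokens):
    # Fix 1: size the workspace for the tokens the KV cache can actually
    # hold, not the model's theoretical maximum context length.
    effective = min(max_context_len, max_total_num_tokens)
    return math.ceil(effective / PARTITION_SIZE)

class Tensor:  # stand-in for torch.Tensor
    pass

class AffineQuantizedTensor(Tensor):  # stand-in for torchao's subclass
    pass

def takes_aiter_fast_path(weight):
    # Fix 2: a strict type() identity check, deliberately not
    # isinstance(), so quantized tensor subclasses fall through to the
    # generic F.linear path instead of hitting gemm_a16w16.
    return type(weight) is Tensor
```

With a 25,600-token KV cache this caps the workspace at 100 partitions instead of 512, matching the ~3 GiB vs ~16 GiB figures in the commit message.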
@HaiShaw HaiShaw merged commit 126cd5c into sgl-project:main Mar 19, 2026
86 of 103 checks passed
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Co-authored-by: wunhuang <wunhuang@amd.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
Co-authored-by: wunhuang <wunhuang@amd.com>
dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026
Co-authored-by: wunhuang <wunhuang@amd.com>

Labels

amd quant LLM Quantization run-ci
