[Perf] Support Flashinfer RoPE+Quant+KV update kernel for trtllm_mha backend for GPT-OSS by elvischenv · Pull Request #15729 · sgl-project/sglang

elvischenv · 2025-12-24T07:24:59Z

Motivation

This PR is to support Flashinfer rope_quantize_fp8_append_paged_kv_cache kernel for trtllm_mha backend and enable it on GPT-OSS.

Depends on Flashinfer 0.6.0: #15551

Accuracy

PR

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20251223_191831', 'metric': 0.9166666666666666}]

main

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20251223_193331', 'metric': 0.9083333333333333}]

Perf (GPT-OSS-120b TP8 con8)

PR: 5.5% perf gain

Median TPOT (ms):                        2.91

main

Median TPOT (ms):                        3.07

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

gemini-code-assist · 2025-12-24T07:25:07Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

nvpohanh · 2026-03-12T11:56:30Z

This can be reviewed together with #19451 . They are very similar, except that one is for trtllm_mha and one is for trtllm_mla

Fridge003 · 2026-03-20T21:46:39Z

For the accuracy results, which model are you testing on?
Can you please post accuracy results for MTP, to make sure its acceptance length doesn't drop

Fridge003 · 2026-03-20T21:42:30Z

python/sglang/srt/layers/attention/base_attn_backend.py

        return None
+
+    def support_rope_fusion(self) -> bool:
+        """Check if the current backend supports RoPE fusion."""


Instead of adding this method in base class, can we control this fusion with an environ flag?
Now it is set to False by default. After this feature stabilizes it can be turned on by default

github-actions bot added the blackwell SM100/SM120 label Dec 24, 2025

nvpohanh mentioned this pull request Dec 24, 2025

[Tracking] GPT-OSS B200/GB200 performance optimization tracker #15243

Open

11 tasks

elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 5e5c50f to 7cc00cb Compare February 7, 2026 12:42

elvischenv marked this pull request as ready for review February 7, 2026 12:43

elvischenv requested review from Fridge003, Qiaolin-Yu, hebiao064, ispobock and merrymercy as code owners February 7, 2026 12:43

ispobock assigned Qiaolin-Yu Feb 18, 2026

elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 7cc00cb to 191dcf2 Compare February 24, 2026 04:45

elvischenv requested a review from HaiShaw as a code owner February 24, 2026 04:45

elvischenv added 3 commits February 25, 2026 19:33

clean up trtllm mha backend and use NHD kv layout

8e942ab

support Flashinfer rope_quantize_fp8 + append_paged_kv_cache for GPT-OSS

91e1a1b

support Flashinfer rope_quantize_fp8_append_paged_kv_cache

e59267f

elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 191dcf2 to e59267f Compare February 26, 2026 03:34

nvpohanh mentioned this pull request Mar 12, 2026

Integrate fused flashinfer rope_quantize_fp8_append_paged_kv_cache kernel #19451

Open

5 tasks

Fridge003 reviewed Mar 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Perf] Support Flashinfer RoPE+Quant+KV update kernel for trtllm_mha backend for GPT-OSS#15729

[Perf] Support Flashinfer RoPE+Quant+KV update kernel for trtllm_mha backend for GPT-OSS#15729
elvischenv wants to merge 3 commits intosgl-project:mainfrom
elvischenv:elvischenv/gpt-oss_rope_quant_kv

elvischenv commented Dec 24, 2025

Uh oh!

gemini-code-assist bot commented Dec 24, 2025

Uh oh!

nvpohanh commented Mar 12, 2026

Uh oh!

Fridge003 commented Mar 20, 2026

Uh oh!

Fridge003 Mar 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

elvischenv commented Dec 24, 2025

Motivation

Accuracy

Perf (GPT-OSS-120b TP8 con8)

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Uh oh!

gemini-code-assist bot commented Dec 24, 2025

Uh oh!

nvpohanh commented Mar 12, 2026

Uh oh!

Fridge003 commented Mar 20, 2026

Uh oh!

Fridge003 Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants