[Perf] Support Flashinfer RoPE+Quant+KV update kernel for trtllm_mha backend for GPT-OSS#15729

Open
elvischenv wants to merge 3 commits into sgl-project:main from elvischenv:elvischenv/gpt-oss_rope_quant_kv
Conversation

@elvischenv
Contributor

Motivation

This PR adds support for the Flashinfer rope_quantize_fp8_append_paged_kv_cache kernel in the trtllm_mha backend and enables it for GPT-OSS.

Depends on Flashinfer 0.6.0: #15551
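The PR's exact kernel invocation is not shown in this excerpt, so the following is only a pure-Python reference sketch of the three operations the fused kernel combines: applying rotary position embedding (RoPE) to a key vector, quantizing it to an (emulated) FP8 E4M3 range, and appending it into a paged KV cache. All names here (apply_rope, rope_quantize_append, etc.) are illustrative, not Flashinfer's API.

```python
import math


def apply_rope(x, pos, theta=10000.0):
    # Rotate consecutive pairs (x[2i], x[2i+1]) by a position-dependent angle.
    d = len(x)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)
        ang = pos * freq
        c, s = math.cos(ang), math.sin(ang)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def quantize_fp8_e4m3(x, scale):
    # Crude FP8 E4M3 emulation: scale, then clamp to the E4M3 max (+/-448).
    return [max(-448.0, min(448.0, v / scale)) for v in x]


def rope_quantize_append(k, pos, scale, paged_cache, page_size, page_table):
    # Fused reference path: RoPE -> FP8 quantize -> append to paged KV cache.
    # The real kernel does all three in one pass without materializing
    # intermediate tensors; this sketch just shows the dataflow.
    rotated = apply_rope(k, pos)
    q = quantize_fp8_e4m3(rotated, scale)
    page_idx, slot = divmod(pos, page_size)
    page_id = page_table[page_idx]
    paged_cache.setdefault(page_id, {})[slot] = q
    return q
```

At position 0 the rotation angle is zero, so RoPE is the identity, which makes the reference easy to sanity-check by hand.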

Accuracy

PR

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20251223_191831', 'metric': 0.9166666666666666}]

main

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20251223_193331', 'metric': 0.9083333333333333}]

Perf (GPT-OSS-120b TP8 con8)

PR: Median TPOT (ms): 2.91 (5.5% perf gain)

main: Median TPOT (ms): 3.07
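As a quick check on the reported number, the gain can be derived from the two median TPOT values above (lower TPOT is better, so the gain is the ratio of main's latency to the PR's, minus one):

```python
# Median time-per-output-token (ms), from the benchmark above.
main_tpot = 3.07
pr_tpot = 2.91

# Relative decode-throughput gain from the latency reduction.
speedup = main_tpot / pr_tpot - 1
print(f"{speedup * 100:.1f}% perf gain")  # matches the reported 5.5%
```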

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@github-actions github-actions bot added the blackwell SM100/SM120 label on Dec 24, 2025
@elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 5e5c50f to 7cc00cb on February 7, 2026
@elvischenv marked this pull request as ready for review on February 7, 2026
@elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 7cc00cb to 191dcf2 on February 24, 2026
@elvischenv requested a review from HaiShaw as a code owner on February 24, 2026
@elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 191dcf2 to e59267f on February 26, 2026
@nvpohanh
Collaborator

This can be reviewed together with #19451. They are very similar, except that one is for trtllm_mha and the other is for trtllm_mla.

@Fridge003
Collaborator

For the accuracy results, which model are you testing on?
Can you please also post accuracy results for MTP, to make sure its acceptance length doesn't drop?

return None

def support_rope_fusion(self) -> bool:
"""Check if the current backend supports RoPE fusion."""

Instead of adding this method to the base class, can we control this fusion with an environment flag?
For now it is set to False by default; after this feature stabilizes, it can be turned on by default.
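The reviewer's suggestion could be sketched as follows. Note the flag name SGLANG_ENABLE_ROPE_FUSION is purely hypothetical (SGLang's actual environment variables are not shown in this thread); the point is only the off-by-default gating pattern:

```python
import os


def rope_fusion_enabled() -> bool:
    """Gate the RoPE+Quant+KV fusion behind an env flag, off by default.

    SGLANG_ENABLE_ROPE_FUSION is an illustrative name, not SGLang's real
    flag. Once the feature stabilizes, the default could flip to "1".
    """
    return os.environ.get("SGLANG_ENABLE_ROPE_FUSION", "0").lower() in ("1", "true")
```

A backend would then consult this flag instead of (or in addition to) a per-backend support_rope_fusion() override.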

Labels

blackwell SM100/SM120

4 participants