Support Flashinfer rope+quant+cache update fusion kernel for TRTLLM attention #36858

Open
elvischenv wants to merge 3 commits into vllm-project:main from elvischenv:elvischenv/flashinfer-rope-quant-cache-fusion

Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Mar 12, 2026

Purpose

Support the Flashinfer RoPE+Quant+KV Cache Update fusion kernel `rope_quantize_fp8_append_paged_kv_cache`.

Depends on flashinfer-ai/flashinfer#2792, which fixed the padding-token issue in the kernel when running with full CUDA graphs.
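Conceptually, the fusion collapses three per-token passes (RoPE, FP8 quantization, paged KV-cache append) into a single kernel, avoiding the intermediate tensor reads and writes. The sketch below is a pure-Python illustration of those three logical steps, not the real kernel: it uses a single rotary frequency, emulates FP8 E4M3 as a scaled clamp, and uses a toy page table. All names here are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3


def apply_rope_pair(x0, x1, pos):
    # Rotary embedding on one (even, odd) feature pair.
    # Real kernels rotate many pairs, each with its own frequency.
    c, s = math.cos(pos), math.sin(pos)
    return x0 * c - x1 * s, x0 * s + x1 * c


def quantize_fp8(x, scale):
    # FP8 quantization emulated as a scaled clamp; the real kernel
    # casts to an FP8 E4M3 storage format.
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))


def rope_quantize_append(k_pairs, positions, scale,
                         paged_cache, page_table, block_size):
    # Fused loop: one read of K, one write into the paged cache,
    # instead of three separate kernel launches with intermediates.
    for (x0, x1), pos in zip(k_pairs, positions):
        r0, r1 = apply_rope_pair(x0, x1, pos)
        q = (quantize_fp8(r0, scale), quantize_fp8(r1, scale))
        page = page_table[pos // block_size]
        paged_cache[page][pos % block_size] = q
```

The unfused path would materialize the rotated keys and the quantized keys as separate tensors before the cache write; the fusion pass in this PR pattern-matches that sequence and replaces it with the single fused call.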

Test Plan & Test Results

Fusion pass unit test

pytest -v -s tests/compile/passes/test_rope_kvcache_fusion.py::test_rope_quant_kvcache_fusion

===== 24 passed, 41 warnings in 93.37s (0:01:33) ========

Model e2e accuracy

Server cmd:

VLLM_USE_FLASHINFER_ROPE=1 \
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
vllm serve \
openai/gpt-oss-120b \
--tensor-parallel-size 8 \
-cc.use_inductor_graph_partition=True \
-cc.pass_config.fuse_allreduce_rms=True \
-cc.pass_config.eliminate_noops=True \
-cc.pass_config.fuse_rope_kvcache=True \
--async-scheduling \
--no-enable-prefix-caching \
--kv-cache-dtype fp8 \
--stream-interval 20 \
--max-num-seqs 1024 \
--max-model-len 131072 \
--max-num-batched-tokens 8192 \
--max-cudagraph-capture-size 2048
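The PR does not include the client command used to produce the tables below. A plausible reconstruction from the reported numbers (80 requests, 1024 input and 1024 output tokens each, concurrency 8) using vLLM's built-in benchmark CLI might look like this; the dataset choice and flag values are assumptions, not taken from the PR:

```shell
# Hypothetical benchmark client; values inferred from the result tables
# (80 requests, 81920/80 = 1024 tokens in and out, max concurrency 8).
vllm bench serve \
  --model openai/gpt-oss-120b \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 80 \
  --max-concurrency 8
```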

Fused:

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20260315_204508', 'metric': 0.7922979797979798}]

Unfused:

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20260315_210654', 'metric': 0.7891414141414141}]

Model e2e perf

Fused: about a 5% performance gain for GPT-OSS-120b at TP8, concurrency 8

============ Serving Benchmark Result ============
Successful requests:                     80
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  29.01
Total input tokens:                      81920
Total generated tokens:                  81920
Request throughput (req/s):              2.76
Output token throughput (tok/s):         2824.22
Peak output token throughput (tok/s):    152.00
Peak concurrent requests:                16.00
Total token throughput (tok/s):          5648.44
---------------Time to First Token----------------
Mean TTFT (ms):                          53.85
Median TTFT (ms):                        55.84
P99 TTFT (ms):                           86.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.78
Median TPOT (ms):                        2.78
P99 TPOT (ms):                           2.84
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.73
Median ITL (ms):                         55.41
P99 ITL (ms):                            57.44
==================================================

Unfused:

============ Serving Benchmark Result ============
Successful requests:                     80
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  30.50
Total input tokens:                      81920
Total generated tokens:                  81920
Request throughput (req/s):              2.62
Output token throughput (tok/s):         2686.20
Peak output token throughput (tok/s):    145.00
Peak concurrent requests:                16.00
Total token throughput (tok/s):          5372.41
---------------Time to First Token----------------
Mean TTFT (ms):                          58.81
Median TTFT (ms):                        63.80
P99 TTFT (ms):                           85.12
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.92
Median TPOT (ms):                        2.92
P99 TPOT (ms):                           2.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.49
Median ITL (ms):                         58.44
P99 ITL (ms):                            59.91
==================================================
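The ~5% figure can be checked directly from the two tables (numbers copied from the reports above):

```python
# Output token throughput (tok/s) and mean TPOT (ms), fused vs. unfused runs.
fused_tok_s, unfused_tok_s = 2824.22, 2686.20
fused_tpot, unfused_tpot = 2.78, 2.92

throughput_gain = fused_tok_s / unfused_tok_s - 1  # ~5.1% more tok/s
tpot_reduction = 1 - fused_tpot / unfused_tpot     # ~4.8% lower mean TPOT
print(f"{throughput_gain:.1%} higher output throughput, "
      f"{tpot_reduction:.1%} lower mean TPOT")
```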

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added nvidia rocm Related to AMD ROCm v1 labels Mar 12, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 12, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces support for Flashinfer's fused RoPE, quantization, and KV cache update kernel, which is a great performance optimization for FP8 models on CUDA. The changes are well-structured, adding a new RopeQuantReshapeKVCachePattern to handle the fusion and updating related components to support it.

However, I've found a critical issue in vllm/v1/attention/backends/flashinfer.py where a check for KV cache sharing was removed, which could lead to incorrect behavior for models that use this feature. Please see my comment for details.

@mergify

mergify bot commented Mar 16, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @elvischenv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 16, 2026
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
@elvischenv elvischenv force-pushed the elvischenv/flashinfer-rope-quant-cache-fusion branch from cb4d5e7 to dd6afc1 Compare March 16, 2026 09:33
@mergify mergify bot removed the needs-rebase label Mar 16, 2026

Labels

gpt-oss Related to GPT-OSS models nvidia rocm Related to AMD ROCm v1

Projects

Status: Todo
Status: No status
Status: To Triage

Development

1 participant