Support Flashinfer rope+quant+cache update fusion kernel for TRTLLM attention #36858

Open
elvischenv wants to merge 3 commits into vllm-project:main from elvischenv:elvischenv/flashinfer-rope-quant-cache-fusion

Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Mar 12, 2026

Purpose

Support the Flashinfer RoPE+Quant+KV Cache Update fusion kernel `rope_quantize_fp8_append_paged_kv_cache`.

Depends on flashinfer-ai/flashinfer#2792, which fixed the padding-token issue in the kernel when running with full CUDA graphs.
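Conceptually, the fusion collapses three per-token passes (RoPE, FP8 quantization, paged KV-cache append) into a single kernel, avoiding the intermediate tensor reads and writes. The sketch below is a pure-Python illustration of those three logical steps, not the real kernel: it uses a single rotary frequency, emulates FP8 E4M3 as a scaled clamp, and uses a toy page table. All names here are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3


def apply_rope_pair(x0, x1, pos):
    # Rotary embedding on one (even, odd) feature pair.
    # Real kernels rotate many pairs, each with its own frequency.
    c, s = math.cos(pos), math.sin(pos)
    return x0 * c - x1 * s, x0 * s + x1 * c


def quantize_fp8(x, scale):
    # FP8 quantization emulated as a scaled clamp; the real kernel
    # casts to an FP8 E4M3 storage format.
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))


def rope_quantize_append(k_pairs, positions, scale,
                         paged_cache, page_table, block_size):
    # Fused loop: one read of K, one write into the paged cache,
    # instead of three separate kernel launches with intermediates.
    for (x0, x1), pos in zip(k_pairs, positions):
        r0, r1 = apply_rope_pair(x0, x1, pos)
        q = (quantize_fp8(r0, scale), quantize_fp8(r1, scale))
        page = page_table[pos // block_size]
        paged_cache[page][pos % block_size] = q
```

The unfused path would materialize the rotated keys and the quantized keys as separate tensors before the cache write; the fusion pass in this PR pattern-matches that sequence and replaces it with the single fused call.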

Test Plan & Test Results

Fusion pass unit test

pytest -v -s tests/compile/passes/test_rope_kvcache_fusion.py::test_rope_quant_kvcache_fusion

===== 24 passed, 41 warnings in 93.37s (0:01:33) ========

Model e2e accuracy

Server cmd:

VLLM_USE_FLASHINFER_ROPE=1 \
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
vllm serve \
openai/gpt-oss-120b \
--tensor-parallel-size 8 \
-cc.use_inductor_graph_partition=True \
-cc.pass_config.fuse_allreduce_rms=True \
-cc.pass_config.eliminate_noops=True \
-cc.pass_config.fuse_rope_kvcache=True \
--async-scheduling \
--no-enable-prefix-caching \
--kv-cache-dtype fp8 \
--stream-interval 20 \
--max-num-seqs 1024 \
--max-model-len 131072 \
--max-num-batched-tokens 8192 \
--max-cudagraph-capture-size 2048
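The PR does not include the client command used to produce the tables below. A plausible reconstruction from the reported numbers (80 requests, 1024 input and 1024 output tokens each, concurrency 8) using vLLM's built-in benchmark CLI might look like this; the dataset choice and flag values are assumptions, not taken from the PR:

```shell
# Hypothetical benchmark client; values inferred from the result tables
# (80 requests, 81920/80 = 1024 tokens in and out, max concurrency 8).
vllm bench serve \
  --model openai/gpt-oss-120b \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 80 \
  --max-concurrency 8
```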

Fused:

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20260315_204508', 'metric': 0.7922979797979798}]

Unfused:

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20260315_210654', 'metric': 0.7891414141414141}]

Model e2e perf

Fused: about a 5% performance gain for GPT-OSS-120b at TP8, concurrency 8

============ Serving Benchmark Result ============
Successful requests:                     80
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  29.01
Total input tokens:                      81920
Total generated tokens:                  81920
Request throughput (req/s):              2.76
Output token throughput (tok/s):         2824.22
Peak output token throughput (tok/s):    152.00
Peak concurrent requests:                16.00
Total token throughput (tok/s):          5648.44
---------------Time to First Token----------------
Mean TTFT (ms):                          53.85
Median TTFT (ms):                        55.84
P99 TTFT (ms):                           86.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.78
Median TPOT (ms):                        2.78
P99 TPOT (ms):                           2.84
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.73
Median ITL (ms):                         55.41
P99 ITL (ms):                            57.44
==================================================

Unfused:

============ Serving Benchmark Result ============
Successful requests:                     80
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  30.50
Total input tokens:                      81920
Total generated tokens:                  81920
Request throughput (req/s):              2.62
Output token throughput (tok/s):         2686.20
Peak output token throughput (tok/s):    145.00
Peak concurrent requests:                16.00
Total token throughput (tok/s):          5372.41
---------------Time to First Token----------------
Mean TTFT (ms):                          58.81
Median TTFT (ms):                        63.80
P99 TTFT (ms):                           85.12
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.92
Median TPOT (ms):                        2.92
P99 TPOT (ms):                           2.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.49
Median ITL (ms):                         58.44
P99 ITL (ms):                            59.91
==================================================
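The ~5% figure can be checked directly from the two tables (numbers copied from the reports above):

```python
# Output token throughput (tok/s) and mean TPOT (ms), fused vs. unfused runs.
fused_tok_s, unfused_tok_s = 2824.22, 2686.20
fused_tpot, unfused_tpot = 2.78, 2.92

throughput_gain = fused_tok_s / unfused_tok_s - 1  # ~5.1% more tok/s
tpot_reduction = 1 - fused_tpot / unfused_tpot     # ~4.8% lower mean TPOT
print(f"{throughput_gain:.1%} higher output throughput, "
      f"{tpot_reduction:.1%} lower mean TPOT")
```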

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added nvidia rocm Related to AMD ROCm v1 labels Mar 12, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 12, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces support for Flashinfer's fused RoPE, quantization, and KV cache update kernel, which is a great performance optimization for FP8 models on CUDA. The changes are well-structured, adding a new RopeQuantReshapeKVCachePattern to handle the fusion and updating related components to support it.

However, I've found a critical issue in vllm/v1/attention/backends/flashinfer.py where a check for KV cache sharing was removed, which could lead to incorrect behavior for models that use this feature. Please see my comment for details.

@mergify

mergify bot commented Mar 16, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @elvischenv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 16, 2026
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
@elvischenv elvischenv force-pushed the elvischenv/flashinfer-rope-quant-cache-fusion branch from cb4d5e7 to dd6afc1 Compare March 16, 2026 09:33
@mergify mergify bot removed the needs-rebase label Mar 16, 2026

Labels

gpt-oss Related to GPT-OSS models nvidia rocm Related to AMD ROCm v1

Projects

Status: Todo
Status: No status
Status: To Triage

Development

1 participant