
Conversation

@ZJY0516
Contributor

@ZJY0516 ZJY0516 commented Sep 30, 2025

Purpose

Fixes #25705
Optimize the reshape_and_cache CUDA kernel.
Separate the key and value loops, which allows specialized indexing for each cache layout.
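The layout difference that motivates the split can be modeled in plain Python. This is a rough sketch, not the kernel: the shapes follow vLLM's paged-KV-cache convention, but all sizes and the `x` packing factor here are illustrative, and `cache_token` is a hypothetical helper.

```python
# Illustrative paged-KV-cache shapes (made-up sizes; in vLLM, x is the
# number of elements packed together for a vectorized access).
num_blocks, num_heads, head_size, block_size, x = 4, 2, 8, 16, 4

def zeros(*shape):
    """Nested-list tensor of zeros with the given shape."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(*shape[1:]) for _ in range(shape[0])]

# key cache:   [num_blocks, num_heads, head_size // x, block_size, x]
# value cache: [num_blocks, num_heads, head_size, block_size]
key_cache = zeros(num_blocks, num_heads, head_size // x, block_size, x)
value_cache = zeros(num_blocks, num_heads, head_size, block_size)

def cache_token(key, value, block_idx, block_offset):
    """Scatter one token's key/value into the paged caches.

    The key write indexes [head, h // x, offset, h % x] while the value
    write indexes [head, h, offset]; the differing strides are why the
    CUDA kernel benefits from two specialized loops rather than one.
    """
    for head in range(num_heads):
        for h in range(head_size):
            key_cache[block_idx][head][h // x][block_offset][h % x] = key[head][h]
            value_cache[block_idx][head][h][block_offset] = value[head][h]

key = [[float(head * head_size + h) for h in range(head_size)]
       for head in range(num_heads)]
value = [[-v for v in row] for row in key]
cache_token(key, value, block_idx=1, block_offset=3)

# Round-trip check: read back what was written.
assert key_cache[1][0][1][3][2] == key[0][6]    # h=6 -> (6 // 4, 6 % 4) = (1, 2)
assert value_cache[1][1][5][3] == value[1][5]
```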

Test

pytest -s tests/kernels/attention/test_cache.py::test_reshape_and_cache

passed

gsm8k

vllm (pretrained=/data/datasets/models-hf/Qwen3-4B-Instruct-2507-FP8/,max_model_len=32768,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
main
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8378|±  |0.0102|
|     |       |strict-match    |     5|exact_match|↑  |0.8415|±  |0.0101|
this pr
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8378|±  |0.0102|
|     |       |strict-match    |     5|exact_match|↑  |0.8415|±  |0.0101|

Performance

test on L40

python benchmarks/kernels/benchmark_reshape_and_cache.py
|num_tokens|Old Run (µs)|New Run (µs)|Change (%)|
|---------:|-----------:|-----------:|---------:|
|2|19.359|12.620|-34.8% 🚀|
|4|19.429|12.626|-35.0% 🚀|
|8|20.393|12.719|-37.6% 🚀|
|16|20.658|12.741|-38.3% 🚀|
|32|20.577|12.671|-38.4% 🚀|
|64|29.261|15.197|-48.1% 🚀|
|128|272.679|22.695|-91.7% 🚀|
|256|731.026|39.518|-94.6% 🚀|
|512|1463.500|305.959|-79.1% 🚀|
|1024|2940.649|802.994|-72.7% 🚀|
|2048|5919.389|1569.302|-73.5% 🚀|
|4096|11835.899|2958.320|-75.0% 🚀|
|8192|23642.131|5683.640|-76.0% 🚀|
|16384|47338.031|10847.823|-77.1% 🚀|
|32768|94760.791|21208.416|-77.6% 🚀|
|65536|189625.249|41908.552|-77.9% 🚀|
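For reference, the Change column is the plain relative delta between the two runs; recomputing the first few rows from the raw timings:

```python
# Raw timings (µs) copied from the first eight rows of the table above.
old = [19.359, 19.429, 20.393, 20.658, 20.577, 29.261, 272.679, 731.026]
new = [12.620, 12.626, 12.719, 12.741, 12.671, 15.197, 22.695, 39.518]

def pct_change(old_us, new_us):
    """Relative change of the new run vs the old run, in percent."""
    return (new_us - old_us) / old_us * 100

changes = [round(pct_change(o, n), 1) for o, n in zip(old, new)]
print(changes)  # [-34.8, -35.0, -37.6, -38.3, -38.4, -48.1, -91.7, -94.6]
```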

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: zjy0516 <[email protected]>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request optimizes the reshape_and_cache CUDA kernel by vectorizing the key cache update. The change splits the main loop, creating a separate, vectorized loop for key updates to achieve coalesced memory access, while the value update logic remains in a separate loop. This is a sound optimization strategy. My review found one issue: an unnecessary header file is included, which should be removed to improve code hygiene and reduce dependencies.

Member

@yewentao256 yewentao256 left a comment


Thanks for the work!
Could you also add more test results like #22036?

Unit test, model eval, kernel perf comparison etc.

@ZJY0516
Contributor Author

ZJY0516 commented Sep 30, 2025

We already have a unit test in test_cache.py.
I will do the model eval and perf comparison later.
I was wondering how to generate the performance comparison table in #22036.

@ZJY0516
Contributor Author

ZJY0516 commented Sep 30, 2025

And do you have any suggestions for further optimization? I'm not sure if this is sufficient, and I'm keen to explore any potential improvements you might see.

@yewentao256
Member

We already have a unit test in test_cache.py. I will do the model eval and perf comparison later. I was wondering how to generate the performance comparison table in #22036.

@ZJY0516 I used benchmarks/kernels/benchmark_reshape_and_cache_flash.py to get the speed numbers, and then combined the data using an LLM.
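For anyone who prefers scripting that merge step, a small hypothetical helper (the function name and column headers are my own, not part of the benchmark script) that joins two runs into a markdown table:

```python
def to_markdown_table(num_tokens, old_us, new_us):
    """Merge two benchmark runs into a markdown comparison table.

    A hand-rolled alternative to combining the raw output with an LLM;
    the columns mirror the table in this PR description.
    """
    rows = ["|num_tokens|Old Run (µs)|New Run (µs)|Change (%)|",
            "|---:|---:|---:|---:|"]
    for n, old, new in zip(num_tokens, old_us, new_us):
        change = (new - old) / old * 100  # negative means the new run is faster
        rows.append(f"|{n}|{old:.3f}|{new:.3f}|{change:.1f}%|")
    return "\n".join(rows)

# Example with the first two rows of the PR's benchmark data.
print(to_markdown_table([2, 4], [19.359, 19.429], [12.620, 12.626]))
```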

@yewentao256
Member

And do you have any suggestions for further optimization? I'm not sure if this is sufficient, and I'm keen to explore any potential improvements you might see.

Don't worry too much about perf for now; first set up the pipeline to validate correctness and measure performance. Later experiments will be much easier.

@mergify mergify bot added the performance Performance-related issues label Oct 1, 2025
Signed-off-by: zjy0516 <[email protected]>
@ZJY0516 ZJY0516 requested a review from yewentao256 October 2, 2025 02:44
@ZJY0516
Contributor Author

ZJY0516 commented Oct 2, 2025

cc @Liu-congo

@Liu-congo
Contributor

Liu-congo commented Oct 2, 2025

cc @Liu-congo

Awesome! Thank you so much!

Member

@yewentao256 yewentao256 left a comment


Looks good to me, thanks for the work!

@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 2, 2025
@vllm-bot vllm-bot merged commit eb0fa43 into vllm-project:main Oct 3, 2025
82 of 84 checks passed
@ZJY0516 ZJY0516 deleted the reshape_and_cache branch October 3, 2025 08:35
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: zjy0516 <[email protected]>
Co-authored-by: Liu-congo <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025
Signed-off-by: zjy0516 <[email protected]>
Co-authored-by: Liu-congo <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
karan pushed a commit to karan/vllm that referenced this pull request Oct 6, 2025
Signed-off-by: zjy0516 <[email protected]>
Co-authored-by: Liu-congo <[email protected]>
Signed-off-by: Karan Goel <[email protected]>
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: zjy0516 <[email protected]>
Co-authored-by: Liu-congo <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: zjy0516 <[email protected]>
Co-authored-by: Liu-congo <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025

Labels

performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: [Perf] Optimize reshape_and_cache CUDA Kernel

4 participants