
Conversation

@ZJY0516
Contributor

@ZJY0516 ZJY0516 commented Sep 30, 2025

Purpose

Fixes #25705
Optimize the reshape_and_cache CUDA kernel.
Separate the key and value loops, which allows specialized indexing for each cache layout.
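The layout difference that motivates the split can be modeled in plain Python. This is a rough sketch, not the kernel: the shapes follow vLLM's paged-KV-cache convention, but all sizes and the `x` packing factor here are illustrative, and `cache_token` is a hypothetical helper.

```python
# Illustrative paged-KV-cache shapes (made-up sizes; in vLLM, x is the
# number of elements packed together for a vectorized access).
num_blocks, num_heads, head_size, block_size, x = 4, 2, 8, 16, 4

def zeros(*shape):
    """Nested-list tensor of zeros with the given shape."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(*shape[1:]) for _ in range(shape[0])]

# key cache:   [num_blocks, num_heads, head_size // x, block_size, x]
# value cache: [num_blocks, num_heads, head_size, block_size]
key_cache = zeros(num_blocks, num_heads, head_size // x, block_size, x)
value_cache = zeros(num_blocks, num_heads, head_size, block_size)

def cache_token(key, value, block_idx, block_offset):
    """Scatter one token's key/value into the paged caches.

    The key write indexes [head, h // x, offset, h % x] while the value
    write indexes [head, h, offset]; the differing strides are why the
    CUDA kernel benefits from two specialized loops rather than one.
    """
    for head in range(num_heads):
        for h in range(head_size):
            key_cache[block_idx][head][h // x][block_offset][h % x] = key[head][h]
            value_cache[block_idx][head][h][block_offset] = value[head][h]

key = [[float(head * head_size + h) for h in range(head_size)]
       for head in range(num_heads)]
value = [[-v for v in row] for row in key]
cache_token(key, value, block_idx=1, block_offset=3)

# Round-trip check: read back what was written.
assert key_cache[1][0][1][3][2] == key[0][6]    # h=6 -> (6 // 4, 6 % 4) = (1, 2)
assert value_cache[1][1][5][3] == value[1][5]
```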

Test

pytest -s tests/kernels/attention/test_cache.py::test_reshape_and_cache

passed

gsm8k

vllm (pretrained=/data/datasets/models-hf/Qwen3-4B-Instruct-2507-FP8/,max_model_len=32768,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
main
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8378|±  |0.0102|
|     |       |strict-match    |     5|exact_match|↑  |0.8415|±  |0.0101|
this pr
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8378|±  |0.0102|
|     |       |strict-match    |     5|exact_match|↑  |0.8415|±  |0.0101|

Performance

test on L40

python benchmarks/kernels/benchmark_reshape_and_cache.py
|num_tokens|Old Run (µs)|New Run (µs)|Change (%)|
|---------:|-----------:|-----------:|---------:|
|2|19.359|12.620|-34.8% 🚀|
|4|19.429|12.626|-35.0% 🚀|
|8|20.393|12.719|-37.6% 🚀|
|16|20.658|12.741|-38.3% 🚀|
|32|20.577|12.671|-38.4% 🚀|
|64|29.261|15.197|-48.1% 🚀|
|128|272.679|22.695|-91.7% 🚀|
|256|731.026|39.518|-94.6% 🚀|
|512|1463.500|305.959|-79.1% 🚀|
|1024|2940.649|802.994|-72.7% 🚀|
|2048|5919.389|1569.302|-73.5% 🚀|
|4096|11835.899|2958.320|-75.0% 🚀|
|8192|23642.131|5683.640|-76.0% 🚀|
|16384|47338.031|10847.823|-77.1% 🚀|
|32768|94760.791|21208.416|-77.6% 🚀|
|65536|189625.249|41908.552|-77.9% 🚀|
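For reference, the Change column is the plain relative delta between the two runs; recomputing the first few rows from the raw timings:

```python
# Raw timings (µs) copied from the first eight rows of the table above.
old = [19.359, 19.429, 20.393, 20.658, 20.577, 29.261, 272.679, 731.026]
new = [12.620, 12.626, 12.719, 12.741, 12.671, 15.197, 22.695, 39.518]

def pct_change(old_us, new_us):
    """Relative change of the new run vs the old run, in percent."""
    return (new_us - old_us) / old_us * 100

changes = [round(pct_change(o, n), 1) for o, n in zip(old, new)]
print(changes)  # [-34.8, -35.0, -37.6, -38.3, -38.4, -48.1, -91.7, -94.6]
```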

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: zjy0516 <[email protected]>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request optimizes the reshape_and_cache CUDA kernel by vectorizing the key cache update. The change splits the main loop, creating a separate, vectorized loop for key updates to achieve coalesced memory access, while the value update logic remains in a separate loop. This is a sound optimization strategy. My review found one issue: an unnecessary header file is included, which should be removed to improve code hygiene and reduce dependencies.

Member

@yewentao256 yewentao256 left a comment


Thanks for the work!
Could you also add more test results like #22036?

Unit test, model eval, kernel perf comparison etc.

@ZJY0516
Contributor Author

ZJY0516 commented Sep 30, 2025

We already have a unit test in test_cache.py.
I will do the model eval and perf comparison later.
I was wondering how to generate the performance comparison table in #22036.

@ZJY0516
Contributor Author

ZJY0516 commented Sep 30, 2025

And do you have any suggestions for further optimization? I'm not sure if this is sufficient, and I'm keen to explore any potential improvements you might see.

@yewentao256
Member

We already have a unit test in test_cache.py. I will do the model eval and perf comparison later. I was wondering how to generate the performance comparison table in #22036.

@ZJY0516 I used benchmarks/kernels/benchmark_reshape_and_cache_flash.py to get the speed numbers, and then combined the data using an LLM.
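For anyone who prefers scripting that merge step, a small hypothetical helper (the function name and column headers are my own, not part of the benchmark script) that joins two runs into a markdown table:

```python
def to_markdown_table(num_tokens, old_us, new_us):
    """Merge two benchmark runs into a markdown comparison table.

    A hand-rolled alternative to combining the raw output with an LLM;
    the columns mirror the table in this PR description.
    """
    rows = ["|num_tokens|Old Run (µs)|New Run (µs)|Change (%)|",
            "|---:|---:|---:|---:|"]
    for n, old, new in zip(num_tokens, old_us, new_us):
        change = (new - old) / old * 100  # negative means the new run is faster
        rows.append(f"|{n}|{old:.3f}|{new:.3f}|{change:.1f}%|")
    return "\n".join(rows)

# Example with the first two rows of the PR's benchmark data.
print(to_markdown_table([2, 4], [19.359, 19.429], [12.620, 12.626]))
```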

@yewentao256
Member

And do you have any suggestions for further optimization? I'm not sure if this is sufficient, and I'm keen to explore any potential improvements you might see.

Don't worry too much about perf for now; first set up the pipeline to validate correctness and measure performance. Later experiments will be much easier.

@mergify mergify bot added the performance Performance-related issues label Oct 1, 2025
Signed-off-by: zjy0516 <[email protected]>
@ZJY0516 ZJY0516 requested a review from yewentao256 October 2, 2025 02:44
@ZJY0516
Contributor Author

ZJY0516 commented Oct 2, 2025

cc @Liu-congo

@Liu-congo
Contributor

Liu-congo commented Oct 2, 2025

cc @Liu-congo

Awesome! Thank you so much!

Member

@yewentao256 yewentao256 left a comment


Looks good to me, thanks for the work!

@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 2, 2025
@vllm-bot vllm-bot merged commit eb0fa43 into vllm-project:main Oct 3, 2025
82 of 84 checks passed
@ZJY0516 ZJY0516 deleted the reshape_and_cache branch October 3, 2025 08:35
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: zjy0516 <[email protected]>
Co-authored-by: Liu-congo <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025
Signed-off-by: zjy0516 <[email protected]>
Co-authored-by: Liu-congo <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
karan pushed a commit to karan/vllm that referenced this pull request Oct 6, 2025
Signed-off-by: zjy0516 <[email protected]>
Co-authored-by: Liu-congo <[email protected]>
Signed-off-by: Karan Goel <[email protected]>
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: zjy0516 <[email protected]>
Co-authored-by: Liu-congo <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: zjy0516 <[email protected]>
Co-authored-by: Liu-congo <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025

Labels

performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: [Perf] Optimize reshape_and_cache CUDA Kernel

4 participants