[ROCm][perf] Shuffle KV cache to use paged_attention_common#1
Signed-off-by: Samu Tamminen <stammine@amd.com>
Purpose
For the Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 model, VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1 currently performs worse at low concurrency than VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=0. This PR fixes the issue by using paged_attention_common from aiter (see ROCm/aiter#1821).
Test Plan
For input and output lengths of 1k and 8k and concurrencies of 8, 18, 32, 64, and 128, compare the current main branch with and without VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT (_vllm_main_shuffle1 and _vllm_main_shuffle0, respectively) against the changes in this PR (_pr_shuffle1).
Also verified on MI355.
Also verified for Qwen/Qwen3-235B-A22B-Instruct-2507.
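The comparison matrix described in the test plan can be sketched as below. This is only an illustration of how the runs are organized, not the script used for this PR: the function name `build_matrix`, the branch labels "main" and "pr", and the reading of "1k"/"8k" as 1024/8192 tokens are assumptions. The three length pairs are the ones reported under Test Result.

```python
from itertools import product

ENV_VAR = "VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT"

# Assumption: "1k" and "8k" mean 1024 and 8192 tokens.
# (input_len, output_len) pairs reported in the test results.
LENGTHS = [(8192, 1024), (1024, 8192), (1024, 1024)]

# Concurrency levels as listed in the test plan.
CONCURRENCIES = [8, 18, 32, 64, 128]

# Variant label -> (branch, value of ENV_VAR). Branch names are hypothetical.
VARIANTS = {
    "_vllm_main_shuffle0": ("main", "0"),
    "_vllm_main_shuffle1": ("main", "1"),
    "_pr_shuffle1": ("pr", "1"),
}


def build_matrix():
    """Return one dict per benchmark run in the comparison."""
    runs = []
    for (label, (branch, flag)), (in_len, out_len), conc in product(
        VARIANTS.items(), LENGTHS, CONCURRENCIES
    ):
        runs.append(
            {
                "label": label,
                "branch": branch,
                "env": {ENV_VAR: flag},
                "input_len": in_len,
                "output_len": out_len,
                "concurrency": conc,
            }
        )
    return runs


if __name__ == "__main__":
    for run in build_matrix():
        print(run)
```

Each entry only describes a run; launching the server with the given environment and driving the benchmark client is left to whatever harness is in use.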
Test Result
For input length 8k and output length 1k (green lines), the changes in this PR (_pr_shuffle1, the solid line) outperform the main branch, with or without the shuffled KV cache layout.
For input length 1k and output length 8k (orange lines), the changes in this PR (_pr_shuffle1, the solid line) outperform the main branch, with or without the shuffled KV cache layout.
For input length 1k and output length 1k (blue lines), the changes in this PR (_pr_shuffle1, the solid line) perform very close to the main branch. This case might require further tuning of paged_attention_common in aiter.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.