[ROCM][KERNEL] Paged attention for V1 #15720
Conversation
Signed-off-by: Aleksandr Malyshev <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: root <[email protected]>
Signed-off-by: Aleksandr Malyshev <[email protected]>
Llama 70B TP2, same test, this PR (--no-enable-prefix-caching):
Signed-off-by: Aleksandr Malyshev <[email protected]>
Output-heavy load with Llama 3.1 70B:
python vllm/benchmarks/benchmark_serving.py --backend vllm --model /data/models/Llama-3.1-70B-Instruct --dataset-name random --random-input-len 150 --random-output-len 16384 --max-concurrency 32 --num-prompts 512 --seed 42
upstream (--no-enable-prefix-caching):
this PR (--no-enable-prefix-caching):
ProExpertProg
left a comment
So if I understand correctly, this adds support for chunked prefill to the existing ROCm custom Paged Attention, and then for prefill we still use the _fwd_kernel?
Correct. We had two kernels: one for chunked prefills in a query and another kernel, previously Triton-only, for decodes in the query. This PR adds an alternative to the Triton decode kernel for a set of cases; please see the limitations in the use_rocm_custom_paged_attention function. As for the ROCm paged attention kernel itself (which is in fact 3 C++ kernels), it now just skips prefills in the query. PS: of course there might be a more tailored way to utilize ROCm paged attention for V1, but that would take more time to update the kernel.
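To make the "skips prefills in the query" point concrete, here is a minimal illustrative sketch, not the actual code in csrc/rocm/attention.cu: the query_start_loc_ptr convention follows the diff quoted further down, while the kernel name, signature, and everything else are assumed for illustration. The idea is that a decode-oriented kernel can check, per sequence, whether its query span is longer than one token (a chunked-prefill portion handled by the other kernel) and bail out early.

```cpp
// Illustrative sketch only -- not the kernel from csrc/rocm/attention.cu.
// Assumption: query_start_loc_ptr[i] is the first query-token offset of
// sequence i, so sequence i owns [query_start_loc_ptr[i], query_start_loc_ptr[i+1]).
__global__ void decode_only_attention_sketch(
    const int* __restrict__ query_start_loc_ptr,  // [num_seqs + 1], may be null
    const int num_seqs
    /* ... the real kernel also takes Q/K/V, the paged KV cache, etc. ... */) {
  const int seq_idx = blockIdx.x;
  if (seq_idx >= num_seqs) return;

  // Without query_start_loc (a pure decode batch), query token i belongs to
  // sequence i, matching the quoted diff.
  const int64_t query_start_off =
      query_start_loc_ptr ? query_start_loc_ptr[seq_idx] : seq_idx;
  const int64_t query_end_off =
      query_start_loc_ptr ? query_start_loc_ptr[seq_idx + 1] : seq_idx + 1;

  // More than one query token means this sequence is a prefill chunk; skip it
  // and let the prefill path handle it.
  if (query_end_off - query_start_off > 1) return;

  // ... decode attention for the single query token at query_start_off ...
}
```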
Signed-off-by: Aleksandr Malyshev <[email protected]>
Signed-off-by: Aleksandr Malyshev <[email protected]>
Nice. This is consistent with my profiling of where the issues are on V1. Glad to see the update. This generally looks fine to me. I will follow up with a refactor for the attn backend. @SageMoore @ProExpertProg - can you guys look through the cpp and let me know if it's okay to merge?
CPP looks good to me, very contained and straightforward change 😃
^ Mistral Small, TP=1, 1000 In | 100 Out. Nice step forward, but still juice to squeeze.
csrc/rocm/attention.cu (Outdated)
    const int64_t seq_idx64 = static_cast<int64_t>(seq_idx);
    const int64_t query_start_off =
        query_start_loc_ptr ? query_start_loc_ptr[seq_idx] : seq_idx;
keep the static cast?
Should this be a conversion similar to what static_cast provides? I mean, when the assignment happens to int64_t from an int?
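For context on the cast question: in C++, assigning an int to an int64_t is an implicit widening conversion and is always value-preserving, so it behaves the same as static_cast<int64_t>; the explicit cast mainly documents intent, or forces 64-bit arithmetic where an expression would otherwise overflow in 32 bits. A small standalone example (hypothetical values, not the PR code):

```cpp
#include <cassert>
#include <cstdint>

int main() {
  const int seq_idx = 3;
  const int starts[] = {0, 5, 9, 12, 20};  // stand-in for query_start_loc

  // Implicit widening conversion: int -> int64_t, value-preserving.
  const int64_t implicit_off = starts[seq_idx];
  // Same value, with the intent spelled out explicitly.
  const int64_t explicit_off = static_cast<int64_t>(starts[seq_idx]);
  assert(implicit_off == explicit_off);  // both are 12

  // Where the cast does matter: widening *before* arithmetic, so the
  // multiplication happens in 64 bits instead of overflowing in 32.
  const int large = 1 << 30;
  const int64_t safe = static_cast<int64_t>(large) * 4;  // 2^32, correct
  // const int64_t unsafe = large * 4;  // 32-bit signed overflow before widening (UB)
  assert(safe == (int64_t{1} << 32));
  return 0;
}
```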
Signed-off-by: Aleksandr Malyshev <[email protected]>
Head branch was pushed to by a user without write access
Signed-off-by: Aleksandr Malyshev <[email protected]>
Signed-off-by: Aleksandr Malyshev <[email protected]> Signed-off-by: root <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]> Co-authored-by: root <[email protected]> Signed-off-by: xinyuxiao <[email protected]>
Signed-off-by: Aleksandr Malyshev <[email protected]> Signed-off-by: root <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]> Co-authored-by: root <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
Signed-off-by: Aleksandr Malyshev <[email protected]> Signed-off-by: root <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]> Co-authored-by: root <[email protected]>
Signed-off-by: Aleksandr Malyshev <[email protected]> Signed-off-by: root <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]> Co-authored-by: root <[email protected]> Signed-off-by: Mu Huai <[email protected]>

Adopting ROCm Paged Attention to be used in V1 FA as an alternative to the Triton kernel. Perf I see:
Baseline (VLLM_ROCM_CUSTOM_PAGED_ATTN=0 and --no-enable-prefix-caching):
With change (--no-enable-prefix-caching):
latency:
V0
Avg latency: 40.9744499316439 seconds
V1
upstream (VLLM_ROCM_CUSTOM_PAGED_ATTN=0):
Avg latency: 46.44388557784259 seconds
with change:
Avg latency: 34.401718624557056 seconds
correctness:
2025-03-31:21:23:17,108 INFO [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
vllm (pretrained=/data/models/Llama-3.1-8B-Instruct), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
cc @SageMoore