[OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel #420
Conversation
single_query_cached_kv_attention kernel
See #421 for a detailed description and analysis of this commit.

Hi @naed90, overall LGTM; I just had one small nitpick, and it looks like there are some formatting issues to address.

ty.

@WoosukKwon @zhuohan123 hey, what do you think?

Hey @naed90, thanks for submitting the PR and apologies for the late response. I was busy for the last few days. Will take a look at your issue and PR today.

@WoosukKwon bump :)

Tested a bit on the latency side: before/after optimization (the measurement screenshots are not recoverable here).
Thank you for your contribution! Left some small comments. We should be able to merge this after the changes.
csrc/attention/attention_kernels.cu
Outdated
  const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
- Q_vec q_vecs[NUM_VECS_PER_THREAD];
+ __shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
+ if (thread_group_idx <= NUM_THREAD_GROUPS_LOWER_BOUND) {
This if seems redundant if we assume NUM_THREADS is divisible by THREAD_GROUP_SIZE?
Replaced with an assert.
Co-authored-by: Zhuohan Li <[email protected]>
LGTM! Thank you again for your hard work and detailed profiling!
Bank conflict in loading q_vecs.
csrc/attention/attention_kernels.cu
Outdated
- q_vecs[i] = *reinterpret_cast<const Q_vec*>(q_ptr + vec_idx * VEC_SIZE);
+ for (int i = thread_group_idx; i < NUM_VECS_PER_THREAD; i += NUM_THREAD_GROUPS_LOWER_BOUND) {
+   const int vec_idx = thread_group_offset + i * THREAD_GROUP_SIZE;
+   q_vecs[thread_group_offset][i] = *reinterpret_cast<const Q_vec*>(q_ptr + vec_idx * VEC_SIZE);
I have a question about loading the query from gmem into q_vecs in smem. Threads within a thread group are adjacent in the thread dimension, so writing q_vecs[thread_group_offset][i] will cause the grouped threads to access the same bank in smem (assuming NUM_VECS_PER_THREAD * VEC_SIZE * sizeof(scalar_t) % 32 == 0). Could this be fixed by storing q_vecs in column-major order, or is this a misunderstanding?
Instead of having each thread group fetch the query head (which causes 64x the memory to be read), we have all threads in the block share the task of loading the query head. On a benchmark running 1000 sequences through LLaMA-13B on an A100 (80 GB), this improves throughput by 1.10x.