Extend paged attention to support query_len>1 #8328

vanbasten23 · 2024-10-27T22:20:04Z

This PR extends the existing paged attention kernel to support query_len>1. It also upgrades the flash attention implementation from v1 to v2.

Test plan:

python pytorch/xla/test/test_pallas.py -v -k PallasTest.test_paged_attention_multi_queries_wrapper
python pytorch/xla/test/test_tpu_paged_attention_kernel.py 2>&1 | tee out.txt

cc: @miladm

vanbasten23 · 2024-10-28T17:42:51Z

torch_xla/experimental/custom_kernel.py

+    page_indices,  # [batch_size, pages_per_sequence]
+    num_kv_pages_per_compute_block,
+    num_queries_per_compute_block,
+    use_kernel=True,


Hey @WoosukKwon, this is the integration point between vLLM and torch_xla. I'm wondering whether vLLM could toggle the use_kernel flag, perhaps through a flag of its own. I'd like to use the non-kernel version as a perf baseline. Do you know if that is possible?

For dynamo, it's similar. The integration point is the function multi_queries_paged_attention_xla in the same file.
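One way the toggle asked about above could be wired up is through an environment variable read on the caller side. This is only a sketch: the variable name PAGED_ATTENTION_USE_KERNEL and the helper resolve_use_kernel are hypothetical and do not exist in vLLM or torch_xla; only the use_kernel argument comes from this PR.

```python
import os

# Hypothetical variable name, not an existing vLLM or torch_xla setting.
ENV_VAR = "PAGED_ATTENTION_USE_KERNEL"


def resolve_use_kernel(environ=None):
    """Return the value to pass as use_kernel (defaults to True).

    The kernel stays enabled unless the variable is set to a falsy string,
    so the non-kernel path is an explicit opt-in for baseline measurements.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_VAR, "1").strip().lower()
    return value not in ("0", "false", "no", "off")
```

The caller would then forward the result as use_kernel=resolve_use_kernel() at the integration point, which keeps the default behavior unchanged.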

Liyang90 · 2024-10-28T18:19:04Z

torch_xla/experimental/pallas_kernels/multi_queries_paged_attention_kernel.py

+  q_index = q_blk_idx * num_queries_per_compute_block
+  kv_index = kv_blk_idx * kv_seq_len_per_kv_compute_blk
+  kv_len = lengths_ref[b]
+  row_ids = (kv_len - query_len) + q_index + jax.lax.broadcasted_iota(


Here, we assume the input query corresponds to the last q_len positions of the input kv. For example, if the input q_len is 8 and kv_len is 24, we assume the query corresponds to the kv at indices [16, 24) and apply the causal mask accordingly.

@WoosukKwon please let us know whether this assumption is valid for the use cases in vLLM.

Yes, that's the desired behavior. Thanks for checking it out with me!
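To make the alignment concrete, here is a small pure-Python sketch of the mask this assumption implies. It is not the kernel code: causal_mask is a hypothetical helper, and it assumes the usual causal convention that a query may attend to a kv position when that position's index is less than or equal to the query's own absolute position (kv_len - q_len + i, mirroring the row_ids offset in the snippet above).

```python
def causal_mask(q_len, kv_len):
    """Return mask[i][j], True when query i may attend to kv position j.

    Assumes the queries are the last q_len tokens of the kv sequence,
    so query i sits at absolute position (kv_len - q_len) + i.
    """
    offset = kv_len - q_len
    return [[j <= offset + i for j in range(kv_len)] for i in range(q_len)]


mask = causal_mask(q_len=8, kv_len=24)
# Query 0 sits at kv index 16 and sees positions 0..16 (17 entries).
# Query 7 sits at kv index 23 and sees all 24 positions.
```

With q_len equal to 1 the mask reduces to a single all-True row, which matches the original single-query paged attention behavior.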

vanbasten23 added 5 commits October 26, 2024 13:55

added the kernel and the test.

aec0d1f

all kernel tests pass.

c0082f7

integrate the new kernel to torch_xla

f41a732

add test to the tpu ci

5d2f9df

run linter

34af58e

vanbasten23 added the tpuci label Oct 27, 2024

vanbasten23 requested review from miladm and Liyang90 October 28, 2024 17:25

vanbasten23 marked this pull request as ready for review October 28, 2024 17:34

vanbasten23 commented Oct 28, 2024

Liyang90 reviewed Oct 28, 2024

vanbasten23 added 4 commits October 28, 2024 20:17

trigger the tpu ci

ee4ab99

added todo

551def2

add __init__.py to the pallas_kernel dir

d00a2c7

fix comments

e8e3395

vanbasten23 commented Oct 28, 2024

vanbasten23 added 5 commits October 28, 2024 23:14

debug the new kernel test succeeded locally but failed in the CI

99d39b4

Revert "debug the new kernel test succeeded locally but failed in the CI"

This reverts commit 99d39b4.

10fdc33

fix unknown flag -v

52c0ab0

add torch.compile support.

f846f71

linter

f0ca56d

vanbasten23 requested review from WoosukKwon and Liyang90 October 29, 2024 18:22

handle the case where query_len%num_queries_per_compute_block!=0

55cc66a

Liyang90 approved these changes Oct 30, 2024

vanbasten23 merged commit 1bac062 into master Oct 31, 2024
12 checks passed
