
Optimized Paged Attention for HPU #4

Closed
kzawora-intel wants to merge 1 commit into habana_main from private/kzawora/paged_attention_optimization

Conversation

@kzawora-intel

This PR adds a Paged Attention implementation for Gaudi2. In a standalone benchmark (B=32, Mkv=1024, Hkv=32, K=128), it showed a 200x performance improvement over the current implementation:

Current (naive) version:

INFO:root:[REQ:0][B:32, Mq:1, Mkv:996, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 106.500s
INFO:root:[REQ:1][B:32, Mq:1, Mkv:1006, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 46.476s
INFO:root:[REQ:2][B:32, Mq:1, Mkv:1015, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 21.430s
INFO:root:[REQ:3][B:32, Mq:1, Mkv:953, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 12.854s
INFO:root:[REQ:4][B:32, Mq:1, Mkv:1008, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 9.996s
INFO:root:[REQ:5][B:32, Mq:1, Mkv:980, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 5.928s
INFO:root:[REQ:6][B:32, Mq:1, Mkv:1024, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 8.595s
INFO:root:[REQ:7][B:32, Mq:1, Mkv:1015, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 5.835s
INFO:root:[REQ:8][B:32, Mq:1, Mkv:995, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 6.296s
INFO:root:[REQ:9][B:32, Mq:1, Mkv:1023, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 5.833s
INFO:root:[ALL:10][B:(32-32), Mq:1, Mkv:(953-1024), Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 229.758s

Optimized version:

INFO:root:[REQ:0][B:32, Mq:1, Mkv:996, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 0.959s
INFO:root:[REQ:1][B:32, Mq:1, Mkv:1006, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 0.022s
INFO:root:[REQ:2][B:32, Mq:1, Mkv:1015, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 0.020s
INFO:root:[REQ:3][B:32, Mq:1, Mkv:953, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 0.020s
INFO:root:[REQ:4][B:32, Mq:1, Mkv:1008, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 0.020s
INFO:root:[REQ:5][B:32, Mq:1, Mkv:980, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 0.020s
INFO:root:[REQ:6][B:32, Mq:1, Mkv:1024, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 0.020s
INFO:root:[REQ:7][B:32, Mq:1, Mkv:1015, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 0.020s
INFO:root:[REQ:8][B:32, Mq:1, Mkv:995, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 0.021s
INFO:root:[REQ:9][B:32, Mq:1, Mkv:1023, Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 0.020s
INFO:root:[ALL:10][B:(32-32), Mq:1, Mkv:(953-1024), Hq:32, Hkv:32, K:128] paged_attention_op_time (hpu): 1.149s
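For context, a minimal CPU sketch of the block-wise decode pattern that this kind of optimization typically relies on — the function name, the fixed-size block lists, and the shapes are illustrative assumptions, not the PR's actual code. Scores are computed per cache block, softmax-normalized jointly, and the value products accumulated block by block, which matches attention over the concatenated cache:

```python
import torch

def paged_attention_decode(query, key_blocks, value_blocks, block_size):
    """Decode-time attention over a block-paged KV cache (illustrative sketch).

    query:        [B, H, 1, K]
    key_blocks:   list of [B, H, block_size, K] tensors
    value_blocks: list of [B, H, block_size, K] tensors
    """
    scale = query.shape[-1] ** -0.5
    # Per-block raw attention scores; each entry is [B, H, 1, block_size].
    scores = [torch.matmul(query, k.transpose(-1, -2)) * scale for k in key_blocks]
    # Joint softmax across all blocks (concatenated along the KV axis).
    probs = torch.softmax(torch.cat(scores, dim=-1), dim=-1)
    probs_per_block = probs.split(block_size, dim=-1)
    # Weighted sum of values, accumulated block by block.
    out = torch.zeros_like(query)
    for p, v in zip(probs_per_block, value_blocks):
        out = out + torch.matmul(p, v)
    return out
```

The result is numerically equivalent to attention over the full cache; the gain on HPU comes from operating on fixed-shape blocks rather than dynamically shaped sequences.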

@kzawora-intel kzawora-intel changed the title Optimize Paged Attention for HPU Optimized Paged Attention for HPU Feb 20, 2024
Comment on lines +166 to +169
## hard override for filler. These blocks would contribute nothing to the output due to zero attention_probs and will clog up compute resources
# with torch.profiler.record_function('block_seq_len_check'):
# if (block_index - 2) * block_size > torch.max(context_lens):
# break


Please remove this block. It's unlikely to come back into use, even if a new check against overcomputing is introduced.

Comment on lines +162 to +163
with torch.profiler.record_function(f"block_loop"):
with torch.profiler.record_function("seq_index_fill"):


Do these profiler labels have any effect when not profiling this code?
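To the reviewer's question: when no profiler is attached, record_function is essentially a pass-through context manager — the label never appears anywhere and only a small constant overhead remains. A minimal check (the function and values here are illustrative):

```python
import torch
from torch.profiler import record_function

def scaled_sum(x):
    # The label only shows up in a trace when a profiler is active;
    # otherwise the context manager is a cheap pass-through.
    with record_function("scaled_sum"):
        return (x * 2.0).sum()

result = scaled_sum(torch.ones(4))  # computes identically with or without profiling
```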

# single block attn weight of shape [B, Hq, Mq(=1), block_size], equivalent to attn_weights_blocks[i]
attn_weights = attn_weights_blocks.index_select(0, block_index).squeeze(0)

with torch.profiler.record_function("fetch_block_table"):


The prevalence of the profiler labels reduces code readability, especially in this part of the code, where each record_function covers a single statement.

Comment on lines +139 to +141
query = query_in
key_cache = key_cache_in
value_cache = value_cache_in


The *_in args were used in a prior version to facilitate type casting, since FP16 was causing precision issues in softmax. If that is no longer the case, then the arg names should be changed and these lines should be removed.

htorch.core.mark_step()

# Cleanup out-of-bound weights and values
attn_weights_blocks_filler = torch.finfo(query.dtype).min


The "minimum value of query type" filler is also used elsewhere in this code (e.g. the attn_weights_blocks definition) and should probably be defined and commented on once, at the start of the function. There is also a question of type restrictions: the required behavior of this filler is to produce a zero under exp(), and that behavior is not observed in FP16.
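For FP32 the filler behaves as intended — after the max-subtraction inside softmax, exp() of a position filled with the dtype minimum underflows to exactly zero. A toy illustration of the masking pattern (the FP16 discrepancy the reviewer mentions is not reproduced here):

```python
import torch

# FP32: a score position filled with the dtype minimum gets exactly
# zero probability after softmax, so masked cache blocks contribute nothing.
filler = torch.finfo(torch.float32).min
scores = torch.tensor([0.5, filler, 1.0])
probs = torch.softmax(scores, dim=0)
# probs[1] underflows to exactly 0.0; the remaining mass sums to 1.
```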

attn_masks=None,
) -> None:
sanitize_values = True
device = query_in.device


Does this statement produce the device type ("hpu") or a device identifier ("hpu:0")? This is relevant to its use in further checks.
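The distinction matters because torch.device equality is index-sensitive: a device carrying an explicit index ("hpu:0") does not compare equal to the bare type ("hpu"), so a check like `device == "hpu"` can silently fail. Comparing `device.type` sidesteps the index. A CPU-based illustration (CPU is used only so the snippet runs anywhere):

```python
import torch

dev_plain = torch.device("cpu")      # type only, no index
dev_indexed = torch.device("cpu:0")  # explicit device index

# Devices with and without an explicit index do not compare equal,
# which is why `query_in.device == "hpu"` can miss "hpu:0".
same_device = dev_plain == dev_indexed        # False
# Comparing the .type attribute ignores the index and is robust.
same_type = dev_plain.type == dev_indexed.type  # True
```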

output.add_(out)
if device == "hpu":
htorch.core.mark_step()
return output.to(dtype=query_in.dtype)


There is no type casting of the query in the current code, so this cast of the output back to query_in.dtype is superfluous.

@madamczyk-intel

Since #14 was merged, I guess we can close this.

@kzawora-intel kzawora-intel added the habana Issues or PRs submitted by Habana Labs label Sep 20, 2024
@kzawora-intel kzawora-intel deleted the private/kzawora/paged_attention_optimization branch October 7, 2024 13:14
iboiko-habana added a commit that referenced this pull request Feb 28, 2025
tvoas referenced this pull request in tvoas/vllm-fork Mar 7, 2025
kzawora-intel pushed a commit that referenced this pull request Jul 10, 2025
Signed-off-by: Chendi Xue <chendi.xue@intel.com>