
[ROCm] Add Aiter PagedAttention with Sliding Window support #28719

Closed
sammysun0711 wants to merge 9 commits into vllm-project:main from
sammysun0711:add_aiter_pa_sliding_windows_support

Conversation


@sammysun0711 sammysun0711 commented Nov 14, 2025

Purpose

This PR aims to add Aiter PagedAttention (PA) sliding window support, which fixes a google/gemma-3-27b-it bfloat16 model accuracy issue with Aiter PA.

google/gemma-3-27b-it is trained with the additional parameter "sliding_window": 1024.

  • Triton unified attention (the default) unified_attention passes the sliding window as input:

    unified_attention(
        q=query[:num_actual_tokens],
        k=key_cache,
        v=value_cache,
        out=output[:num_actual_tokens],
        cu_seqlens_q=cu_seqlens_q,
        max_seqlen_q=max_seqlen_q,
        seqused_k=seqused_k,
        max_seqlen_k=max_seqlen_k,
        softmax_scale=self.scale,
        causal=True,
        alibi_slopes=self.alibi_slopes,
        window_size=self.sliding_window,
        block_table=block_table,
        softcap=self.logits_soft_cap,
        q_descale=None,  # Not supported
        k_descale=layer._k_scale.expand(descale_shape),
        v_descale=layer._v_scale.expand(descale_shape),
        sinks=self.sinks,
        output_scale=output_scale,
    )
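Note that the window_size passed above is typically a (left, right) tuple rather than the raw model-config integer. A hedged sketch of the usual conversion, following the FlashAttention-style convention (the helper name and exact mapping here are my assumption, not taken from this PR):

```python
from typing import Optional, Tuple

def to_window_size(sliding_window: Optional[int]) -> Tuple[int, int]:
    """Convert a model-config sliding_window value to a (left, right)
    window tuple in the FlashAttention-style convention (assumed here):
    (-1, -1) means unlimited context; with causal attention a config
    value of W usually maps to (W - 1, 0)."""
    if sliding_window is None:
        return (-1, -1)
    return (sliding_window - 1, 0)
```

For example, gemma3's "sliding_window": 1024 would map to (1023, 0): each query may look back at most 1023 positions plus itself, and never ahead.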

  • Aiter PA torch.ops.aiter.paged_attention_v1 does not take sliding_window as an input:

    torch.ops.aiter.paged_attention_v1(
        output[:num_decode_tokens],
        workspace_buffer,
        query[:num_decode_tokens],
        key_cache,
        value_cache,
        self.scale,
        attn_metadata.block_table[:num_decodes],
        attn_metadata.query_start_loc[:num_decodes],
        attn_metadata.seq_lens[:num_decodes],
        attn_metadata.max_seq_len,
        self.alibi_slopes,
        self.kv_cache_dtype,
        "NHD",
        self.logits_soft_cap,
        layer._k_scale,
        layer._v_scale,
        None,
        _PARTITION_SIZE_ROCM,
    )

If the input prompt is longer than 1024 tokens, the missing sliding-window handling causes the gemma3 accuracy degradation with Aiter PA.
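To see why this matters: a sliding window of 1024 means each query token may attend only to the most recent 1024 key positions, in addition to the usual causal constraint. A minimal pure-Python sketch of the combined mask (illustrative only, not the Aiter kernel; the function name is mine):

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list:
    """mask[i][j] is True when query position i may attend to key
    position j: causal (j <= i) AND within the window (i - j < window)."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]
```

A kernel that ignores the window term effectively lets every query attend to the full causal prefix, so outputs diverge from training behavior once the prompt exceeds the window length.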

To fix the gemma3 accuracy issue, the following 3 PRs are required:

Opened as a draft PR for now since it depends on the other 2 PRs.

Test Plan

lm_eval test with the gsm8k dataset

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added rocm Related to AMD ROCm v1 labels Nov 14, 2025
Signed-off-by: Xiake Sun <xiake.sun@amd.com>
@sammysun0711
Contributor Author

Sorry, I need to close this due to a rebase issue; continuing in a new PR: #29065.

