[Intel GPU] Fix incorrect KV-cache page table for local attention when page_size > 1 #23757

Merged
mingfeima merged 4 commits into sgl-project:main from ckvermaAI:xpu_backend on May 26, 2026
@ckvermaAI
Contributor

@ckvermaAI ckvermaAI commented Apr 26, 2026

Motivation

Fixes incorrect KV-cache page table values passed to make_local_attention_virtual_batches in the XPU (Intel GPU) attention backend when page_size > 1 and local (chunked) attention is enabled. This bug caused incorrect/zeroed outputs from flash_attn_with_kvcache.

Root Cause

make_local_attention_virtual_batches expects a page-granularity block table in which each column p stores the physical page index for logical page p. However, the raw req_to_token table is token-granularity (column i = KV slot for token i).

When page_size > 1, the un-strided token-granularity table was passed directly, causing block_starts = k_seqstarts_absolute // page_size to index incorrect physical page values.
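
To make the mismatch concrete, here is a minimal pure-Python sketch. The slot values, the `page_lookup` helper, and the single-row tables are illustrative assumptions; the lookup is a simplified stand-in for the `block_starts = k_seqstarts_absolute // page_size` indexing described above, not the actual body of make_local_attention_virtual_batches.

```python
PAGE_SIZE = 4

# Hypothetical allocation: one request with 8 tokens stored in
# physical pages 2 and 0, in that order.
# Token-granularity row: column i = KV slot of token i.
token_row = [8, 9, 10, 11, 0, 1, 2, 3]

# Page-granularity row: column p = physical page of logical page p.
page_row = [2, 0]


def page_lookup(table_row, token_position, page_size):
    """Simplified stand-in: index the table by logical page number."""
    block_start = token_position // page_size
    return table_row[block_start]


# Token 4 lives in logical page 1, which maps to physical page 0.
correct = page_lookup(page_row, 4, PAGE_SIZE)

# Indexing the token-granularity row with the same page number reads
# column 1, which is the KV slot of token 1, not a page index.
buggy = page_lookup(token_row, 4, PAGE_SIZE)

print(correct, buggy)
```

With these assumed values the page-granularity table yields physical page 0, while the token-granularity table yields 9, which is a KV slot rather than a page index.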

Modifications

When page_size > 1, the page table is first converted to page-granularity by:

  1. Striding: selecting every page_size-th column (torch.arange(0, ..., page_size))
  2. Dividing: integer-dividing values by page_size to convert KV slot indices to physical page indices

This mirrors the normalization already applied to metadata.page_table elsewhere in the backend.
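
The two steps can be sketched on plain Python lists. This is an illustration only: the backend operates on tensors, and the function name `to_page_granularity` and the sample slot values are assumptions made for this example. It also assumes every page holds page_size contiguous, page-aligned slots.

```python
def to_page_granularity(token_table, page_size):
    """Convert a token-granularity table to page granularity.

    token_table: list of rows, where column i holds the KV slot of token i.
    Returns rows where column p holds the physical page of logical page p.
    """
    if page_size <= 1:
        # Token and page granularity coincide; return a copy unchanged.
        return [list(row) for row in token_table]
    return [
        # Step 1 (striding): keep the first slot of every page.
        # Step 2 (dividing): slot index // page_size = physical page index.
        [slot // page_size for slot in row[::page_size]]
        for row in token_table
    ]


# Two hypothetical requests, 8 tokens each, page_size = 4.
token_table = [
    [8, 9, 10, 11, 0, 1, 2, 3],      # pages 2, 0
    [4, 5, 6, 7, 12, 13, 14, 15],    # pages 1, 3
]
page_table = to_page_granularity(token_table, 4)
print(page_table)
```

For page_size == 1 the function returns the table unchanged, which matches the PR applying the conversion only when page_size > 1.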

Changes
python/sglang/srt/layers/attention/xpu_backend.py: Add page table stride+divide normalization before passing to make_local_attention_virtual_batches when page_size > 1

Accuracy Tests

GSM8k benchmark on XPU with page_size > 1 and local attention enabled:

| Metric | Before | After |
|---|---|---|
| Accuracy | 0.005 | 0.815 |
| Invalid | 0.015 | 0.000 |

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): Run #26011291231
Latest PR Test (Extra): ⚠️ Not enabled — add run-ci-extra label to opt in.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces logic to normalize the page table within the _init_local_attn_metadata function of the XPU backend, ensuring it is at page-granularity when page_size is greater than one. Feedback suggests refactoring this logic into a shared helper method to eliminate duplication with similar code found in init_forward_metadata.

Comment on lines +960 to +964

    if self.page_size > 1:
        strided_indices = torch.arange(
            0, page_table.shape[1], self.page_size, device=page_table.device
        )
        page_table = page_table[:, strided_indices] // self.page_size

medium

The normalization logic added here (striding and floor-dividing the page table) is identical to the logic at lines 371-376 in init_forward_metadata. While necessary here because _init_local_attn_metadata is called before that main normalization block, it would be cleaner to encapsulate this logic into a helper method or move the main normalization earlier in init_forward_metadata to avoid duplication and ensure consistency across the backend.
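
One way the suggested refactor might look is sketched below. `ToyBackend` and `_to_page_granularity` are hypothetical names, and the two methods are placeholders that only borrow the names mentioned in the review; the real methods do much more. The sketch assumes PyTorch is installed.

```python
import torch


class ToyBackend:
    """Hypothetical stand-in for the XPU backend; only page_size is modeled."""

    def __init__(self, page_size: int):
        self.page_size = page_size

    def _to_page_granularity(self, page_table: torch.Tensor) -> torch.Tensor:
        """Stride and floor-divide a token-granularity table.

        Returns the input unchanged when page_size == 1.
        """
        if self.page_size <= 1:
            return page_table
        strided_indices = torch.arange(
            0, page_table.shape[1], self.page_size, device=page_table.device
        )
        return page_table[:, strided_indices] // self.page_size

    # Placeholder call sites: both paths share the single helper above,
    # so the normalization cannot drift between them.
    def init_forward_metadata(self, page_table: torch.Tensor) -> torch.Tensor:
        return self._to_page_granularity(page_table)

    def _init_local_attn_metadata(self, page_table: torch.Tensor) -> torch.Tensor:
        return self._to_page_granularity(page_table)


backend = ToyBackend(page_size=4)
table = torch.tensor([[8, 9, 10, 11, 0, 1, 2, 3]])
main_path = backend.init_forward_metadata(table)
local_path = backend._init_local_attn_metadata(table)
print(main_path.tolist(), local_path.tolist())
```

Whether a helper or an earlier normalization point is the better fit depends on how the real methods consume the un-normalized table, which this sketch does not model.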

@mingfeima
Collaborator

@sunjiweiswift please help review this one!

@mingfeima mingfeima added the intel, xpu (intel gpu with device `torch.xpu`), and run-ci labels May 18, 2026
@mingfeima mingfeima marked this pull request as ready for review May 18, 2026 03:04
@mingfeima
Collaborator

We should add unit tests (UT) to cover radix-attention-level behavior, like these:

  • test/registered/attention/test_triton_attention_backend.py
  • test/registered/attention/test_torch_native_attention_backend.py
  • test/registered/cpu/test_intel_amx_attention_backend_a.py

@ckvermaAI could you please take this job?

@ckvermaAI
Contributor Author

@mingfeima Sure, let me add a similar unit test.

@sunjiweiswift

Let me test this on more models.

@mingfeima
Collaborator

> Let me test this on more models.

@sunjiweiswift any updates?

@mingfeima mingfeima enabled auto-merge (squash) May 26, 2026 03:01
@mingfeima mingfeima disabled auto-merge May 26, 2026 03:02
@mingfeima mingfeima merged commit 156d1af into sgl-project:main May 26, 2026
112 of 121 checks passed
Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026
Labels

intel, run-ci, xpu (intel gpu with device `torch.xpu`)

3 participants