cherry-pick chunked attention from #821 + 32k+ context window fix from #855 (#870)

Closed
Luca-Calabria wants to merge 4 commits into vllm-project:releases/v0.14.0 from Luca-Calabria:cherry-pick-chunked-attn

Conversation

@Luca-Calabria
Contributor

@Luca-Calabria Luca-Calabria commented Jan 23, 2026

Cherry-pick missing fixes:

  • chunked attention fixes from #821
  • llama4 32k+ context window fix from #855

Copilot AI review requested due to automatic review settings January 23, 2026 15:24
Contributor

Copilot AI left a comment

Pull request overview

This pull request adds support for chunked attention by cherry-picking fixes from PR #821. The changes implement the infrastructure needed to handle models that use chunked attention patterns, ensuring proper attention bias computation and block mapping for both prefill and decode phases.

Changes:

  • Added chunked attention detection and initialization logic
  • Extended attention metadata structures with chunked-specific fields
  • Implemented chunked attention bias computation for prefill and decode phases
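The chunk-local masking described above can be sketched conceptually: a query at position i may attend a key at position j only when j is causally earlier and both positions fall in the same fixed-size chunk. The sketch below is an illustration in plain Python under those assumptions; the function names and the list-of-lists representation are hypothetical, not the PR's actual HPU implementation, which builds the bias as device tensors:

```python
def chunked_attn_mask(seq_len: int, chunk_size: int) -> list[list[float]]:
    """Additive attention bias for causal, chunk-local attention.

    Entry [i][j] is 0.0 when query i may attend key j (j <= i and both
    positions lie in the same chunk of `chunk_size` tokens), and -inf
    otherwise. Illustrative sketch only, not the PR's actual code.
    """
    neg_inf = float("-inf")
    return [
        [0.0 if (j <= i and i // chunk_size == j // chunk_size) else neg_inf
         for j in range(seq_len)]
        for i in range(seq_len)
    ]


def chunk_window_start(pos: int, chunk_size: int) -> int:
    """First key position visible to a decode-phase query at `pos`:
    the start of its chunk. A hypothetical helper for deciding which
    KV-cache blocks a decoding token actually needs."""
    return pos - pos % chunk_size
```

For example, with `chunk_size=2` and `seq_len=4`, position 2 starts a new chunk, so `chunked_attn_mask(4, 2)[2][1]` is -inf even though position 1 is causally earlier.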

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Core implementation of chunked attention support: model detection, metadata processing, attention bias computation, and block mapping |
| vllm_gaudi/v1/spec_decode/hpu_eagle.py | Added chunked attention metadata fields to speculative decoding |
| vllm_gaudi/v1/attention/backends/hpu_attn.py | Updated attention metadata creation with chunked attention parameters |
| vllm_gaudi/attention/backends/hpu_attn.py | Extended attention metadata dataclass and added chunked attention handling in the forward pass |
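On the model-detection piece: Llama-4-style Hugging Face configs advertise chunked (local) attention via an `attention_chunk_size` field. A minimal detection sketch, assuming that attribute name (the runner's actual check may differ):

```python
from types import SimpleNamespace


def uses_chunked_attention(hf_config) -> bool:
    """Return True when the model config advertises a positive chunked
    attention window. Assumes an HF-style `attention_chunk_size`
    attribute; illustrative sketch only."""
    chunk = getattr(hf_config, "attention_chunk_size", None)
    return chunk is not None and chunk > 0


# Example: a stand-in for a Llama-4-style text config.
llama4_like = SimpleNamespace(attention_chunk_size=8192)
plain_cfg = SimpleNamespace()  # no chunked attention advertised
```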


Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py Outdated
@Luca-Calabria Luca-Calabria marked this pull request as draft January 23, 2026 15:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

@Luca-Calabria Luca-Calabria changed the title from "cherry-pick chunked attention from #821" to "cherry-pick chunked attention from #821 + 32k+ context window fix from #855" Jan 26, 2026

@Luca-Calabria Luca-Calabria closed this by deleting the head repository Jan 26, 2026

2 participants