Add support for chunked attention (#597) #809

Merged
mgawarkiewicz-intel merged 6 commits into vllm-project:releases/v0.13.0 from kfojcik-intel:dev/kfojcik/chunked_attn_0_13_0
Jan 15, 2026

@kfojcik-intel
Contributor

Cherry-pick of vllm-project@6e1be4e

---------

Signed-off-by: Jan Kaniecki <jkaniecki@habana.ai>
Signed-off-by: Jan Kaniecki <jan.kaniecki@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings January 13, 2026 08:46
Contributor

Copilot AI left a comment

Pull request overview

This PR adds support for chunked attention to the vLLM-Gaudi implementation by cherry-picking a commit from the upstream repository. Chunked attention divides sequences into chunks and applies attention mechanisms within and across chunks, which can improve memory efficiency and performance for long sequences.

Changes:

  • Added chunked attention bias calculation and block mapping methods
  • Extended attention metadata structures to include chunked attention fields
  • Integrated chunked attention support into the model execution pipeline
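
To make the bias calculation in the first bullet concrete, here is a minimal sketch of a chunked attention bias. It assumes the chunked local variant, in which a token attends only to earlier tokens inside its own chunk. The function name `chunked_causal_bias` and the `chunk_size` parameter are illustrative and are not taken from the PR.

```python
import math


def chunked_causal_bias(seq_len: int, chunk_size: int) -> list[list[float]]:
    """Return a seq_len x seq_len additive attention bias.

    Entry [q][k] is 0.0 when query position q may attend to key position k
    and -inf otherwise. Assumption (not from the PR): attention is causal
    and restricted to the chunk that contains the query.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    bias = []
    for q in range(seq_len):
        # First position of the chunk that contains q.
        chunk_start = (q // chunk_size) * chunk_size
        row = [0.0 if chunk_start <= k <= q else -math.inf for k in range(seq_len)]
        bias.append(row)
    return bias


bias = chunked_causal_bias(seq_len=6, chunk_size=4)
# For each query position, list the key positions it is allowed to see.
visible = [[k for k, b in enumerate(row) if b == 0.0] for row in bias]
```

With `chunk_size=4`, positions 0 to 3 form one chunk and positions 4 and 5 start the next, so position 4 sees only itself rather than the whole prefix.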

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Core implementation of chunked attention logic including bias calculation, block mapping, metadata updates, and model initialization |
| vllm_gaudi/v1/attention/backends/hpu_attn.py | Updated attention metadata factory method to accept chunked attention parameters |
| vllm_gaudi/attention/backends/hpu_attn.py | Extended metadata dataclass with chunked attention fields and integrated chunked attention into forward pass |
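
The block mapping mentioned for `hpu_model_runner.py` can be pictured with a small sketch. It assumes a paged KV cache with fixed-size blocks and chunked local attention, so a decode token at position `pos` only needs the cache blocks that overlap its own chunk. `chunked_block_range`, `select_blocks`, `block_size`, and `chunk_size` are hypothetical names used for illustration, not the PR's API.

```python
def chunked_block_range(pos: int, chunk_size: int, block_size: int) -> range:
    """Logical KV-cache block indices needed by the token at position pos.

    Assumption (not from the PR): the token attends to positions
    [chunk_start, pos], where chunk_start is the first position of the
    chunk that contains pos.
    """
    if pos < 0 or chunk_size <= 0 or block_size <= 0:
        raise ValueError("pos must be >= 0; chunk_size and block_size must be > 0")
    chunk_start = (pos // chunk_size) * chunk_size
    first_block = chunk_start // block_size
    last_block = pos // block_size
    return range(first_block, last_block + 1)


def select_blocks(block_table: list[int], pos: int, chunk_size: int,
                  block_size: int) -> list[int]:
    """Translate the logical block indices into physical block ids."""
    return [block_table[i] for i in chunked_block_range(pos, chunk_size, block_size)]


# Hypothetical per-sequence block table: logical index -> physical block id.
block_table = [17, 4, 9, 30, 12]
blocks_pos_10 = select_blocks(block_table, pos=10, chunk_size=8, block_size=4)
blocks_pos_15 = select_blocks(block_table, pos=15, chunk_size=8, block_size=4)
blocks_pos_7 = select_blocks(block_table, pos=7, chunk_size=8, block_size=4)
```

When `chunk_size` is not a multiple of `block_size`, the first selected block can also hold positions from the previous chunk, so under these assumptions an additive bias is still needed to mask them out.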

Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py Outdated
self.model_has_chunked_attention = True
try:
    for layer in model.language_model.model.layers:
        if "ChunkedLocalAttention" in layer.self_attn.attn.get_attn_backend().__name__:

Copilot AI Jan 13, 2026

Catching a broad `except Exception:` without logging or handling the specific exception (lines 1613-1614) silently suppresses all errors. Consider logging the exception or catching specific exception types to aid debugging and avoid masking unexpected failures.
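
One way to act on this suggestion is sketched below. It reuses the attribute path from the snippet above but catches only `AttributeError` and logs it, so other failures still surface. The helper `has_chunked_attention`, the stand-in backend classes, and `make_model` are hypothetical and exist only to make the example runnable; the merged code may differ.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def has_chunked_attention(model) -> bool:
    """Return True if any layer reports a ChunkedLocalAttention backend.

    Only AttributeError is caught (the model layout differs from the
    expected one); any other exception propagates to the caller.
    """
    try:
        layers = model.language_model.model.layers
    except AttributeError as exc:
        logger.debug("No language_model.model.layers on model: %s", exc)
        return False
    for layer in layers:
        try:
            backend = layer.self_attn.attn.get_attn_backend()
        except AttributeError as exc:
            logger.debug("Layer without the expected attention layout: %s", exc)
            continue
        if "ChunkedLocalAttention" in backend.__name__:
            return True
    return False


# Stand-ins for real backends and models (hypothetical, for illustration only).
class ChunkedLocalAttentionBackend:
    pass


class PlainBackend:
    pass


def make_model(backends):
    layers = [
        SimpleNamespace(self_attn=SimpleNamespace(
            attn=SimpleNamespace(get_attn_backend=lambda b=b: b)))
        for b in backends
    ]
    return SimpleNamespace(
        language_model=SimpleNamespace(model=SimpleNamespace(layers=layers)))


chunked = has_chunked_attention(make_model([PlainBackend, ChunkedLocalAttentionBackend]))
plain = has_chunked_attention(make_model([PlainBackend]))
unknown_layout = has_chunked_attention(SimpleNamespace())
```

Logging at debug level keeps the common case quiet (models without chunked attention) while still leaving a trace when the layout is unexpected.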

Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py Outdated
self.scheduler_output: SchedulerOutput | None = None
self.warmup_mode: bool = False
self.batch_changed: bool = False
# WA for chunked attention support

Copilot AI Jan 13, 2026

The comment abbreviation 'WA' is unclear. Consider expanding to 'Workaround' or providing more context about why chunked attention requires special handling.

Suggested change
- # WA for chunked attention support
+ # Workaround flag for chunked attention support; toggled when special handling is required

Collaborator

@wpyszka wpyszka left a comment

needed in 0.13

Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py Outdated
Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>
@iboiko-habana
Collaborator

run_deepseek_v2_inc_dynamic_tp2_test failed because of CI issues. The test case will be disabled ASAP and fixed afterwards.

Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
72506c98349d6bcd32b4e33eec7b5513453c1502

@mgawarkiewicz-intel mgawarkiewicz-intel merged commit 620600d into vllm-project:releases/v0.13.0 Jan 15, 2026
50 checks passed