Add support for chunked attention (#597) by kfojcik-intel · Pull Request #809 · vllm-project/vllm-gaudi

kfojcik-intel · 2026-01-13T08:46:50Z

Cherry-pick of vllm-project@6e1be4e

Signed-off-by: Jan Kaniecki <jkaniecki@habana.ai>
Signed-off-by: Jan Kaniecki <jan.kaniecki@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR adds support for chunked attention to the vLLM-Gaudi implementation by cherry-picking a commit from the upstream repository. Chunked attention divides sequences into chunks and applies attention mechanisms within and across chunks, which can improve memory efficiency and performance for long sequences.
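As a rough illustration of the idea, and not the code from this PR, the sketch below builds a boolean mask for one common variant, chunked local attention, in which a query token attends causally only to keys inside its own chunk. The function name `chunked_causal_mask` and the `chunk_size` parameter are assumptions made for this example. The overview also mentions attention across chunks, which this sketch does not model.

```python
def chunked_causal_mask(seq_len: int, chunk_size: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Hypothetical helper for illustration only: a key is visible when it lies
    in the same chunk as the query and is not in the future (k <= q).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for q in range(seq_len):
        # First position of the chunk that contains q.
        chunk_start = (q // chunk_size) * chunk_size
        mask.append([chunk_start <= k <= q for k in range(seq_len)])
    return mask


mask = chunked_causal_mask(seq_len=6, chunk_size=4)
for row in mask:
    # "x" marks an allowed (query, key) pair, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

With `chunk_size=4`, position 4 starts a new chunk, so its row allows only key 4 and no longer sees keys 0 to 3. This is where the memory saving for long sequences comes from in this variant.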

Changes:

Added chunked attention bias calculation and block mapping methods
Extended attention metadata structures to include chunked attention fields
Integrated chunked attention support into the model execution pipeline
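To give a feel for what a block mapping under chunked attention might involve, here is a minimal sketch assuming a paged KV cache with fixed-size blocks and attention restricted to the current chunk. The function `blocks_for_decode_token` and its parameters are hypothetical and do not come from `hpu_model_runner.py`. It works on logical block indices only and ignores the physical block table.

```python
def blocks_for_decode_token(position: int, chunk_size: int, block_size: int) -> list[int]:
    """Logical KV-cache block indices holding the keys visible to `position`.

    Illustrative assumption: the token attends only to keys from the start of
    its own chunk up to and including itself.
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    if chunk_size <= 0 or block_size <= 0:
        raise ValueError("chunk_size and block_size must be positive")
    chunk_start = (position // chunk_size) * chunk_size
    first_block = chunk_start // block_size
    last_block = position // block_size
    # If chunk_size is not a multiple of block_size, the first block may also
    # contain keys from the previous chunk; those would still need masking.
    return list(range(first_block, last_block + 1))


print(blocks_for_decode_token(position=700, chunk_size=512, block_size=128))
```

In this example the token at position 700 sits in the chunk starting at 512, so only blocks 4 and 5 are needed, not all six blocks written so far.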

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
vllm_gaudi/v1/worker/hpu_model_runner.py	Core implementation of chunked attention logic including bias calculation, block mapping, metadata updates, and model initialization
vllm_gaudi/v1/attention/backends/hpu_attn.py	Updated attention metadata factory method to accept chunked attention parameters
vllm_gaudi/attention/backends/hpu_attn.py	Extended metadata dataclass with chunked attention fields and integrated chunked attention into forward pass


Copilot · 2026-01-13T08:48:08Z

+                    self.model_has_chunked_attention = True
+                    try:
+                        for layer in model.language_model.model.layers:
+                            if "ChunkedLocalAttention" in layer.self_attn.attn.get_attn_backend().__name__:


Using a bare except Exception: without logging or handling the specific exception (lines 1613-1614) silently suppresses all errors. Consider logging the exception or catching specific exception types to aid debugging and avoid masking unexpected failures.
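One way to act on this suggestion is sketched below. It keeps the attribute chain from the diff (`layer.self_attn.attn.get_attn_backend().__name__`), but the function name `has_chunked_local_attention`, the choice of `AttributeError` as the specific exception, and the stand-in layer objects are assumptions made for the example, not the change that was merged.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def has_chunked_local_attention(layers) -> bool:
    """Return True if any layer's attention backend name mentions ChunkedLocalAttention."""
    for index, layer in enumerate(layers):
        try:
            backend = layer.self_attn.attn.get_attn_backend()
        except AttributeError as exc:
            # Catch only the expected failure (a layer without this attribute
            # chain) and log it, so unexpected errors still surface.
            logger.warning("Skipping layer %d while probing for chunked attention: %s", index, exc)
            continue
        if "ChunkedLocalAttention" in getattr(backend, "__name__", ""):
            return True
    return False


# Hypothetical stand-ins for real backends and decoder layers.
class ChunkedLocalAttentionBackend:
    pass


class PlainBackend:
    pass


def make_layer(backend):
    attn = SimpleNamespace(get_attn_backend=lambda: backend)
    return SimpleNamespace(self_attn=SimpleNamespace(attn=attn))


layers = [make_layer(PlainBackend), SimpleNamespace(), make_layer(ChunkedLocalAttentionBackend)]
print(has_chunked_local_attention(layers))
```

The second stand-in layer has no `self_attn` attribute, so it is logged and skipped without aborting the probe.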

Copilot · 2026-01-13T08:48:08Z

        self.scheduler_output: SchedulerOutput | None = None
        self.warmup_mode: bool = False
        self.batch_changed: bool = False
+        # WA for chunked attention support


The comment abbreviation 'WA' is unclear. Consider expanding to 'Workaround' or providing more context about why chunked attention requires special handling.

Suggested change

-        # WA for chunked attention support
+        # Workaround flag for chunked attention support; toggled when special handling is required

wpyszka

needed in 0.13

Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>

iboiko-habana · 2026-01-14T09:59:49Z

run_deepseek_v2_inc_dynamic_tp2_test failed because of CI issues. The test case will be disabled as soon as possible and fixed afterwards.

github-actions · 2026-01-15T07:50:04Z

✅ CI Passed

All checks passed successfully against the following vllm commit:
72506c98349d6bcd32b4e33eec7b5513453c1502

kfojcik-intel requested a review from mgawarkiewicz-intel as a code owner January 13, 2026 08:46

Copilot AI review requested due to automatic review settings January 13, 2026 08:46

kfojcik-intel requested review from piotrbocian and wpyszka as code owners January 13, 2026 08:46

Copilot AI reviewed Jan 13, 2026

github-actions Bot mentioned this pull request Jan 13, 2026

🚦 Team Review Dashboard #701

Open

wpyszka approved these changes Jan 13, 2026

Luca-Calabria reviewed Jan 13, 2026

Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py Outdated

kfojcik-intel added 4 commits January 14, 2026 10:12

Add chunked_block args to hpu eagle

ca1878d

Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>

Fix to attn chunk size check

80c3fd5

Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>

Refactor

0ec773f

Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>

Refactor

a52b271

Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>

ksmusz reviewed Jan 14, 2026

Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py

Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py

Merge branch 'releases/v0.13.0' into dev/kfojcik/chunked_attn_0_13_0

eddea52

mgawarkiewicz-intel merged commit 620600d into vllm-project:releases/v0.13.0 Jan 15, 2026
50 checks passed

mgawarkiewicz-intel merged 6 commits into vllm-project:releases/v0.13.0 from kfojcik-intel:dev/kfojcik/chunked_attn_0_13_0

8 participants
