Add support for chunked attention #821
Conversation
Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>
Pull request overview
This PR adds support for chunked attention by adapting a cherry-picked commit to work with recent changes. The implementation introduces metadata fields, bias calculations, and block mapping logic specifically for handling chunked attention patterns.
Changes:
- Added chunked attention metadata fields and processing logic to support models with chunked attention patterns
- Implemented attention bias calculation for chunked attention in both prefill and decode phases (the masking pattern is sketched below)
- Added automatic detection and configuration of chunked attention layers based on model config
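As background for these changes, the sketch below shows the masking pattern that chunked attention implies: causal attention restricted to fixed-size chunks. It is a minimal, hardware-agnostic illustration in plain PyTorch; the helper name `chunked_causal_mask` and its arguments are illustrative and not part of this PR's HPU implementation.

```python
import torch

def chunked_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Token i may attend to token j only when j <= i (causal) and both
    tokens fall inside the same attention chunk of size chunk_size."""
    idx = torch.arange(seq_len)
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)  # rows are i, columns are j: keep j <= i
    same_chunk = (idx // chunk_size).unsqueeze(1) == (idx // chunk_size).unsqueeze(0)
    return causal & same_chunk

# With seq_len=6 and chunk_size=3, token 3 starts a fresh chunk and
# cannot attend back to tokens 0-2.
print(chunked_causal_mask(6, 3).int())
```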
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Core implementation including metadata processing, block mapping, bias calculation, and model detection for chunked attention |
| vllm_gaudi/v1/spec_decode/hpu_eagle.py | Added chunked attention metadata parameters to EAGLE speculative decoding |
| vllm_gaudi/v1/attention/backends/hpu_attn.py | Updated attention metadata factory method to include chunked attention parameters |
| vllm_gaudi/attention/backends/hpu_attn.py | Added chunked attention fields to metadata class and implementation logic in attention forward pass |
```python
def maybe_set_chunked_attention_layers(self, model):
    if hasattr(model.config, 'text_config') and \
       hasattr(model.config.text_config, 'attention_chunk_size') and \
       model.config.text_config.attention_chunk_size:
        self.model_has_chunked_attention = True
        try:
            for layer in model.language_model.model.layers:
                if "ChunkedLocalAttention" in layer.self_attn.attn.get_attn_backend().__name__:
                    layer.self_attn.attn.impl.is_chunked_attention = True
        except Exception:
            pass
```
The bare `except Exception: pass` silently suppresses all errors without logging, which makes debugging difficult if the chunked attention setup fails. Add logging that records when this exception occurs, including the exception details.
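One possible shape for that fix, sketched with the stdlib `logging` module for illustration (the codebase may prefer its own logger helper):

```python
import logging

logger = logging.getLogger(__name__)

try:
    for layer in model.language_model.model.layers:
        if "ChunkedLocalAttention" in layer.self_attn.attn.get_attn_backend().__name__:
            layer.self_attn.attn.impl.is_chunked_attention = True
except Exception as e:
    # Record what went wrong instead of swallowing the error silently.
    logger.warning("Skipping chunked attention layer setup: %s", e)
```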
```python
                             padded_batch_size * num_tokens)

if self.model_has_chunked_attention:
    chunk_size_in_blocks = (self.model.model.config.text_config.attention_chunk_size // self.block_size)
```
The division could result in zero if `attention_chunk_size` is smaller than `block_size`, leading to incorrect chunking behavior. Add validation to ensure `attention_chunk_size` is at least equal to `block_size`, or handle the zero case appropriately.
Suggested change:

```diff
-chunk_size_in_blocks = (self.model.model.config.text_config.attention_chunk_size // self.block_size)
+attention_chunk_size = self.model.model.config.text_config.attention_chunk_size
+if attention_chunk_size < self.block_size:
+    raise ValueError(
+        f"Configured attention_chunk_size ({attention_chunk_size}) must be at least "
+        f"as large as block_size ({self.block_size}) when using chunked attention."
+    )
+chunk_size_in_blocks = attention_chunk_size // self.block_size
```
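For example, with `attention_chunk_size = 8192` and `block_size = 128` the chunk spans 64 blocks, whereas `attention_chunk_size = 64` against `block_size = 128` floor-divides to 0, a configuration the suggested check now rejects explicitly.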
```python
max_context_len = (block_list.size(-1) // batch_size if block_list is not None else 0)
max_context_len = max_context_len * self.block_size
```
The division could result in zero or an incorrect value if `block_list.size(-1)` is smaller than `batch_size`, which could lead to an incorrect attention bias calculation. Add validation, or use `math.ceil` for the division to ensure proper handling of partial blocks.
Suggested change:

```diff
-max_context_len = (block_list.size(-1) // batch_size if block_list is not None else 0)
-max_context_len = max_context_len * self.block_size
+if block_list is not None and batch_size > 0:
+    # Compute number of blocks per sequence using ceiling division to handle partial blocks.
+    blocks_per_seq = math.ceil(block_list.size(-1) / batch_size)
+    max_context_len = blocks_per_seq * self.block_size
+else:
+    max_context_len = 0
```
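As a concrete case: with `block_list.size(-1) = 10` and `batch_size = 4`, floor division yields 2 blocks per sequence and silently drops the partial block, while `math.ceil(10 / 4) = 3` keeps `max_context_len` large enough to cover it.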
```python
       (past_indices.unsqueeze(0).unsqueeze(0) > invalid_lens_t.unsqueeze(-1)) &
       (past_indices.unsqueeze(0).unsqueeze(0) < context_lens_t.unsqueeze(-1).unsqueeze(-1))).unsqueeze(1)

causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device), diagonal=shift)
```
Indexing with `[0]` assumes `which_chunk` has at least one element. While this may be guaranteed by the context, the assumption is not immediately clear. Consider adding a comment explaining why the first element is used, or add an assertion to document this assumption.
Suggested change:

```diff
 causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device), diagonal=shift)
+# which_chunk is expected to have at least one row (batch dimension > 0) in this code path.
+assert which_chunk.size(0) > 0, "which_chunk is expected to have at least one row"
```
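For reference, a standalone illustration (not from the PR) of how the `diagonal` argument of `torch.tril` shifts the causal boundary in masks like the one above:

```python
import torch

# diagonal=0 is the standard causal mask; a negative shift hides the most
# recent positions, a positive shift exposes positions beyond the diagonal.
print(torch.tril(torch.ones(4, 4, dtype=torch.bool), diagonal=0).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```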
✅ CI Passed. All checks passed successfully against the following vllm commit:
Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>
🚧 CI Blocked. The main CI workflow was not started for the following reason:
✅ CI Passed. All checks passed successfully against the following vllm commit:
Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>
✅ CI Passed. All checks passed successfully against the following vllm commit:
…#855 (#881)
Cherry pick missing fixes: chunked attention fixes from #821, llama4 32k+ context window #855
---------
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Signed-off-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Signed-off-by: linoy buchnik <lbuchnik@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Radosław Smyrek <radoslawx.smyrek@intel.com>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Cherry-pick of vllm-project@6e1be4e but adapted to recent changes in vllm-project#526
---------
Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>
Signed-off-by: Wang, Zheng W <zheng.w.wang@intel.com>