Add support for chunked attention #821
Conversation
Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>
Pull request overview
This PR adds support for chunked attention by adapting a cherry-picked commit to work with recent changes. The implementation introduces metadata fields, bias calculations, and block mapping logic specifically for handling chunked attention patterns.
Changes:
- Added chunked attention metadata fields and processing logic to support models with chunked attention patterns
- Implemented attention bias calculation for chunked attention in both prefill and decode phases (the masking pattern is sketched below)
- Added automatic detection and configuration of chunked attention layers based on model config
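As background for these changes, the sketch below shows the masking pattern that chunked attention implies: causal attention restricted to fixed-size chunks. It is a minimal, hardware-agnostic illustration in plain PyTorch; the helper name `chunked_causal_mask` and its arguments are illustrative and not part of this PR's HPU implementation.

```python
import torch

def chunked_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Token i may attend to token j only when j <= i (causal) and both
    tokens fall inside the same attention chunk of size chunk_size."""
    idx = torch.arange(seq_len)
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)  # rows are i, columns are j: keep j <= i
    same_chunk = (idx // chunk_size).unsqueeze(1) == (idx // chunk_size).unsqueeze(0)
    return causal & same_chunk

# With seq_len=6 and chunk_size=3, token 3 starts a fresh chunk and
# cannot attend back to tokens 0-2.
print(chunked_causal_mask(6, 3).int())
```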
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Core implementation including metadata processing, block mapping, bias calculation, and model detection for chunked attention |
| vllm_gaudi/v1/spec_decode/hpu_eagle.py | Added chunked attention metadata parameters to EAGLE speculative decoding |
| vllm_gaudi/v1/attention/backends/hpu_attn.py | Updated attention metadata factory method to include chunked attention parameters |
| vllm_gaudi/attention/backends/hpu_attn.py | Added chunked attention fields to metadata class and implementation logic in attention forward pass |
```python
def maybe_set_chunked_attention_layers(self, model):
    if hasattr(model.config, 'text_config') and \
       hasattr(model.config.text_config, 'attention_chunk_size') and \
       model.config.text_config.attention_chunk_size:
        self.model_has_chunked_attention = True
        try:
            for layer in model.language_model.model.layers:
                if "ChunkedLocalAttention" in layer.self_attn.attn.get_attn_backend().__name__:
                    layer.self_attn.attn.impl.is_chunked_attention = True
        except Exception:
            pass
```
The bare `except Exception: pass` silently suppresses all errors without logging, which makes debugging difficult if the chunked attention setup fails. Add logging that records when this exception occurs, including the exception details.
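One possible shape for that fix, sketched with the stdlib `logging` module for illustration (the codebase may prefer its own logger helper):

```python
import logging

logger = logging.getLogger(__name__)

try:
    for layer in model.language_model.model.layers:
        if "ChunkedLocalAttention" in layer.self_attn.attn.get_attn_backend().__name__:
            layer.self_attn.attn.impl.is_chunked_attention = True
except Exception as e:
    # Record what went wrong instead of swallowing the error silently.
    logger.warning("Skipping chunked attention layer setup: %s", e)
```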
```python
                             padded_batch_size * num_tokens)

if self.model_has_chunked_attention:
    chunk_size_in_blocks = (self.model.model.config.text_config.attention_chunk_size // self.block_size)
```
The division could result in zero if `attention_chunk_size` is smaller than `block_size`, leading to incorrect chunking behavior. Add validation to ensure `attention_chunk_size` is at least equal to `block_size`, or handle the zero case appropriately.
Suggested change:

```diff
-chunk_size_in_blocks = (self.model.model.config.text_config.attention_chunk_size // self.block_size)
+attention_chunk_size = self.model.model.config.text_config.attention_chunk_size
+if attention_chunk_size < self.block_size:
+    raise ValueError(
+        f"Configured attention_chunk_size ({attention_chunk_size}) must be at least "
+        f"as large as block_size ({self.block_size}) when using chunked attention."
+    )
+chunk_size_in_blocks = attention_chunk_size // self.block_size
```
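For example, with `attention_chunk_size = 8192` and `block_size = 128` the chunk spans 64 blocks, whereas `attention_chunk_size = 64` against `block_size = 128` floor-divides to 0, a configuration the suggested check now rejects explicitly.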
```python
max_context_len = (block_list.size(-1) // batch_size if block_list is not None else 0)
max_context_len = max_context_len * self.block_size
```
The division could result in zero or an incorrect value if `block_list.size(-1)` is smaller than `batch_size`, which could lead to an incorrect attention bias calculation. Add validation, or use `math.ceil` for the division to ensure proper handling of partial blocks.
Suggested change:

```diff
-max_context_len = (block_list.size(-1) // batch_size if block_list is not None else 0)
-max_context_len = max_context_len * self.block_size
+if block_list is not None and batch_size > 0:
+    # Compute number of blocks per sequence using ceiling division to handle partial blocks.
+    blocks_per_seq = math.ceil(block_list.size(-1) / batch_size)
+    max_context_len = blocks_per_seq * self.block_size
+else:
+    max_context_len = 0
```
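As a concrete case: with `block_list.size(-1) = 10` and `batch_size = 4`, floor division yields 2 blocks per sequence and silently drops the partial block, while `math.ceil(10 / 4) = 3` keeps `max_context_len` large enough to cover it.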
```python
       (past_indices.unsqueeze(0).unsqueeze(0) > invalid_lens_t.unsqueeze(-1)) &
       (past_indices.unsqueeze(0).unsqueeze(0) < context_lens_t.unsqueeze(-1).unsqueeze(-1))).unsqueeze(1)

causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device), diagonal=shift)
```
Indexing with `[0]` assumes `which_chunk` has at least one element. While this may be guaranteed by the context, the assumption is not immediately clear. Consider adding a comment explaining why the first element is used, or add an assertion to document this assumption.
Suggested change:

```diff
 causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device), diagonal=shift)
+# which_chunk is expected to have at least one row (batch dimension > 0) in this code path.
+assert which_chunk.size(0) > 0, "which_chunk is expected to have at least one row"
```
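For reference, a standalone illustration (not from the PR) of how the `diagonal` argument of `torch.tril` shifts the causal boundary in masks like the one above:

```python
import torch

# diagonal=0 is the standard causal mask; a negative shift hides the most
# recent positions, a positive shift exposes positions beyond the diagonal.
print(torch.tril(torch.ones(4, 4, dtype=torch.bool), diagonal=0).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```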
✅ CI Passed. All checks passed successfully against the following vllm commit:
Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>
🚧 CI Blocked. The main CI workflow was not started for the following reason:
✅ CI Passed. All checks passed successfully against the following vllm commit:
Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>
✅ CI Passed. All checks passed successfully against the following vllm commit:
…#855 (#881)
Cherry pick missing fixes: chunked attention fixes from #821, llama4 32k+ context window #855
---------
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Signed-off-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Signed-off-by: linoy buchnik <lbuchnik@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Radosław Smyrek <radoslawx.smyrek@intel.com>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Cherry-pick of vllm-project@6e1be4e but adapted to recent changes in vllm-project#526
---------
Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>
Signed-off-by: Wang, Zheng W <zheng.w.wang@intel.com>