cherry-pick chunked attention from #821 + 32k+ context window fix from #855 (#881)
Conversation
Pull request overview
This PR cherry-picks two important fixes: chunked attention support from PR #821 and a fix for Llama4 models with 32k+ context windows from PR #855. The changes enable proper handling of chunked attention patterns and ensure correct attention metadata processing for models using attention chunking.
Changes:
- Added chunked attention support throughout the attention pipeline, including metadata handling and bias computation
- Fixed output tensor reshaping in fused MoE operations based on data parallel configuration
- Integrated chunked attention configuration detection and setup during model loading
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Core implementation of chunked attention support including metadata processing, bias calculation, and model configuration detection |
| vllm_gaudi/v1/spec_decode/hpu_eagle.py | Added chunked attention metadata fields to speculative decoding |
| vllm_gaudi/v1/attention/backends/hpu_attn.py | Updated attention metadata factory method to include chunked attention parameters |
| vllm_gaudi/ops/hpu_fused_moe.py | Fixed tensor reshaping logic for MoE operations with data parallelism |
| vllm_gaudi/attention/backends/hpu_attn.py | Added chunked attention metadata fields and selection logic in attention implementation |
```python
self.model_has_chunked_attention = True
try:
    for layer in model.language_model.model.layers:
        if "ChunkedLocalAttention" in layer.self_attn.attn.get_attn_backend().__name__:
```
The string-containment check (`in`) on class names is fragile and could match unintended class names. Consider using `isinstance()` or comparing the class's `__name__` for exact equality instead.
| if "ChunkedLocalAttention" in layer.self_attn.attn.get_attn_backend().__name__: | |
| backend = layer.self_attn.attn.get_attn_backend() | |
| backend_name = getattr(backend, "__name__", backend.__class__.__name__) | |
| if backend_name == "ChunkedLocalAttention": |
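If the backend class is importable where the detection runs, a type-based check avoids name matching entirely. A minimal sketch, assuming `ChunkedLocalAttention` can be imported (the import path below is a guess, not the actual module layout):

```python
# The import path here is illustrative; ChunkedLocalAttention may live
# elsewhere in the codebase.
from vllm_gaudi.attention.backends.hpu_attn import ChunkedLocalAttention

def uses_chunked_attention(layer) -> bool:
    backend = layer.self_attn.attn.get_attn_backend()
    # get_attn_backend() may hand back a class or an instance; cover both
    # without substring matching on the name.
    if isinstance(backend, type):
        return issubclass(backend, ChunkedLocalAttention)
    return isinstance(backend, ChunkedLocalAttention)
```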
```python
except Exception:
    pass
```
Catching and silently suppressing all exceptions with a bare `except Exception: pass` hides potential configuration or attribute errors. Consider logging the exception or catching more specific exception types.
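A minimal sketch of the logging alternative, mirroring the detection loop above; the `logger` setup is illustrative, not taken from the patch:

```python
import logging

logger = logging.getLogger(__name__)

try:
    for layer in model.language_model.model.layers:
        ...  # backend detection as in the snippet above
except AttributeError as e:
    # Only attribute lookups on the model hierarchy are expected to fail
    # here (e.g. a model without language_model.model.layers); anything
    # else should propagate rather than be silently suppressed.
    logger.debug("Chunked attention detection skipped: %s", e)
```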
```python
block_tables_chunk = [
    block_table[num_seq_chunks[i] * chunk_size_in_blocks:]
    for i, block_table in enumerate(block_tables_list)
]
```
There is duplicated logic between the chunked attention buffer generation (lines 2152-2164) and the similar pattern for window blocks. Consider extracting this into a helper method to reduce code duplication.
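One possible shape for such a helper; the name, signature, and the assumption that the window-block path differs only in its offset expression are all illustrative:

```python
def _slice_block_tables(block_tables_list, offsets_in_blocks):
    # Drop the first offsets_in_blocks[i] blocks of each sequence's block
    # table; both the chunked-attention buffers and the window blocks
    # reduce to this slicing pattern with different offsets.
    return [
        block_table[offsets_in_blocks[i]:]
        for i, block_table in enumerate(block_tables_list)
    ]

# Chunked-attention call site (offsets derived as in the original code):
block_tables_chunk = _slice_block_tables(
    block_tables_list,
    [n * chunk_size_in_blocks for n in num_seq_chunks],
)
```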
```python
if layer.dp_size > 1:
    return output.view(*(output.size(0), *input_shape[1:]))
else:
    return output.view(*input_shape)
```
The conditional reshaping logic based on `dp_size` lacks explanation. Add a comment explaining why different reshaping is needed when data parallelism is enabled versus disabled.
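A sketch of how that comment might read, assuming the intent is that data parallelism can change the local row count of `output`; this rationale is an inference, not confirmed by the patch:

```python
if layer.dp_size > 1:
    # Assumption: with data parallelism the MoE op may return a different
    # number of rows than the local input had (tokens are redistributed
    # across DP ranks), so only the trailing dimensions of the original
    # input shape can be reused.
    return output.view(*(output.size(0), *input_shape[1:]))
else:
    # Without DP the output is elementwise-aligned with the input and can
    # take back its exact shape.
    return output.view(*input_shape)
```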
Force-pushed from 288b1a7 to 1631984
✅ CI Passed: All checks passed successfully against the following vllm commit:
Force-pushed from 4fd940d to 40bdb59
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Due to MambaMixer2 implementation requirements, all buckets used for mamba must be a multiple of mamba chunk size.

Signed-off-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
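For reference, rounding a bucket up to the nearest multiple of the chunk size is a one-liner; a minimal sketch with illustrative names:

```python
def round_up_to_chunk(bucket_size: int, mamba_chunk_size: int) -> int:
    # Smallest multiple of mamba_chunk_size that is >= bucket_size.
    return -(-bucket_size // mamba_chunk_size) * mamba_chunk_size

assert round_up_to_chunk(1000, 256) == 1024  # 4 * 256
assert round_up_to_chunk(512, 256) == 512    # already a multiple
```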
1. #805
2. #837
3. #855
4. #862

---------

Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Signed-off-by: linoy buchnik <lbuchnik@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Force-pushed from bb5c13f to 0206d3f
🚧 CI Blocked: The main CI workflow was not started for the following reason:
…plus_context_fix

Signed-off-by: Luca Calabria <luca.calabria@intel.com>
✅ CI Passed: All checks passed successfully against the following vllm commit:
wpyszka left a comment:
fix is approved for 0.14.1
…ndow fix from vllm-project#855 (vllm-project#881)

Cherry pick missing fixes:
- chunked attention fixes from vllm-project#821
- llama4 32k+ context window vllm-project#855

---------

Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Signed-off-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Signed-off-by: linoy buchnik <lbuchnik@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Radosław Smyrek <radoslawx.smyrek@intel.com>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Cherry-pick missing fixes:
- chunked attention fixes from #821
- Llama4 32k+ context window fix from #855