
cherry-pick chunked attention from #821 + 32k+ context window fix from #855 (#881)

Merged
wpyszka merged 12 commits into vllm-project:releases/v0.14.1 from Luca-Calabria:cherry_pick_chunked_attn_and_32kplus_context_fix on Jan 28, 2026
Conversation

@Luca-Calabria (Contributor)
Cherry-pick missing fixes:

  • chunked attention fixes from #821
  • Llama4 32k+ context window fix from #855

Copilot AI review requested due to automatic review settings January 26, 2026 22:34

Copilot AI left a comment


Pull request overview

This PR cherry-picks two important fixes: chunked attention support from PR #821 and a fix for Llama4 models with 32k+ context windows from PR #855. The changes enable proper handling of chunked attention patterns and ensure correct attention metadata processing for models using attention chunking.

Changes:

  • Added chunked attention support throughout the attention pipeline, including metadata handling and bias computation
  • Fixed output tensor reshaping in fused MoE operations based on data parallel configuration
  • Integrated chunked attention configuration detection and setup during model loading
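For intuition only, here is a minimal sketch of the kind of attention bias a chunked-local pattern implies: each token attends causally, but only to tokens within its own fixed-size chunk. The function name and shapes are illustrative assumptions, not the PR's actual bias-computation code.

```python
import torch

def chunked_attention_bias(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Causal bias restricted to fixed-size chunks: token i may attend to
    token j only if j <= i and both fall in the same chunk
    (i // chunk_size == j // chunk_size)."""
    idx = torch.arange(seq_len)
    same_chunk = (idx[:, None] // chunk_size) == (idx[None, :] // chunk_size)
    causal = idx[:, None] >= idx[None, :]
    allowed = same_chunk & causal
    # Disallowed positions get -inf so softmax assigns them zero weight.
    bias = torch.zeros(seq_len, seq_len)
    bias.masked_fill_(~allowed, float("-inf"))
    return bias

bias = chunked_attention_bias(seq_len=6, chunk_size=3)
```

With `chunk_size=3`, token 3 starts a new chunk and cannot attend to tokens 0-2, which is what bounds the attention cost for very long (32k+) contexts.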

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Reviewed files:

  • vllm_gaudi/v1/worker/hpu_model_runner.py: Core implementation of chunked attention support, including metadata processing, bias calculation, and model configuration detection
  • vllm_gaudi/v1/spec_decode/hpu_eagle.py: Added chunked attention metadata fields to speculative decoding
  • vllm_gaudi/v1/attention/backends/hpu_attn.py: Updated the attention metadata factory method to include chunked attention parameters
  • vllm_gaudi/ops/hpu_fused_moe.py: Fixed tensor reshaping logic for MoE operations with data parallelism
  • vllm_gaudi/attention/backends/hpu_attn.py: Added chunked attention metadata fields and selection logic in the attention implementation


```python
self.model_has_chunked_attention = True
try:
    for layer in model.language_model.model.layers:
        if "ChunkedLocalAttention" in layer.self_attn.attn.get_attn_backend().__name__:
```

Copilot AI Jan 26, 2026


The string containment (`in`) check on the class name is fragile and could match unintended class names. Consider using isinstance() or comparing `__name__` with exact equality instead.

Suggested change:

```diff
-if "ChunkedLocalAttention" in layer.self_attn.attn.get_attn_backend().__name__:
+backend = layer.self_attn.attn.get_attn_backend()
+backend_name = getattr(backend, "__name__", backend.__class__.__name__)
+if backend_name == "ChunkedLocalAttention":
```

Copilot uses AI. Check for mistakes.
Comment on lines +1433 to +1434
```python
except Exception:
    pass
```

Copilot AI Jan 26, 2026


Catching and silently suppressing all exceptions with bare 'except Exception: pass' hides potential configuration or attribute errors. Consider logging the exception or catching more specific exception types.
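As a sketch of the reviewer's point (this is an illustrative restructuring, not the PR's code; `detect_chunked_attention` and the choice of log level are assumptions), the detection loop could catch only the attribute errors expected from models with a different submodule layout, and log them instead of silently passing:

```python
import logging

logger = logging.getLogger(__name__)

def detect_chunked_attention(model) -> bool:
    """Return True if any layer reports a ChunkedLocalAttention backend.

    Hypothetical rework of the detection loop: narrow the except clause
    and log, rather than `except Exception: pass`.
    """
    try:
        for layer in model.language_model.model.layers:
            backend = layer.self_attn.attn.get_attn_backend()
            name = getattr(backend, "__name__", type(backend).__name__)
            if name == "ChunkedLocalAttention":
                return True
    except AttributeError as e:
        # Models without the expected language_model.model.layers layout
        # land here; any other exception (a real bug) propagates.
        logger.debug("Chunked-attention detection skipped: %s", e)
    return False
```

Narrowing to `AttributeError` keeps the "this model just doesn't have that shape" case quiet while letting genuine configuration errors surface.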

```python
block_tables_chunk = [
    block_table[num_seq_chunks[i] * chunk_size_in_blocks:]
    for i, block_table in enumerate(block_tables_list)
]
```

Copilot AI Jan 26, 2026


There is duplicated logic between the chunked attention buffer generation (lines 2152-2164) and the similar pattern for window blocks. Consider extracting this into a helper method to reduce code duplication.
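A sketch of the suggested extraction, using a hypothetical `slice_block_tables` helper (the name and signature are illustrative assumptions, not the PR's code). Both the chunked-attention and window-block paths could then call the same function:

```python
from typing import Sequence

def slice_block_tables(block_tables_list: Sequence[Sequence[int]],
                       num_seq_chunks: Sequence[int],
                       chunk_size_in_blocks: int) -> list[list[int]]:
    """Drop the first num_seq_chunks[i] * chunk_size_in_blocks blocks of
    each sequence's block table, mirroring the comprehension above."""
    return [
        list(block_table[num_seq_chunks[i] * chunk_size_in_blocks:])
        for i, block_table in enumerate(block_tables_list)
    ]

# With chunk_size_in_blocks=2, the first sequence skips one chunk
# (2 blocks) and the second skips none.
chunks = slice_block_tables([[1, 2, 3, 4], [5, 6]], [1, 0], 2)
```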

Comment on lines +163 to +166
```python
if layer.dp_size > 1:
    return output.view(*(output.size(0), *input_shape[1:]))
else:
    return output.view(*input_shape)

Copilot AI Jan 26, 2026


The conditional reshaping logic based on dp_size lacks explanation. Add a comment explaining why different reshaping is needed when data parallelism is enabled versus disabled.
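One way to address the comment, sketched with a hypothetical `reshape_moe_output` helper; the rationale stated in the comments is an assumption about why the shapes differ under data parallelism, not taken from the PR:

```python
import torch

def reshape_moe_output(output: torch.Tensor,
                       input_shape: torch.Size,
                       dp_size: int) -> torch.Tensor:
    if dp_size > 1:
        # Assumption: with data parallelism the token dimension of `output`
        # may differ from input_shape[0] (e.g. after cross-rank padding or
        # gather), so keep output's own first dim and restore only the
        # trailing dims from the original input shape.
        return output.view(output.size(0), *input_shape[1:])
    # Without DP, output has exactly as many elements as the input,
    # so the full original shape can be restored.
    return output.view(*input_shape)
```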

@Luca-Calabria Luca-Calabria force-pushed the cherry_pick_chunked_attn_and_32kplus_context_fix branch 2 times, most recently from 288b1a7 to 1631984 Compare January 26, 2026 23:49
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
d7de043d55d1dd629554467e23874097e1c48993

@Luca-Calabria Luca-Calabria force-pushed the cherry_pick_chunked_attn_and_32kplus_context_fix branch from 4fd940d to 40bdb59 Compare January 28, 2026 13:08
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

github-actions Bot and others added 11 commits January 28, 2026 14:19
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Due to MambaMixer2 implementation requirements, all buckets used for
mamba must be a multiple of mamba chunk size.

Signed-off-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Reverts #780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
1. #805
2. #837
3. #855
4. #862

---------

Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Signed-off-by: linoy buchnik <lbuchnik@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
1. #805
2. #837
3. #855
4. #862

---------

Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Signed-off-by: linoy buchnik <lbuchnik@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
@Luca-Calabria Luca-Calabria force-pushed the cherry_pick_chunked_attn_and_32kplus_context_fix branch from bb5c13f to 0206d3f Compare January 28, 2026 13:19
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

…plus_context_fix

Signed-off-by: Luca Calabria <luca.calabria@intel.com>
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
d7de043d55d1dd629554467e23874097e1c48993

@wpyszka (Collaborator) left a comment


fix is approved for 0.14.1

@wpyszka wpyszka merged commit 82b0e8a into vllm-project:releases/v0.14.1 Jan 28, 2026
53 checks passed
@Luca-Calabria Luca-Calabria deleted the cherry_pick_chunked_attn_and_32kplus_context_fix branch January 29, 2026 09:01
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Jan 29, 2026
…ndow fix from vllm-project#855 (vllm-project#881)

Cherry pick missing fixes:
chunked attention fixes from
vllm-project#821
llama4 32k+ context window
vllm-project#855

---------

Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Signed-off-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Signed-off-by: linoy buchnik <lbuchnik@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jakub Byczkowski <jbyczkowski@habana.ai>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Radosław Smyrek <radoslawx.smyrek@intel.com>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
Luca-Calabria added a commit to Luca-Calabria/vllm-gaudi that referenced this pull request Feb 6, 2026
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
wpyszka pushed a commit that referenced this pull request Feb 9, 2026
Cherry pick Llama4 missing fixes from #881 #862 #884 on releases/ branch

Signed-off-by: Luca Calabria <luca.calabria@intel.com>
wpyszka added a commit that referenced this pull request Feb 9, 2026
Added Llama4 missing fixes from #881 #862 #884 on main branch

---------

Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Co-authored-by: Wojciech Pyszka <wpyszka@habana.ai>
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026
Added Llama4 missing fixes from #881 #862 #884 on main branch

---------

Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Co-authored-by: Wojciech Pyszka <wpyszka@habana.ai>
