cherry-pick chunked attention from #821 + 32k+ context window fix from #855 #870
Luca-Calabria wants to merge 4 commits into
Conversation
Pull request overview
This pull request adds support for chunked attention by cherry-picking fixes from PR #821. The changes implement the infrastructure needed to handle models that use chunked attention patterns, ensuring proper attention bias computation and block mapping for both prefill and decode phases.
Changes:
- Added chunked attention detection and initialization logic
- Extended attention metadata structures with chunked-specific fields
- Implemented chunked attention bias computation for prefill and decode phases
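As a rough illustration of the bias computation described above: in chunked attention, a query token may only attend to earlier tokens inside its own fixed-size chunk. The sketch below is a hypothetical NumPy formulation for clarity, not the actual hpu_model_runner.py code; the function name and parameters are assumptions.

```python
import numpy as np

def chunked_attention_bias(seq_len: int, chunk_size: int) -> np.ndarray:
    """Build an additive attention bias where position i may attend to
    position j only if j <= i (causal) and both fall in the same chunk.
    Allowed pairs get 0.0; disallowed pairs get -inf (masked out)."""
    pos = np.arange(seq_len)
    same_chunk = (pos[:, None] // chunk_size) == (pos[None, :] // chunk_size)
    causal = pos[:, None] >= pos[None, :]
    return np.where(same_chunk & causal, 0.0, -np.inf)
```

For example, with `chunk_size=2` and `seq_len=4`, token 2 cannot attend to token 1 even though it is causally earlier, because they sit in different chunks.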
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Core implementation of chunked attention support including model detection, metadata processing, attention bias computation, and block mapping |
| vllm_gaudi/v1/spec_decode/hpu_eagle.py | Added chunked attention metadata fields to speculative decoding |
| vllm_gaudi/v1/attention/backends/hpu_attn.py | Updated attention metadata creation with chunked attention parameters |
| vllm_gaudi/attention/backends/hpu_attn.py | Extended attention metadata dataclass and added chunked attention handling in forward pass |
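The block-mapping side of the change can be pictured with a small sketch: during decode, a token's KV-cache reads should be limited to the paged-attention blocks covering its current chunk. The helper below is a hypothetical illustration under assumed `chunk_size`/`block_size` semantics, not the actual block-mapping code in the files above.

```python
def chunk_block_range(pos: int, chunk_size: int, block_size: int) -> tuple[int, int]:
    """For a decode token at absolute position `pos`, return the [start, end)
    range of KV-cache block indices that lie inside its attention chunk.
    Assumes chunk_size and block_size are fixed, positive constants."""
    chunk_start = (pos // chunk_size) * chunk_size  # first position of the chunk
    first_block = chunk_start // block_size         # block holding the chunk start
    last_block = pos // block_size + 1              # block holding `pos`, exclusive end
    return first_block, last_block
```

A decode step would then fetch only blocks in that range rather than the full sequence, which is what keeps 32k+ contexts tractable for chunked-attention models.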
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Luca Calabria <luca.calabria@intel.com>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Cherry-pick missing fixes:
- chunked attention fixes from #821
- llama4 32k+ context window #855