Add support for chunked attention (#597) #809
Cherry-pick of vllm-project@6e1be4e

---------

Signed-off-by: Jan Kaniecki <jkaniecki@habana.ai>
Signed-off-by: Jan Kaniecki <jan.kaniecki@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
This PR adds support for chunked attention to the vLLM-Gaudi implementation by cherry-picking a commit from the upstream repository. Chunked attention divides sequences into chunks and applies attention mechanisms within and across chunks, which can improve memory efficiency and performance for long sequences.
Changes:
- Added chunked attention bias calculation and block mapping methods
- Extended attention metadata structures to include chunked attention fields
- Integrated chunked attention support into the model execution pipeline
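To make the idea of a chunked attention bias concrete, here is a minimal sketch in plain Python. It assumes the chunked local variant, in which a query position attends only to earlier positions inside its own chunk; that reading is an assumption, and the sketch is not the bias calculation added in `hpu_model_runner.py`. The names `chunked_causal_mask` and `mask_to_bias` are hypothetical.

```python
def chunked_causal_mask(seq_len: int, chunk_size: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Assumption: attention is causal and restricted to the chunk containing q.
    """
    if seq_len < 0 or chunk_size <= 0:
        raise ValueError("seq_len must be >= 0 and chunk_size must be > 0")
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            same_chunk = (q // chunk_size) == (k // chunk_size)
            causal = k <= q
            row.append(same_chunk and causal)
        mask.append(row)
    return mask


def mask_to_bias(mask: list[list[bool]]) -> list[list[float]]:
    """Convert a boolean mask into an additive bias: 0.0 where allowed, -inf elsewhere."""
    neg_inf = float("-inf")
    return [[0.0 if allowed else neg_inf for allowed in row] for row in mask]


if __name__ == "__main__":
    # Print the mask for 6 tokens in chunks of 3; "x" marks an allowed position.
    for row in chunked_causal_mask(seq_len=6, chunk_size=3):
        print("".join("x" if allowed else "." for allowed in row))
```

With `seq_len=6` and `chunk_size=3`, position 3 starts a new chunk and can see only itself, while position 5 sees positions 3 through 5. The additive bias form is what would typically be summed with the attention scores before the softmax.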
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Core implementation of chunked attention logic including bias calculation, block mapping, metadata updates, and model initialization |
| vllm_gaudi/v1/attention/backends/hpu_attn.py | Updated attention metadata factory method to accept chunked attention parameters |
| vllm_gaudi/attention/backends/hpu_attn.py | Extended metadata dataclass with chunked attention fields and integrated chunked attention into forward pass |
    self.model_has_chunked_attention = True
    try:
        for layer in model.language_model.model.layers:
            if "ChunkedLocalAttention" in layer.self_attn.attn.get_attn_backend().__name__:
Catching a broad `except Exception:` (lines 1613-1614) without logging or otherwise handling it silently suppresses all errors. Consider logging the exception or catching specific exception types to aid debugging and avoid masking unexpected failures.
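One way to act on this comment is sketched below. It is an illustration under assumptions, not the code in this PR: the helper name `model_uses_chunked_attention` is hypothetical, and the sketch assumes that a model without the `language_model.model.layers` layout surfaces as an `AttributeError`. Only the attribute path and the `"ChunkedLocalAttention"` substring check are taken from the diff above.

```python
import logging

logger = logging.getLogger(__name__)


def model_uses_chunked_attention(model) -> bool:
    """Return True if any decoder layer reports a ChunkedLocalAttention backend.

    Hypothetical helper: it narrows the exception type and logs the reason
    instead of swallowing every error.
    """
    try:
        layers = model.language_model.model.layers
    except AttributeError as exc:
        # The model does not follow the expected layout; record why detection was skipped.
        logger.debug("Chunked attention detection skipped: %s", exc)
        return False

    for layer in layers:
        try:
            backend_name = layer.self_attn.attn.get_attn_backend().__name__
        except AttributeError as exc:
            # This layer has no attention backend in the expected place; log and move on.
            logger.debug("Layer without an inspectable attention backend: %s", exc)
            continue
        if "ChunkedLocalAttention" in backend_name:
            return True
    return False
```

The result could then be assigned to a flag such as `self.model_has_chunked_attention`. Any exception other than `AttributeError` propagates to the caller, so unexpected failures stay visible.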
    self.scheduler_output: SchedulerOutput | None = None
    self.warmup_mode: bool = False
    self.batch_changed: bool = False
    # WA for chunked attention support
The abbreviation 'WA' in the comment is unclear. Consider expanding it to 'Workaround' or adding context about why chunked attention requires special handling.
    - # WA for chunked attention support
    + # Workaround flag for chunked attention support; toggled when special handling is required
Signed-off-by: Katarzyna Fojcik <kfojcik@habana.ai>
run_deepseek_v2_inc_dynamic_tp2_test failed because of CI issues. The test case will be disabled ASAP and fixed afterwards.
✅ CI Passed
All checks passed successfully against the following vllm commit:
620600d
into
vllm-project:releases/v0.13.0
Cherry-pick of 6e1be4e