[FIX_FOR_VLLM_CUSTOM=d28d86e8a34bf2617be294c235d6e6ef3321917b] #1279
iboiko-habana merged 16 commits into vllm-project:main
Conversation
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Pull request overview
This PR updates the Gaudi plugin for compatibility with upstream vLLM changes (referenced in the PR description), primarily around KV offloading layout/metadata and some model import path adjustments.
Changes:
- Refactors the CPU↔HPU KV offload handler wiring to consume `CanonicalKVCaches`/page-based KV cache metadata and allocates CPU KV buffers accordingly.
- Updates model integration points for upstream API moves (Qwen3.5 GDN attention class path; Deepseek OCR `MultiModalDataDict` import path).
- Restructures/expands KV offloading connector unit tests (moves scheduler tests into a dedicated file, adds metrics tests, adds a `conftest.py` fixture export, and adds async-scheduling parameterization).
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `vllm_gaudi/v1/kv_offload/worker/cpu_hpu.py` | Refactors CPU↔HPU KV transfer handler to use canonical KV cache metadata and new buffer allocation/copy logic. |
| `vllm_gaudi/models/qwen3_5.py` | Updates Qwen3.5 GatedDeltaNet integration to use the upstream `GatedDeltaNetAttention` symbol. |
| `vllm_gaudi/models/deepseek_ocr.py` | Updates `MultiModalDataDict` import to the new upstream module path. |
| `tests/unit_tests/kv_offload/offloading_connector/utils.py` | Test harness updates: new imports, async scheduling option, updated spec module path, and vLLM config context usage. |
| `tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py` | New scheduler/offloading behavior tests (incl. async scheduling parameterization). |
| `tests/unit_tests/kv_offload/offloading_connector/test_metrics.py` | New unit tests for offloading connector metrics/stats behavior. |
| `tests/unit_tests/kv_offload/offloading_connector/conftest.py` | Exposes the `request_runner` fixture for the new test module layout. |
| `tests/unit_tests/kv_offload/offloading_connector/__init__.py` | Package marker for the test module directory. |
Comments suppressed due to low confidence (1)
tests/unit_tests/kv_offload/offloading_connector/utils.py:25
- These imports use parentheses with whitespace before the closing `)` (e.g. `OffloadingConnectorMetadata, )`), which typically triggers pycodestyle/ruff E202. Please reformat these imports (or run the repo formatter) to avoid lint failures.
```diff
 with torch.hpu.stream(stream):
     start_event.record(stream)
-    for src_tensor, dst_tensor, block_size_in_bytes in zip(
+    for src_tensor, dst_tensor in zip(
         self.src_tensors,
         self.dst_tensors,
-        self.block_size_in_bytes,
     ):
-        swap_blocks(src_tensor, dst_tensor, src_to_dst_tensor, \
-                    "d2h" if self.src_tensors[0].device.type == "hpu" else "h2d")
+        src_device_indices = src_indices.to(src_tensor.device)
+        dst_device_indices = dst_indices.to(dst_tensor.device)
+        target_device = dst_tensor.device.type
+        dst_tensor.index_put_(
+            (dst_device_indices, ),
+            src_tensor.index_select(0, src_device_indices).to(target_device),
+        )
```
transfer_async copies blocks by index_put_ of whole rows, but CPU tensors are allocated with cpu_page_size_bytes = gpu_page_size_bytes * block_size_factor. When block_size_factor != 1, the source and destination row widths differ (GPU page size vs CPU page size), so this index_put_ will raise a shape mismatch at runtime and cannot correctly pack/unpack sub-blocks. Consider reshaping the CPU tensor to a sub-block view (e.g., (num_cpu_blocks, block_size_factor, gpu_page_size_bytes)), and performing transfers on the sub-block dimension so GPU<->CPU copies always use gpu_page_size_bytes-sized slices.
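The sub-block view the reviewer suggests can be illustrated with a small pure-Python model. Note that the names here (`gpu_page_size`, `block_size_factor`, `store_gpu_page`, `load_gpu_page`) are illustrative stand-ins, not the module's actual variables; the point is only the index arithmetic: a GPU block index splits into a (CPU block, sub-block) pair, so every copy moves exactly one GPU-page-sized slice.

```python
# Illustrative sketch: packing GPU-sized pages into wider CPU blocks.
# A CPU block holds `block_size_factor` GPU pages, so copying whole CPU
# rows against GPU rows would mismatch in width; instead, index the
# sub-block view of shape (num_cpu_blocks, block_size_factor, gpu_page_size).

gpu_page_size = 4          # bytes per GPU page (toy value)
block_size_factor = 2      # one CPU block holds 2 GPU pages
num_cpu_blocks = 3

# CPU buffer viewed as (num_cpu_blocks, block_size_factor, gpu_page_size)
cpu = [[[0] * gpu_page_size for _ in range(block_size_factor)]
       for _ in range(num_cpu_blocks)]

def store_gpu_page(cpu_buf, gpu_block_idx, payload):
    """Place one GPU-sized page into its sub-block slot on the CPU side."""
    cpu_idx, sub_idx = divmod(gpu_block_idx, block_size_factor)
    cpu_buf[cpu_idx][sub_idx] = list(payload)

def load_gpu_page(cpu_buf, gpu_block_idx):
    """Read one GPU-sized page back out of its sub-block slot."""
    cpu_idx, sub_idx = divmod(gpu_block_idx, block_size_factor)
    return cpu_buf[cpu_idx][sub_idx]

# GPU block 3 lands in CPU block 1, sub-slot 1.
store_gpu_page(cpu, 3, [7, 7, 7, 7])
assert load_gpu_page(cpu, 3) == [7, 7, 7, 7]
assert cpu[1][1] == [7, 7, 7, 7]
```

In the tensor version, the same effect would come from a `reshape` of the CPU tensor to `(num_cpu_blocks, block_size_factor, gpu_page_size_bytes)` followed by `index_put_`/`index_select` on the flattened sub-block dimension.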
```python
            src_tensor.index_select(0, src_device_indices).to(target_device),
        )

torch.hpu.synchronize()
```
torch.hpu.synchronize() inside transfer_async defeats the purpose of async transfers: it blocks the host until all queued HPU work completes, and also makes the recorded end_event effectively already-complete. Prefer relying on stream ordering + end_event.record(stream) (and the consumer waiting on the event) rather than a global synchronize.
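The record/wait protocol the reviewer recommends can be sketched device-agnostically: the producer enqueues the copy and records a completion marker on its work queue instead of blocking the host, and the consumer waits on that marker only when it needs the result. Below is a pure-Python analogue using `threading` (the thread plays the role of a device stream, the `Event` the role of `end_event`; none of these names come from the actual module):

```python
import threading
import queue

work_queue = queue.Queue()
results = []

def worker():
    # Simulates a device stream: executes queued ops strictly in order.
    while True:
        op = work_queue.get()
        if op is None:
            break
        op()

stream = threading.Thread(target=worker)
stream.start()

def transfer_async(data, end_event):
    # Enqueue the copy, then record the event; do NOT block the host here
    # (the analogue of avoiding a global torch.hpu.synchronize()).
    work_queue.put(lambda: results.append(data))
    work_queue.put(end_event.set)  # analogue of end_event.record(stream)

done = threading.Event()
transfer_async("block-0", done)

# The host stays free; the consumer waits on the event only when needed.
done.wait(timeout=5)
assert results == ["block-0"]

work_queue.put(None)  # shut the simulated stream down
stream.join()
```

Because the stream executes ops in order, the event can only fire after the copy it follows, which is exactly the ordering guarantee that makes the global synchronize redundant.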
Suggested change:
```diff
-torch.hpu.synchronize()
```
```python
# TODO(orozery): use kv_cache_data_ref.page_size_bytes
# once swap_blocks support it
```
The TODO mentions swap_blocks, but swap_blocks was removed from this module. Please update the comment to reflect the current transfer implementation (or remove it) to avoid misleading future changes (e.g., reference kv_cache_data_ref.page_size_bytes usage in the current index_put_/reshape-based approach).
Suggested change:
```diff
-# TODO(orozery): use kv_cache_data_ref.page_size_bytes
-# once swap_blocks support it
+# TODO(orozery): consider using kv_cache_data_ref.page_size_bytes
+# to derive this size in the current index_put_/reshape-based transfer.
```
```diff
 import torch
 import vllm.model_executor.models.qwen3_5 as qwen3_5_module
-from vllm.model_executor.models.qwen3_5 import Qwen3_5GatedDeltaNet
+#from vllm.model_executor.models.qwen3_5 import Qwen3_5GatedDeltaNet
```
The commented-out import #from ... violates PEP8/ruff E265 (block comments should start with # ) and leaves dead code behind. Either delete the commented import or format it as a proper comment with a space and an explanation of why it’s intentionally disabled.
Suggested change:
```diff
-#from vllm.model_executor.models.qwen3_5 import Qwen3_5GatedDeltaNet
```
```python
import pytest

from tests.unit_tests.kv_offload.offloading_connector.utils import (
    generate_store_output, )
```
These parenthesized imports include whitespace before the closing ) (e.g. generate_store_output, )), which typically triggers pycodestyle/ruff E202 (whitespace before ')'). Please reformat the import without the extra space (or let yapf/ruff format rewrite it).
Suggested change:
```diff
-    generate_store_output, )
+    generate_store_output,
+)
```
```python
    OffloadingConnectorStats, )
```
These parenthesized imports include whitespace before the closing ) (e.g. OffloadingConnectorStats, )), which typically triggers pycodestyle/ruff E202. Please reformat the import (or run the repo formatter) to avoid lint failures.
Suggested change:
```diff
-    OffloadingConnectorStats, )
+    OffloadingConnectorStats,
+)
```
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from tests.unit_tests.kv_offload.offloading_connector.utils import (
    request_runner, )
```
These parenthesized imports include whitespace before the closing ) (e.g. request_runner, )), which typically triggers pycodestyle/ruff E202. Please reformat the import (or run the formatter).
Suggested change:
```diff
-    request_runner, )
+    request_runner,
+)
```
…r #38139 Signed-off-by: Iryna Boiko <iboiko@habana.ai>
…r qwen3_5 Signed-off-by: Iryna Boiko <iboiko@habana.ai>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
✅ CI Passed: All checks passed successfully against the following vllm commit:
Signed-off-by: Iryna Boiko <iryna.boiko@intel.com>
…-pick PR #1279 Cherry-pick of #1279 Changes: - KV cache offloading: rewrite HPU monkey-patches for new CanonicalKVCaches API - GDN attention refactor: update monkey-patch target for Qwen3.5 - DeepSeek OCR: minor compatibility fix - Pooling metadata: rename and fix tensor handling - LLM(**engine_args) test changes: use LLM.from_engine_args - Restructure offloading connector tests - Add tblib dependency - Update CI discoverable tests
✅ CI Passed: All checks passed successfully against the following vllm commit:
KV cache offloading (vllm#37853):
- `cpu_hpu.py`: Rewrite HPU monkey-patches for the new `CanonicalKVCaches` API: updated `get_handlers`, `SingleDirectionOffloadingHandler.__init__`, `CpuGpuOffloadingHandlers.__init__`, and `transfer_async`. Added an `OffloadingConnectorWorker.register_kv_caches` override to handle HPU's `TensorTuple` (K/V pair) KV cache layout.
- `test_offloading_connector.py`: Renamed to `utils.py` per upstream restructuring.
- `offloading_connector/`: New test subdirectory with `__init__.py`, `conftest.py`, `utils.py`, `test_scheduler.py`, and `test_metrics.py`, matching the upstream test split.

GDN attention refactor (vllm#37975):
- `qwen3_5.py`: Updated monkey-patch target from the removed `Qwen3_5GatedDeltaNet` to `GatedDeltaNetAttention`.

DeepSeek OCR fix (vllm#35182):
- `deepseek_ocr.py`: Minor compatibility fix.

Pooling metadata (vllm#38139):
- `hpu_input_batch.py`: Renamed `_make_prompt_token_ids_tensor` → `_make_prompt_token_ids_cpu_tensor` (returns a CPU tensor); updated `_make_sampling_metadata` to create the CPU tensor first and then copy it to the device; added `prompt_token_ids_cpu` to `get_pooling_metadata`.
- `hpu_model_runner.py`: Added a `prompt_token_ids_cpu` argument to the `PoolingMetadata` constructors.

LLM(**engine_args) test changes (vllm#37902):
- `generation_mm.py`: Replace `asdict(engine_args)` + `LLM(**engine_args)` with `LLM.from_engine_args(engine_args)`; remove the unused `asdict` import.
- `generation_mm_multi.py`: Same change.
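The monkey-patch retargeting described above (pointing the plugin at `GatedDeltaNetAttention` after `Qwen3_5GatedDeltaNet` was removed upstream) follows a common plugin pattern: replace a symbol on the upstream module so later lookups resolve to the plugin's device-specific implementation. A minimal pure-Python sketch, where the module and class names are illustrative stand-ins rather than the actual vLLM symbols:

```python
import types

# Stand-in for an upstream module exposing an attention class.
upstream = types.ModuleType("upstream_models")

class GatedDeltaNetAttention:          # upstream symbol (illustrative)
    def forward(self, x):
        return x

upstream.GatedDeltaNetAttention = GatedDeltaNetAttention

# Plugin-side subclass with device-specific behavior.
class HPUGatedDeltaNetAttention(GatedDeltaNetAttention):
    def forward(self, x):
        return x * 2                    # pretend HPU-optimized path

def apply_patch(module):
    """Retarget the upstream symbol to the plugin implementation."""
    module.GatedDeltaNetAttention = HPUGatedDeltaNetAttention

apply_patch(upstream)

# Code that looks the class up through the module now gets the patch.
attn = upstream.GatedDeltaNetAttention()
assert isinstance(attn, HPUGatedDeltaNetAttention)
assert attn.forward(3) == 6
```

The fragility this PR fixes is inherent to the pattern: when upstream renames or removes the patched symbol, the patch silently targets nothing, so each upstream refactor (here vllm#37975) requires updating the patch target in the plugin.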