[FIX_FOR_VLLM_CUSTOM=d28d86e8a34bf2617be294c235d6e6ef3321917b] #1279
iboiko-habana merged 16 commits into vllm-project:main
Conversation
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Pull request overview
This PR updates the Gaudi plugin for compatibility with upstream vLLM changes (referenced in the PR description), primarily around KV offloading layout/metadata and some model import path adjustments.
Changes:
- Refactors the CPU↔HPU KV offload handler wiring to consume `CanonicalKVCaches`/page-based KV cache metadata and allocates CPU KV buffers accordingly.
- Updates model integration points for upstream API moves (Qwen3.5 GDN attention class path; Deepseek OCR `MultiModalDataDict` import path).
- Restructures/expands KV offloading connector unit tests (moves scheduler tests into a dedicated file, adds metrics tests, adds a `conftest.py` fixture export, and adds async-scheduling parameterization).
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `vllm_gaudi/v1/kv_offload/worker/cpu_hpu.py` | Refactors CPU↔HPU KV transfer handler to use canonical KV cache metadata and new buffer allocation/copy logic. |
| `vllm_gaudi/models/qwen3_5.py` | Updates Qwen3.5 GatedDeltaNet integration to use the upstream `GatedDeltaNetAttention` symbol. |
| `vllm_gaudi/models/deepseek_ocr.py` | Updates `MultiModalDataDict` import to the new upstream module path. |
| `tests/unit_tests/kv_offload/offloading_connector/utils.py` | Test harness updates: new imports, async scheduling option, updated spec module path, and vLLM config context usage. |
| `tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py` | New scheduler/offloading behavior tests (incl. async scheduling parameterization). |
| `tests/unit_tests/kv_offload/offloading_connector/test_metrics.py` | New unit tests for offloading connector metrics/stats behavior. |
| `tests/unit_tests/kv_offload/offloading_connector/conftest.py` | Exposes the `request_runner` fixture for the new test module layout. |
| `tests/unit_tests/kv_offload/offloading_connector/__init__.py` | Package marker for the test module directory. |
Comments suppressed due to low confidence (1)
tests/unit_tests/kv_offload/offloading_connector/utils.py:25
- These imports use parentheses with whitespace before the closing `)` (e.g. `OffloadingConnectorMetadata, )`), which typically triggers pycodestyle/ruff E202. Please reformat these imports (or run the repo formatter) to avoid lint failures.
```diff
 with torch.hpu.stream(stream):
     start_event.record(stream)
-    for src_tensor, dst_tensor, block_size_in_bytes in zip(
+    for src_tensor, dst_tensor in zip(
         self.src_tensors,
         self.dst_tensors,
-        self.block_size_in_bytes,
     ):
-        swap_blocks(src_tensor, dst_tensor, src_to_dst_tensor, \
-                    "d2h" if self.src_tensors[0].device.type == "hpu" else "h2d")
+        src_device_indices = src_indices.to(src_tensor.device)
+        dst_device_indices = dst_indices.to(dst_tensor.device)
+        target_device = dst_tensor.device.type
+        dst_tensor.index_put_(
+            (dst_device_indices, ),
+            src_tensor.index_select(0, src_device_indices).to(target_device),
+        )
```
transfer_async copies blocks by index_put_ of whole rows, but CPU tensors are allocated with cpu_page_size_bytes = gpu_page_size_bytes * block_size_factor. When block_size_factor != 1, the source and destination row widths differ (GPU page size vs CPU page size), so this index_put_ will raise a shape mismatch at runtime and cannot correctly pack/unpack sub-blocks. Consider reshaping the CPU tensor to a sub-block view (e.g., (num_cpu_blocks, block_size_factor, gpu_page_size_bytes)), and performing transfers on the sub-block dimension so GPU<->CPU copies always use gpu_page_size_bytes-sized slices.
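The sub-block view the reviewer suggests can be illustrated with a small pure-Python model. Note that the names here (`gpu_page_size`, `block_size_factor`, `store_gpu_page`, `load_gpu_page`) are illustrative stand-ins, not the module's actual variables; the point is only the index arithmetic: a GPU block index splits into a (CPU block, sub-block) pair, so every copy moves exactly one GPU-page-sized slice.

```python
# Illustrative sketch: packing GPU-sized pages into wider CPU blocks.
# A CPU block holds `block_size_factor` GPU pages, so copying whole CPU
# rows against GPU rows would mismatch in width; instead, index the
# sub-block view of shape (num_cpu_blocks, block_size_factor, gpu_page_size).

gpu_page_size = 4          # bytes per GPU page (toy value)
block_size_factor = 2      # one CPU block holds 2 GPU pages
num_cpu_blocks = 3

# CPU buffer viewed as (num_cpu_blocks, block_size_factor, gpu_page_size)
cpu = [[[0] * gpu_page_size for _ in range(block_size_factor)]
       for _ in range(num_cpu_blocks)]

def store_gpu_page(cpu_buf, gpu_block_idx, payload):
    """Place one GPU-sized page into its sub-block slot on the CPU side."""
    cpu_idx, sub_idx = divmod(gpu_block_idx, block_size_factor)
    cpu_buf[cpu_idx][sub_idx] = list(payload)

def load_gpu_page(cpu_buf, gpu_block_idx):
    """Read one GPU-sized page back out of its sub-block slot."""
    cpu_idx, sub_idx = divmod(gpu_block_idx, block_size_factor)
    return cpu_buf[cpu_idx][sub_idx]

# GPU block 3 lands in CPU block 1, sub-slot 1.
store_gpu_page(cpu, 3, [7, 7, 7, 7])
assert load_gpu_page(cpu, 3) == [7, 7, 7, 7]
assert cpu[1][1] == [7, 7, 7, 7]
```

In the tensor version, the same effect would come from a `reshape` of the CPU tensor to `(num_cpu_blocks, block_size_factor, gpu_page_size_bytes)` followed by `index_put_`/`index_select` on the flattened sub-block dimension.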
```python
            src_tensor.index_select(0, src_device_indices).to(target_device),
        )

torch.hpu.synchronize()
```
torch.hpu.synchronize() inside transfer_async defeats the purpose of async transfers: it blocks the host until all queued HPU work completes, and also makes the recorded end_event effectively already-complete. Prefer relying on stream ordering + end_event.record(stream) (and the consumer waiting on the event) rather than a global synchronize.
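The record/wait protocol the reviewer recommends can be sketched device-agnostically: the producer enqueues the copy and records a completion marker on its work queue instead of blocking the host, and the consumer waits on that marker only when it needs the result. Below is a pure-Python analogue using `threading` (the thread plays the role of a device stream, the `Event` the role of `end_event`; none of these names come from the actual module):

```python
import threading
import queue

work_queue = queue.Queue()
results = []

def worker():
    # Simulates a device stream: executes queued ops strictly in order.
    while True:
        op = work_queue.get()
        if op is None:
            break
        op()

stream = threading.Thread(target=worker)
stream.start()

def transfer_async(data, end_event):
    # Enqueue the copy, then record the event; do NOT block the host here
    # (the analogue of avoiding a global torch.hpu.synchronize()).
    work_queue.put(lambda: results.append(data))
    work_queue.put(end_event.set)  # analogue of end_event.record(stream)

done = threading.Event()
transfer_async("block-0", done)

# The host stays free; the consumer waits on the event only when needed.
done.wait(timeout=5)
assert results == ["block-0"]

work_queue.put(None)  # shut the simulated stream down
stream.join()
```

Because the stream executes ops in order, the event can only fire after the copy it follows, which is exactly the ordering guarantee that makes the global synchronize redundant.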
Suggested change:
```diff
-torch.hpu.synchronize()
```
```python
# TODO(orozery): use kv_cache_data_ref.page_size_bytes
# once swap_blocks support it
```
The TODO mentions swap_blocks, but swap_blocks was removed from this module. Please update the comment to reflect the current transfer implementation (or remove it) to avoid misleading future changes (e.g., reference kv_cache_data_ref.page_size_bytes usage in the current index_put_/reshape-based approach).
Suggested change:
```diff
-# TODO(orozery): use kv_cache_data_ref.page_size_bytes
-# once swap_blocks support it
+# TODO(orozery): consider using kv_cache_data_ref.page_size_bytes
+# to derive this size in the current index_put_/reshape-based transfer.
```
```diff
 import torch
 import vllm.model_executor.models.qwen3_5 as qwen3_5_module
-from vllm.model_executor.models.qwen3_5 import Qwen3_5GatedDeltaNet
+#from vllm.model_executor.models.qwen3_5 import Qwen3_5GatedDeltaNet
```
The commented-out import #from ... violates PEP8/ruff E265 (block comments should start with # ) and leaves dead code behind. Either delete the commented import or format it as a proper comment with a space and an explanation of why it’s intentionally disabled.
Suggested change:
```diff
-#from vllm.model_executor.models.qwen3_5 import Qwen3_5GatedDeltaNet
```
```python
import pytest

from tests.unit_tests.kv_offload.offloading_connector.utils import (
    generate_store_output, )
```
These parenthesized imports include whitespace before the closing ) (e.g. generate_store_output, )), which typically triggers pycodestyle/ruff E202 (whitespace before ')'). Please reformat the import without the extra space (or let yapf/ruff format rewrite it).
Suggested change:
```diff
-    generate_store_output, )
+    generate_store_output,
+)
```
```python
    OffloadingConnectorStats, )
```
These parenthesized imports include whitespace before the closing ) (e.g. OffloadingConnectorStats, )), which typically triggers pycodestyle/ruff E202. Please reformat the import (or run the repo formatter) to avoid lint failures.
Suggested change:
```diff
-    OffloadingConnectorStats, )
+    OffloadingConnectorStats,
+)
```
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from tests.unit_tests.kv_offload.offloading_connector.utils import (
    request_runner, )
```
These parenthesized imports include whitespace before the closing ) (e.g. request_runner, )), which typically triggers pycodestyle/ruff E202. Please reformat the import (or run the formatter).
Suggested change:
```diff
-    request_runner, )
+    request_runner,
+)
```
…r #38139 Signed-off-by: Iryna Boiko <iboiko@habana.ai>
…r qwen3_5 Signed-off-by: Iryna Boiko <iboiko@habana.ai>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
✅ CI Passed: All checks passed successfully against the following vllm commit:
Signed-off-by: Iryna Boiko <iryna.boiko@intel.com>
…-pick PR #1279 Cherry-pick of #1279 Changes: - KV cache offloading: rewrite HPU monkey-patches for new CanonicalKVCaches API - GDN attention refactor: update monkey-patch target for Qwen3.5 - DeepSeek OCR: minor compatibility fix - Pooling metadata: rename and fix tensor handling - LLM(**engine_args) test changes: use LLM.from_engine_args - Restructure offloading connector tests - Add tblib dependency - Update CI discoverable tests
✅ CI Passed: All checks passed successfully against the following vllm commit:
KV cache offloading (vllm#37853):
- `cpu_hpu.py`: Rewrite HPU monkey-patches for the new `CanonicalKVCaches` API: updated `get_handlers`, `SingleDirectionOffloadingHandler.__init__`, `CpuGpuOffloadingHandlers.__init__`, and `transfer_async`. Added an `OffloadingConnectorWorker.register_kv_caches` override to handle HPU's `TensorTuple` (K/V pair) KV cache layout.
- `test_offloading_connector.py`: Renamed to `utils.py` per upstream restructuring.
- `offloading_connector/`: New test subdirectory with `__init__.py`, `conftest.py`, `utils.py`, `test_scheduler.py`, and `test_metrics.py`, matching the upstream test split.

GDN attention refactor (vllm#37975):
- `qwen3_5.py`: Updated monkey-patch target from the removed `Qwen3_5GatedDeltaNet` to `GatedDeltaNetAttention`.

DeepSeek OCR fix (vllm#35182):
- `deepseek_ocr.py`: Minor compatibility fix.

Pooling metadata (vllm#38139):
- `hpu_input_batch.py`: Renamed `_make_prompt_token_ids_tensor` → `_make_prompt_token_ids_cpu_tensor` (returns a CPU tensor); updated `_make_sampling_metadata` to create the CPU tensor first and then copy it to the device; added `prompt_token_ids_cpu` to `get_pooling_metadata`.
- `hpu_model_runner.py`: Added a `prompt_token_ids_cpu` argument to the `PoolingMetadata` constructors.

LLM(**engine_args) test changes (vllm#37902):
- `generation_mm.py`: Replace `asdict(engine_args)` + `LLM(**engine_args)` with `LLM.from_engine_args(engine_args)`; remove the unused `asdict` import.
- `generation_mm_multi.py`: Same change.
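The monkey-patch retargeting described above (pointing the plugin at `GatedDeltaNetAttention` after `Qwen3_5GatedDeltaNet` was removed upstream) follows a common plugin pattern: replace a symbol on the upstream module so later lookups resolve to the plugin's device-specific implementation. A minimal pure-Python sketch, where the module and class names are illustrative stand-ins rather than the actual vLLM symbols:

```python
import types

# Stand-in for an upstream module exposing an attention class.
upstream = types.ModuleType("upstream_models")

class GatedDeltaNetAttention:          # upstream symbol (illustrative)
    def forward(self, x):
        return x

upstream.GatedDeltaNetAttention = GatedDeltaNetAttention

# Plugin-side subclass with device-specific behavior.
class HPUGatedDeltaNetAttention(GatedDeltaNetAttention):
    def forward(self, x):
        return x * 2                    # pretend HPU-optimized path

def apply_patch(module):
    """Retarget the upstream symbol to the plugin implementation."""
    module.GatedDeltaNetAttention = HPUGatedDeltaNetAttention

apply_patch(upstream)

# Code that looks the class up through the module now gets the patch.
attn = upstream.GatedDeltaNetAttention()
assert isinstance(attn, HPUGatedDeltaNetAttention)
assert attn.forward(3) == 6
```

The fragility this PR fixes is inherent to the pattern: when upstream renames or removes the patched symbol, the patch silently targets nothing, so each upstream refactor (here vllm#37975) requires updating the patch target in the plugin.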