
[FIX_FOR_VLLM_CUSTOM=d28d86e8a34bf2617be294c235d6e6ef3321917b]#1279

Merged
iboiko-habana merged 16 commits into vllm-project:main from iboiko-habana:hourly2903
Apr 7, 2026

Conversation

@iboiko-habana (Collaborator) commented Mar 31, 2026

  1. KV cache offloading (vllm#37853):
    cpu_hpu.py: Rewrite HPU monkey-patches for the new CanonicalKVCaches API — updated get_handlers, SingleDirectionOffloadingHandler.__init__, CpuGpuOffloadingHandlers.__init__, and transfer_async. Added an OffloadingConnectorWorker.register_kv_caches override to handle HPU's TensorTuple (K/V pair) KV cache layout.
    test_offloading_connector.py: Renamed to utils.py per upstream restructuring.
    offloading_connector: New test subdirectory with __init__.py, conftest.py, utils.py, test_scheduler.py, test_metrics.py matching the upstream test split.

  2. GDN attention refactor (vllm#37975):
    qwen3_5.py: Updated monkey-patch target from removed Qwen3_5GatedDeltaNet to GatedDeltaNetAttention.

  3. DeepSeek OCR fix (vllm#35182):
    deepseek_ocr.py: Minor compatibility fix.

  4. Pooling metadata (vllm#38139):
    hpu_input_batch.py: Renamed _make_prompt_token_ids_tensor → _make_prompt_token_ids_cpu_tensor (returns CPU tensor); updated _make_sampling_metadata to create CPU tensor first then copy to device; added prompt_token_ids_cpu to get_pooling_metadata.
    hpu_model_runner.py: Added prompt_token_ids_cpu argument to PoolingMetadata constructors.

  5. LLM(**engine_args) test changes (vllm#37902):
    generation_mm.py — replace asdict(engine_args) + LLM(**engine_args) with LLM.from_engine_args(engine_args), remove unused asdict import
    generation_mm_multi.py — same change
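The swap in item 5 can be sketched with stand-in classes (the `EngineArgs` fields and the `LLM` stand-in below are illustrative only, not vLLM's real definitions — the point is the calling convention, not the engine internals):

```python
from dataclasses import asdict, dataclass


@dataclass
class EngineArgs:  # stand-in for vllm.EngineArgs
    model: str = "facebook/opt-125m"
    dtype: str = "bfloat16"


class LLM:  # stand-in for vllm.LLM
    def __init__(self, **kwargs):
        self.kwargs = kwargs

    @classmethod
    def from_engine_args(cls, engine_args):
        # Upstream now builds the engine from the dataclass directly,
        # so callers no longer need asdict(engine_args) themselves.
        return cls(**asdict(engine_args))


args = EngineArgs()
old_style = LLM(**asdict(args))          # pattern removed by this PR
new_style = LLM.from_engine_args(args)   # pattern adopted by this PR
assert old_style.kwargs == new_style.kwargs
```

Besides dropping the unused `asdict` import, the classmethod form keeps tests working if upstream adds constructor-only logic to `from_engine_args` later.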

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Copilot AI (Contributor) left a comment


Pull request overview

This PR updates the Gaudi plugin for compatibility with upstream vLLM changes (referenced in the PR description), primarily around KV offloading layout/metadata and some model import path adjustments.

Changes:

  • Refactors the CPU↔HPU KV offload handler wiring to consume CanonicalKVCaches / page-based KV cache metadata and allocates CPU KV buffers accordingly.
  • Updates model integration points for upstream API moves (Qwen3.5 GDN attention class path; Deepseek OCR MultiModalDataDict import path).
  • Restructures/expands KV offloading connector unit tests (moves scheduler tests into a dedicated file, adds metrics tests, adds a conftest.py fixture export, and adds async-scheduling parameterization).

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| vllm_gaudi/v1/kv_offload/worker/cpu_hpu.py | Refactors CPU↔HPU KV transfer handler to use canonical KV cache metadata and new buffer allocation/copy logic. |
| vllm_gaudi/models/qwen3_5.py | Updates Qwen3.5 GatedDeltaNet integration to use the upstream GatedDeltaNetAttention symbol. |
| vllm_gaudi/models/deepseek_ocr.py | Updates MultiModalDataDict import to the new upstream module path. |
| tests/unit_tests/kv_offload/offloading_connector/utils.py | Test harness updates: new imports, async scheduling option, updated spec module path, and vLLM config context usage. |
| tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py | New scheduler/offloading behavior tests (incl. async scheduling parameterization). |
| tests/unit_tests/kv_offload/offloading_connector/test_metrics.py | New unit tests for offloading connector metrics/stats behavior. |
| tests/unit_tests/kv_offload/offloading_connector/conftest.py | Exposes request_runner fixture for the new test module layout. |
| tests/unit_tests/kv_offload/offloading_connector/__init__.py | Package marker for the test module directory. |
Comments suppressed due to low confidence (1)

tests/unit_tests/kv_offload/offloading_connector/utils.py:25

  • These imports use parentheses with whitespace before the closing ) (e.g. OffloadingConnectorMetadata, )), which typically triggers pycodestyle/ruff E202. Please reformat these imports (or run the repo formatter) to avoid lint failures.

Comment on lines 185 to +197

```diff
 with torch.hpu.stream(stream):
     start_event.record(stream)
-    for src_tensor, dst_tensor, block_size_in_bytes in zip(
+    for src_tensor, dst_tensor in zip(
         self.src_tensors,
         self.dst_tensors,
-        self.block_size_in_bytes,
     ):
-        swap_blocks(src_tensor, dst_tensor, src_to_dst_tensor, \
-                    "d2h" if self.src_tensors[0].device.type == "hpu" else "h2d")
+        src_device_indices = src_indices.to(src_tensor.device)
+        dst_device_indices = dst_indices.to(dst_tensor.device)
+        target_device = dst_tensor.device.type
+        dst_tensor.index_put_(
+            (dst_device_indices, ),
+            src_tensor.index_select(0, src_device_indices).to(target_device),
+        )
```
Copilot AI commented Mar 31, 2026

transfer_async copies blocks by index_put_ of whole rows, but CPU tensors are allocated with cpu_page_size_bytes = gpu_page_size_bytes * block_size_factor. When block_size_factor != 1, the source and destination row widths differ (GPU page size vs CPU page size), so this index_put_ will raise a shape mismatch at runtime and cannot correctly pack/unpack sub-blocks. Consider reshaping the CPU tensor to a sub-block view (e.g., (num_cpu_blocks, block_size_factor, gpu_page_size_bytes)) and performing transfers on the sub-block dimension so GPU<->CPU copies always use gpu_page_size_bytes-sized slices.
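Copilot's reshape suggestion can be illustrated with a small NumPy sketch (NumPy stands in for the torch HPU tensors, and the sizes and block-index mapping are made up for illustration — the real code operates on device tensors and connector-provided indices):

```python
import numpy as np

gpu_page_size_bytes = 4
block_size_factor = 2          # one CPU block packs 2 GPU pages
num_gpu_blocks, num_cpu_blocks = 6, 3

# GPU KV buffer: one row per GPU block.
gpu = np.arange(num_gpu_blocks * gpu_page_size_bytes,
                dtype=np.uint8).reshape(num_gpu_blocks, gpu_page_size_bytes)

# CPU buffer allocated with the wider cpu_page_size_bytes rows.
cpu = np.zeros((num_cpu_blocks, block_size_factor * gpu_page_size_bytes),
               dtype=np.uint8)

# Sub-block view: (num_cpu_blocks, block_size_factor, gpu_page_size_bytes).
# Copies on the middle axis always move exactly one GPU-page-sized slice,
# so source and destination widths match even when block_size_factor != 1.
cpu_view = cpu.reshape(num_cpu_blocks, block_size_factor, gpu_page_size_bytes)

# Offload GPU blocks 0..3 into CPU blocks 0..1 (hypothetical mapping).
src_indices = [0, 1, 2, 3]
for i, src in enumerate(src_indices):
    cpu_view[i // block_size_factor, i % block_size_factor] = gpu[src]

# CPU block 0 now holds GPU pages 0 and 1 packed back to back.
assert np.array_equal(cpu[0], gpu[0:2].reshape(-1))
```

A whole-row index_put_ on `cpu` itself would try to assign a 4-byte GPU row into an 8-byte CPU row, which is exactly the shape mismatch the comment warns about.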
```python
            src_tensor.index_select(0, src_device_indices).to(target_device),
        )

torch.hpu.synchronize()
```
Copilot AI commented Mar 31, 2026

torch.hpu.synchronize() inside transfer_async defeats the purpose of async transfers: it blocks the host until all queued HPU work completes, and also makes the recorded end_event effectively already complete. Prefer relying on stream ordering + end_event.record(stream) (and the consumer waiting on the event) rather than a global synchronize.

Suggested change:

```diff
-torch.hpu.synchronize()
```
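The event-based pattern Copilot recommends can be sketched in plain Python (a ThreadPoolExecutor stands in for the HPU stream and a Future for the recorded end event, since torch.hpu is not assumed available here; the "copy" body is a placeholder):

```python
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)  # stands in for the HPU stream


def transfer_async(blocks):
    """Queue the copy and return a completion handle -- no global sync."""
    def do_copy():
        return [b * 2 for b in blocks]  # placeholder for the device copy

    end_event: Future = executor.submit(do_copy)  # like end_event.record(stream)
    return end_event


handle = transfer_async([1, 2, 3])
# The producer stays unblocked; only the consumer that needs the result waits,
# analogous to waiting on the recorded end_event instead of synchronize():
assert handle.result() == [2, 4, 6]
```

The key property is that completion is observed per transfer (one event/handle each) rather than by stalling the whole device queue.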
Comment on lines +127 to +128

```python
# TODO(orozery): use kv_cache_data_ref.page_size_bytes
# once swap_blocks support it
```

Copilot AI commented Mar 31, 2026

The TODO mentions swap_blocks, but swap_blocks was removed from this module. Update the comment to reflect the current transfer implementation (or remove it) so it does not mislead future changes (e.g., reference kv_cache_data_ref.page_size_bytes usage in the current index_put_/reshape-based approach).

Suggested change:

```diff
-# TODO(orozery): use kv_cache_data_ref.page_size_bytes
-# once swap_blocks support it
+# TODO(orozery): consider using kv_cache_data_ref.page_size_bytes
+# to derive this size in the current index_put_/reshape-based transfer.
```
Comment thread: vllm_gaudi/models/qwen3_5.py (outdated)

```diff
 import torch
 import vllm.model_executor.models.qwen3_5 as qwen3_5_module
-from vllm.model_executor.models.qwen3_5 import Qwen3_5GatedDeltaNet
+#from vllm.model_executor.models.qwen3_5 import Qwen3_5GatedDeltaNet
```

Copilot AI commented Mar 31, 2026

The commented-out import #from ... violates PEP8/ruff E265 (block comments should start with "# ") and leaves dead code behind. Either delete the commented import or format it as a proper comment with a space and an explanation of why it's intentionally disabled.

Suggested change:

```diff
-#from vllm.model_executor.models.qwen3_5 import Qwen3_5GatedDeltaNet
```
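For context, the retargeting this thread discusses follows the usual plugin monkey-patch pattern: re-point a method on the upstream symbol at an HPU-specific implementation. A minimal sketch with stand-in objects (the module, the class body, and the HPU forward shown here are all illustrative, not the real vLLM/Gaudi code):

```python
import types

# Stand-in for vllm.model_executor.models.qwen3_5.
upstream = types.ModuleType("qwen3_5_module")


class GatedDeltaNetAttention:  # stand-in for the upstream class
    def forward(self, x):
        return x


upstream.GatedDeltaNetAttention = GatedDeltaNetAttention


def hpu_forward(self, x):
    # A Habana-specific implementation would go here; +1 marks the patch.
    return x + 1


# The plugin re-points the upstream class's method at the HPU version.
# When upstream renames/removes the class (Qwen3_5GatedDeltaNet ->
# GatedDeltaNetAttention), the patch target must be updated in lockstep.
upstream.GatedDeltaNetAttention.forward = hpu_forward

assert upstream.GatedDeltaNetAttention().forward(1) == 2
```

This is why the removed Qwen3_5GatedDeltaNet import broke the plugin: a monkey-patch fails at import time as soon as its target symbol disappears upstream.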
```python
import pytest

from tests.unit_tests.kv_offload.offloading_connector.utils import (
    generate_store_output, )
```

Copilot AI commented Mar 31, 2026

These parenthesized imports include whitespace before the closing ) (e.g. generate_store_output, )), which typically triggers pycodestyle/ruff E202 (whitespace before ')'). Please reformat the import without the extra space (or let yapf/ruff format rewrite it).

Suggested change:

```diff
-    generate_store_output, )
+    generate_store_output,
+)
```
Comment on lines +5 to +7

```python
    OffloadingConnectorStats, )
```

Copilot AI commented Mar 31, 2026

These parenthesized imports include whitespace before the closing ) (e.g. OffloadingConnectorStats, )), which typically triggers pycodestyle/ruff E202. Please reformat the import (or run the repo formatter) to avoid lint failures.

Suggested change:

```diff
-    OffloadingConnectorStats, )
+    OffloadingConnectorStats,
+)
```
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from tests.unit_tests.kv_offload.offloading_connector.utils import (
    request_runner, )
```

Copilot AI commented Mar 31, 2026

These parenthesized imports include whitespace before the closing ) (e.g. request_runner, )), which typically triggers pycodestyle/ruff E202. Please reformat the import (or run the formatter).

Suggested change:

```diff
-    request_runner, )
+    request_runner,
+)
```
…r #38139

Signed-off-by: Iryna Boiko <iboiko@habana.ai>

…r qwen3_5

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
github-actions Bot commented Apr 1, 2026

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

github-actions Bot commented Apr 2, 2026: 🚧 CI Blocked (branch behind the base branch; merge or rebase to get the latest changes).

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
github-actions Bot commented Apr 2, 2026: 🚧 CI Blocked (branch behind the base branch; merge or rebase to get the latest changes).

github-actions Bot commented Apr 3, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
d28d86e8a34bf2617be294c235d6e6ef3321917b

adobrzyn added a commit that referenced this pull request Apr 7, 2026
…-pick PR #1279

Cherry-pick of #1279

Changes:
- KV cache offloading: rewrite HPU monkey-patches for new CanonicalKVCaches API
- GDN attention refactor: update monkey-patch target for Qwen3.5
- DeepSeek OCR: minor compatibility fix
- Pooling metadata: rename and fix tensor handling
- LLM(**engine_args) test changes: use LLM.from_engine_args
- Restructure offloading connector tests
- Add tblib dependency
- Update CI discoverable tests
github-actions Bot commented Apr 7, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
d28d86e8a34bf2617be294c235d6e6ef3321917b

@iboiko-habana iboiko-habana merged commit 89217de into vllm-project:main Apr 7, 2026
92 of 141 checks passed
