
fix kernel block size, port of #1439 #1453

Merged
kamil-kaczor merged 2 commits into vllm-project:main from iboiko-habana:pr1439_port
May 19, 2026


@iboiko-habana
Collaborator

@iboiko-habana iboiko-habana commented May 18, 2026

Port of #1439
16 is supported for testing/smaller models; 128 is the standard HPU
kernel block size; 528 is required for Granite 4.0-H
(granitemoehybrid) without prefix caching (16-token FA alignment),
768 with prefix caching (chunk-aligned).

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the V1 HPU attention backend’s supported kernel block sizes to include 16-token blocks while preserving the existing 128/528/768-token support used by standard HPU and Granite hybrid configurations.

Changes:

  • Adds 16 to HPUAttentionBackendV1.get_supported_kernel_block_sizes().
  • Updates the inline comment to describe the new smaller/testing block-size support.
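As a rough illustration of the change summarized above, the following minimal sketch shows a method returning the four block sizes named in this PR. The class `HPUAttentionBackendV1Sketch`, the plain `List[int]` return type, and the helper `is_supported_block_size` are assumptions made for illustration; the real `HPUAttentionBackendV1` has many more members and its exact signature is not shown in this conversation.

```python
from typing import List


class HPUAttentionBackendV1Sketch:
    """Hypothetical stand-in for the backend class named in the review."""

    @staticmethod
    def get_supported_kernel_block_sizes() -> List[int]:
        # 16:  testing / smaller models
        # 128: standard HPU kernel block size
        # 528: Granite 4.0-H (granitemoehybrid) without prefix caching
        # 768: Granite 4.0-H (granitemoehybrid) with prefix caching
        return [16, 128, 528, 768]


def is_supported_block_size(block_size: int) -> bool:
    """Hypothetical helper: check a requested size against the supported list."""
    return block_size in HPUAttentionBackendV1Sketch.get_supported_kernel_block_sizes()
```

With this sketch, `is_supported_block_size(16)` is accepted, while a size that is not in the list, such as 64, is rejected.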

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

@kamil-kaczor kamil-kaczor merged commit a331930 into vllm-project:main May 19, 2026
2 checks passed
iboiko-habana added a commit that referenced this pull request May 19, 2026
1) added in #1453
16 is supported for testing/smaller models; 128 is the standard HPU
kernel block size; 528 is required for Granite 4.0-H
(granitemoehybrid) without prefix caching (16-token FA alignment),
768 with prefix caching (chunk-aligned).

2) _patch_hf3fs_mock_client_for_cpu_only
Upstream mock client unconditionally calls
``torch.cuda.current_stream().wait_event(event)`` in ``batch_write``.
In environments where PyTorch is not compiled with CUDA, that path throws
and the method returns ``-1`` for writes, causing connector unit tests to fail.
This patch keeps the same behavior but skips CUDA synchronization when
CUDA is unavailable.
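
The guard described in item 2 might look roughly like the sketch below. This is not the code of `_patch_hf3fs_mock_client_for_cpu_only`, which is not shown here; the function name `wait_event_if_cuda` and its injected callables are hypothetical, chosen so the example runs without PyTorch installed. With PyTorch, the callables would be `torch.cuda.is_available` and `torch.cuda.current_stream`.

```python
from typing import Any, Callable


def wait_event_if_cuda(
    event: Any,
    cuda_is_available: Callable[[], bool],
    get_current_stream: Callable[[], Any],
) -> bool:
    """Synchronize on `event` only when CUDA is usable.

    Returns True if the stream waited on the event, False if the
    synchronization step was skipped because CUDA is unavailable.
    """
    if not cuda_is_available():
        # CPU-only build: skip the CUDA stream call that would otherwise raise.
        return False
    get_current_stream().wait_event(event)
    return True
```

On a CPU-only build the function returns `False` without touching any stream, so the surrounding write path can continue instead of failing.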

---------

Signed-off-by: Harish Subramony <harish.subramony@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
mgawarkiewicz-intel pushed a commit that referenced this pull request May 25, 2026
Signed-off-by: Iryna Boiko <iboiko@habana.ai>


3 participants