[FIX_FOR_VLLM_CUSTOM=f976e3b98ba45677a2213673a442c6cbff141e8e] Fix upstream regressions in attention, FP8, offloading and platform (#1338)
Conversation
Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Pull request overview
Fixes multiple breakages caused by upstream vLLM API changes, keeping Gaudi/HPU attention, FP8, KV offloading tests, and platform integration compatible with the new interfaces.
Changes:
- Update HPU attention (regular + MLA) to align with the upstream removal of `use_output`/`accept_output_buffer`.
- Register an HPU FP8 block-scaled kernel stub and add an ops test conftest providing a minimal `VllmConfig` context.
- Update KV offloading connector unit tests for upstream event/output field renames and config layout changes.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `vllm_gaudi/platform.py` | Adds `manual_seed_all`, required by the upstream `Platform` API. |
| `vllm_gaudi/ops/hpu_fp8.py` | Registers an OOT entry for `_POSSIBLE_FP8_BLOCK_KERNELS` via an HPU stub kernel class. |
| `vllm_gaudi/ops/hpu_attention.py` | Removes dependency on the upstream-removed `use_output` attribute in attention patching logic. |
| `vllm_gaudi/attention/oot_mla.py` | Removes `accept_output_buffer` branching and standardizes output-buffer usage in the opaque path. |
| `tests/unit_tests/ops/conftest.py` | Introduces a fixture that sets a minimal current `VllmConfig` for ops unit tests. |
| `tests/unit_tests/kv_offload/offloading_connector/utils.py` | Updates scheduler config assertions and adapts the mocked `PrepareStoreOutput` to renamed fields. |
| `tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py` | Updates offloading event tests to use `OffloadKey`/`make_offload_key` and new event fields. |
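The `tests/unit_tests/ops/conftest.py` fixture mentioned above installs a minimal "current `VllmConfig`" around each ops test. A self-contained sketch of that pattern is below; the `FakeModelConfig`/`FakeVllmConfig` classes and the module-global slot are illustrative stand-ins, not the real vLLM types or its actual config-context machinery.

```python
# Sketch of a "current config" context for ops unit tests (stand-in types).
import contextlib
from dataclasses import dataclass, field

_CURRENT_CONFIG = None  # module-global slot standing in for vLLM's current-config state


@dataclass
class FakeModelConfig:
    # Upstream Fp8LinearMethod.__init__ now reads model_config.dtype,
    # which is why ops tests need a config in place at all.
    dtype: str = "bfloat16"


@dataclass
class FakeVllmConfig:
    model_config: FakeModelConfig = field(default_factory=FakeModelConfig)


@contextlib.contextmanager
def set_current_vllm_config(config):
    """Install `config` as the current config, restoring the previous one on exit."""
    global _CURRENT_CONFIG
    previous, _CURRENT_CONFIG = _CURRENT_CONFIG, config
    try:
        yield config
    finally:
        _CURRENT_CONFIG = previous


def get_current_vllm_config():
    return _CURRENT_CONFIG
```

In a real conftest this context manager would be wrapped in an autouse pytest fixture so every ops test runs inside it.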
```diff
 def generate_store_output(block_hashes: Iterable[BlockHash]):
     block_hashes = list(block_hashes)
     return PrepareStoreOutput(
-        block_hashes_to_store=list(block_hashes),
+        keys_to_store=list(block_hashes),
         store_spec=MockLoadStoreSpec(block_hashes),
-        block_hashes_evicted=[],
+        evicted_keys=[],
     )
```
`generate_store_output` and `MockLoadStoreSpec` still use `block_hashes` naming, but the returned `PrepareStoreOutput` now uses the renamed `keys_to_store` / `evicted_keys` fields. Consider renaming the function parameter and local variables (and related mock spec fields, if appropriate) to `keys` to match the updated API and avoid confusion when reading or extending these tests.
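Applied to the helper, the reviewer's rename suggestion could look like the sketch below. The `PrepareStoreOutput` and `MockLoadStoreSpec` dataclasses here are minimal stand-ins for the vLLM test types, defined only so the example is self-contained.

```python
# Sketch of the suggested rename: parameter/locals named keys, matching the
# renamed keys_to_store / evicted_keys fields (types are stand-ins).
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MockLoadStoreSpec:
    keys: list


@dataclass
class PrepareStoreOutput:
    keys_to_store: list
    store_spec: MockLoadStoreSpec
    evicted_keys: list


def generate_store_output(keys: Iterable):
    keys = list(keys)
    return PrepareStoreOutput(
        keys_to_store=list(keys),
        store_spec=MockLoadStoreSpec(keys),
        evicted_keys=[],
    )
```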
```diff
@@ -121,9 +100,6 @@ def forward_impl(
         output_scale: torch.Tensor | None = None,
         output_block_scale: torch.Tensor | None = None,
     ) -> torch.Tensor:
```
`forward_impl` still accepts `output` / `output_scale` / `output_block_scale`, but the implementation ignores these parameters and will overwrite `output` locally. Since the earlier explicit `NotImplementedError` guard was removed, callers that pass an output buffer could now get silently incorrect behavior. Consider either (a) restoring an explicit error when `output` is provided, or (b) implementing true output-buffer support (writing into the provided tensor) and documenting the contract.
Suggested change:

```python
    ) -> torch.Tensor:
        if (output is not None or output_scale is not None
                or output_block_scale is not None):
            raise NotImplementedError(
                "HPUMLAAttention.forward_impl does not support caller-"
                "provided output, output_scale, or output_block_scale.")
```
✅ CI Passed
All checks passed successfully against the following vllm commit:
Summary
Fixes five regressions introduced by recent upstream vLLM changes that break HPU unit tests and model execution.
Changes
- Remove `use_output` guard from the HPU attention patch — attribute removed upstream
- Remove `accept_output_buffer` branching from HPU MLA attention — attribute removed upstream; unconditionally use the output buffer in the opaque path, while the direct call path manages output internally
- Rename `block_hashes` → `keys` and `block_hashes_to_store` → `keys_to_store`; config access now via `kv_group_configs[0]`
- `_POSSIBLE_FP8_BLOCK_KERNELS` dict needs an OOT entry; provide a `VllmConfig` stub for ops unit tests
- Add `manual_seed_all` to `HpuPlatform` — new required platform method for RNG seeding

Upstream PRs that introduced these regressions
- Removed `accept_output_buffer` and `use_output` from the attention layer (fixes 1, 2)
- Changed the `OffloadingConnectorScheduler` API (fix 3)
- Changed `model_config.dtype` access in `Fp8LinearMethod.__init__` and `_POSSIBLE_FP8_BLOCK_KERNELS` (fix 4)
- Added `manual_seed_all` as a required abstract method on `Platform` (fix 5)
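The shape of the new platform hook from fix 5 can be sketched as below. Only the method name `manual_seed_all` comes from this PR; the `Platform` base class here is a stand-in, and `random.seed` substitutes for the real device-wide RNG seeding so the sketch runs without Gaudi hardware.

```python
# Hedged sketch of the required platform seeding hook (stand-in classes).
import random


class Platform:  # minimal stand-in for the upstream abstract Platform
    @classmethod
    def manual_seed_all(cls, seed: int) -> None:
        raise NotImplementedError


class HpuPlatform(Platform):
    @classmethod
    def manual_seed_all(cls, seed: int) -> None:
        # The real plugin would seed every HPU device RNG here; random.seed
        # stands in to keep the example self-contained and runnable.
        random.seed(seed)
```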