[FIX_FOR_VLLM_CUSTOM=f976e3b98ba45677a2213673a442c6cbff141e8e] Fix upstream regressions in attention, FP8, offloading and platform (#1338)
Conversation
Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Pull request overview
Fixes multiple breakages caused by upstream vLLM API changes, keeping Gaudi/HPU attention, FP8, KV offloading tests, and platform integration compatible with the new interfaces.
Changes:
- Update HPU attention (regular + MLA) to align with the upstream removal of `use_output`/`accept_output_buffer`.
- Register an HPU FP8 block-scaled kernel stub and add an ops test conftest providing a minimal `VllmConfig` context.
- Update KV offloading connector unit tests for upstream event/output field renames and config layout changes.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `vllm_gaudi/platform.py` | Adds `manual_seed_all`, required by the upstream `Platform` API. |
| `vllm_gaudi/ops/hpu_fp8.py` | Registers an OOT entry for `_POSSIBLE_FP8_BLOCK_KERNELS` via an HPU stub kernel class. |
| `vllm_gaudi/ops/hpu_attention.py` | Removes dependency on the upstream-removed `use_output` attribute in attention patching logic. |
| `vllm_gaudi/attention/oot_mla.py` | Removes `accept_output_buffer` branching and standardizes output-buffer usage in the opaque path. |
| `tests/unit_tests/ops/conftest.py` | Introduces a fixture that sets a minimal current `VllmConfig` for ops unit tests. |
| `tests/unit_tests/kv_offload/offloading_connector/utils.py` | Updates scheduler config assertions and adapts the mocked `PrepareStoreOutput` to renamed fields. |
| `tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py` | Updates offloading event tests to use `OffloadKey`/`make_offload_key` and new event fields. |
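The `tests/unit_tests/ops/conftest.py` fixture mentioned above installs a minimal "current `VllmConfig`" around each ops test. A self-contained sketch of that pattern is below; the `FakeModelConfig`/`FakeVllmConfig` classes and the module-global slot are illustrative stand-ins, not the real vLLM types or its actual config-context machinery.

```python
# Sketch of a "current config" context for ops unit tests (stand-in types).
import contextlib
from dataclasses import dataclass, field

_CURRENT_CONFIG = None  # module-global slot standing in for vLLM's current-config state


@dataclass
class FakeModelConfig:
    # Upstream Fp8LinearMethod.__init__ now reads model_config.dtype,
    # which is why ops tests need a config in place at all.
    dtype: str = "bfloat16"


@dataclass
class FakeVllmConfig:
    model_config: FakeModelConfig = field(default_factory=FakeModelConfig)


@contextlib.contextmanager
def set_current_vllm_config(config):
    """Install `config` as the current config, restoring the previous one on exit."""
    global _CURRENT_CONFIG
    previous, _CURRENT_CONFIG = _CURRENT_CONFIG, config
    try:
        yield config
    finally:
        _CURRENT_CONFIG = previous


def get_current_vllm_config():
    return _CURRENT_CONFIG
```

In a real conftest this context manager would be wrapped in an autouse pytest fixture so every ops test runs inside it.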
```diff
 def generate_store_output(block_hashes: Iterable[BlockHash]):
     block_hashes = list(block_hashes)
     return PrepareStoreOutput(
-        block_hashes_to_store=list(block_hashes),
+        keys_to_store=list(block_hashes),
         store_spec=MockLoadStoreSpec(block_hashes),
-        block_hashes_evicted=[],
+        evicted_keys=[],
     )
```
`generate_store_output` and `MockLoadStoreSpec` still use `block_hashes` naming, but the returned `PrepareStoreOutput` now uses the renamed `keys_to_store` / `evicted_keys` fields. Consider renaming the function parameter and local variables (and related mock spec fields, if appropriate) to `keys` to match the updated API and avoid confusion when reading or extending these tests.
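Applied to the helper, the reviewer's rename suggestion could look like the sketch below. The `PrepareStoreOutput` and `MockLoadStoreSpec` dataclasses here are minimal stand-ins for the vLLM test types, defined only so the example is self-contained.

```python
# Sketch of the suggested rename: parameter/locals named keys, matching the
# renamed keys_to_store / evicted_keys fields (types are stand-ins).
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MockLoadStoreSpec:
    keys: list


@dataclass
class PrepareStoreOutput:
    keys_to_store: list
    store_spec: MockLoadStoreSpec
    evicted_keys: list


def generate_store_output(keys: Iterable):
    keys = list(keys)
    return PrepareStoreOutput(
        keys_to_store=list(keys),
        store_spec=MockLoadStoreSpec(keys),
        evicted_keys=[],
    )
```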
```diff
@@ -121,9 +100,6 @@ def forward_impl(
         output_scale: torch.Tensor | None = None,
         output_block_scale: torch.Tensor | None = None,
     ) -> torch.Tensor:
```
`forward_impl` still accepts `output` / `output_scale` / `output_block_scale`, but the implementation ignores these parameters and will overwrite `output` locally. Since the earlier explicit `NotImplementedError` guard was removed, callers that pass an output buffer could now get silently incorrect behavior. Consider either (a) restoring an explicit error when `output` is provided, or (b) implementing true output-buffer support (writing into the provided tensor) and documenting the contract.
Suggested change:

```python
    ) -> torch.Tensor:
        if (output is not None or output_scale is not None
                or output_block_scale is not None):
            raise NotImplementedError(
                "HPUMLAAttention.forward_impl does not support caller-"
                "provided output, output_scale, or output_block_scale.")
```
✅ CI Passed
All checks passed successfully against the following vllm commit:
Summary
Fixes five regressions introduced by recent upstream vLLM changes that break HPU unit tests and model execution.
Changes
- Remove `use_output` guard from the HPU attention patch — attribute removed upstream
- Remove `accept_output_buffer` branching from HPU MLA attention — attribute removed upstream; unconditionally use the output buffer in the opaque path, while the direct call path manages output internally
- Rename `block_hashes` → `keys` and `block_hashes_to_store` → `keys_to_store`; config access now via `kv_group_configs[0]`
- `_POSSIBLE_FP8_BLOCK_KERNELS` dict needs an OOT entry; provide a `VllmConfig` stub for ops unit tests
- Add `manual_seed_all` to `HpuPlatform` — new required platform method for RNG seeding

Upstream PRs that introduced these regressions
- Removed `accept_output_buffer` and `use_output` from the attention layer (fixes 1, 2)
- Changed the `OffloadingConnectorScheduler` API (fix 3)
- Changed `model_config.dtype` access in `Fp8LinearMethod.__init__` and `_POSSIBLE_FP8_BLOCK_KERNELS` (fix 4)
- Added `manual_seed_all` as a required abstract method on `Platform` (fix 5)
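The shape of the new platform hook from fix 5 can be sketched as below. Only the method name `manual_seed_all` comes from this PR; the `Platform` base class here is a stand-in, and `random.seed` substitutes for the real device-wide RNG seeding so the sketch runs without Gaudi hardware.

```python
# Hedged sketch of the required platform seeding hook (stand-in classes).
import random


class Platform:  # minimal stand-in for the upstream abstract Platform
    @classmethod
    def manual_seed_all(cls, seed: int) -> None:
        raise NotImplementedError


class HpuPlatform(Platform):
    @classmethod
    def manual_seed_all(cls, seed: int) -> None:
        # The real plugin would seed every HPU device RNG here; random.seed
        # stands in to keep the example self-contained and runnable.
        random.seed(seed)
```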