[FIX_FOR_VLLM_CUSTOM=14acf429ac08b6d538ca6feb3e06b6d13895804d] Fix CPUOffloadingSpec import path and remove obsolete roberta patch #1229
Conversation
Pull request overview
Fixes breakages caused by recent upstream vLLM refactors by updating import paths and removing an obsolete RoBERTa monkey-patch.
Changes:
- Update the `CPUOffloadingSpec` import to the new `vllm.v1.kv_offload.cpu.spec` module location.
- Remove the now-obsolete RoBERTa forward monkey-patch and stop importing it during model registration.
- Leave a short note in `roberta.py` explaining why the patch was removed.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `vllm_gaudi/v1/kv_offload/worker/cpu_hpu.py` | Switches to the new upstream import path for `CPUOffloadingSpec`. |
| `vllm_gaudi/models/roberta.py` | Removes the previous monkey-patch implementation and replaces it with an explanatory note. |
| `vllm_gaudi/__init__.py` | Stops importing the removed RoBERTa patch module during model registration. |
Force-pushed from 248cbdd to 7377dd7
```python
if token_type_ids is not None:
    assert self.roberta.config.vocab_size < (1 << TOKEN_TYPE_SHIFT)
    assert input_ids is not None
```
roberta.py was added in #1001; it provided special handling of `_encode_token_type_ids`. After the removal of roberta.py, `_encode_token_type_ids(input_ids, token_type_ids)` from the upstream forward function will be used instead. Let's wait for the RoBERTa model tests' results.
I have similar concerns here. Let’s wait for the test results.
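For context on what the assert in the diff above guards: the condition `vocab_size < (1 << TOKEN_TYPE_SHIFT)` suggests the token type is bit-packed into the high bits of each input ID. The following is a hedged sketch of that packing idea with illustrative names and an assumed shift value; it is not the upstream `_encode_token_type_ids` implementation.

```python
# Hedged sketch of bit-packing token type IDs into input IDs. The shift value
# and helper names are assumptions for illustration, not the upstream code.
TOKEN_TYPE_SHIFT = 30  # assumed; requires vocab_size < (1 << TOKEN_TYPE_SHIFT)


def pack_token_type_ids(input_ids, token_type_ids):
    # Token IDs stay below 1 << TOKEN_TYPE_SHIFT, so the type bits never
    # collide with the token ID bits.
    return [i | (t << TOKEN_TYPE_SHIFT) for i, t in zip(input_ids, token_type_ids)]


def unpack_token_type_ids(packed):
    mask = (1 << TOKEN_TYPE_SHIFT) - 1
    return [p & mask for p in packed], [p >> TOKEN_TYPE_SHIFT for p in packed]
```

This round-trips losslessly exactly when every token ID fits below the shift, which is what the assert enforces.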
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Force-pushed from 12464ac to 016dcc3
Fix CPUOffloadingSpec import path and remove obsolete roberta patch

- Update CPUOffloadingSpec import from vllm.v1.kv_offload.cpu to vllm.v1.kv_offload.cpu.spec (upstream PR #37874 refactored cpu.py into a cpu/ package)
- Remove roberta monkey-patch that called the now-deleted replace_roberta_positions function (upstream PR #37884 moved the position offset adjustment into RobertaEmbedding.forward())
- Remove corresponding roberta import from register_models()

Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Replace removed vllm_is_batch_invariant with envs.VLLM_BATCH_INVARIANT

Upstream vLLM PR #35007 removed the vllm_is_batch_invariant() function from batch_invariant.py, replacing it with a direct envs read. Update vllm-gaudi to match.

Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Co-authored-by: GitHub Copilot
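The pattern of replacing a helper function with a direct environment-flag read can be sketched as below. This is a stand-in using `os.environ`; the real code consults vLLM's own `envs` module, whose parsing semantics may differ.

```python
import os


def batch_invariant_enabled() -> bool:
    """Illustrative stand-in for the direct envs read that replaced
    vllm_is_batch_invariant(). vLLM's envs module, not raw os.environ,
    is what the actual code reads."""
    return os.environ.get("VLLM_BATCH_INVARIANT", "0").lower() in ("1", "true")
```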
…decode path against None key_cache

During V1 warmup with LoRA or KV-offloading, the decode path can be called before KV caches are bound; flat_pa then crashes with an AttributeError on key_cache.shape when key_cache is None. Add a None check in the decode path of HPUAttentionImpl.forward to return zeros when key_cache is not available, matching the defensive pattern already used in the prompt path.

Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Co-authored-by: GitHub Copilot
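The defensive pattern described in that commit can be sketched as follows, with plain Python lists standing in for HPU tensors. The function and argument names are illustrative; the real guard lives in HPUAttentionImpl.forward.

```python
def decode_forward(query, key_cache, value_cache):
    """Minimal sketch of the None guard: return zeros shaped like the query
    when the KV caches are not yet bound (e.g. during warmup), instead of
    crashing on key_cache.shape."""
    if key_cache is None or value_cache is None:
        # Warmup with LoRA or KV-offloading can reach decode before the KV
        # caches exist; a zero output keeps tracing and warmup alive.
        return [0.0] * len(query)
    # Placeholder for the real flat_pa attention computation.
    return [q + sum(key_cache) for q in query]
```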
…_cache access after upstream list-to-element refactor

Upstream vLLM commit c59a132f9 (#37487) changed Attention.kv_cache from a list of tensors to a single tensor. The HPU attention and MLA attention code accessed self.kv_cache[0], which now returns the first sub-tensor slice instead of the intended KV cache tensor, causing corrupted inference results.

Fix: Replace self.kv_cache[0] with self.kv_cache in both affected files.

Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
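A toy reproduction of that failure mode, with nested lists standing in for tensors, shows why the stale `[0]` index corrupts results silently rather than raising an error:

```python
# Before upstream #37487: Attention.kv_cache was a one-element list,
# so [0] unwrapped the whole cache tensor.
kv_cache_old = [[[1, 2], [3, 4]]]   # list wrapping one "tensor"
whole = kv_cache_old[0]             # the full cache, as intended

# After the refactor: kv_cache *is* the tensor, so the stale [0] returns
# only the first sub-tensor slice -- valid indexing, wrong data.
kv_cache_new = [[1, 2], [3, 4]]
slice_not_cache = kv_cache_new[0]   # just the first row
```

Because tensor indexing stays legal after the refactor, the bug surfaces as corrupted outputs rather than a crash, which is what made it worth calling out in the commit message.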
Force-pushed from 016dcc3 to ea814d0
…_cache indexing in Qwen3.5 GatedDeltaNet

self.kv_cache is already a tuple (conv_state, ssm_state) assigned by the HPU model runner. The redundant intermediate index self.kv_cache[0][0/1] collapsed conv_state from 3-D to 2-D, causing an IndexError during Dynamo tracing.

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
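The dimension collapse described there can be reproduced in miniature with nested lists standing in for the HPU state tensors; variable names mirror the commit message, and the shapes are illustrative:

```python
# The model runner assigns kv_cache as a (conv_state, ssm_state) tuple.
conv_state = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # 3-D stand-in
ssm_state = [[0.0] * 4 for _ in range(2)]                       # 2-D stand-in
kv_cache = (conv_state, ssm_state)

correct_conv = kv_cache[0]    # 3-D conv_state, as intended
collapsed = kv_cache[0][0]    # stale extra index: only a 2-D slice remains
```

Any later code expecting three dimensions then indexes one level too deep, which is the IndexError Dynamo tracing hit.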
Force-pushed from ea814d0 to 963be20
✅ CI Passed: All checks passed successfully against the following vllm commit:
Summary
Multiple upstream vLLM changes broke the hourly CI (RED since 2026-03-23 13:07 UTC):
Changes
Impact
Fixes ALL 50+ e2e test failures and restores correct model output on HPU.
AI-assisted: All changes reviewed and verified on HPU hardware.