[FIX_FOR_VLLM_CUSTOM=5b39b268f506150dbab38f6f6c04b7c843e37c07] Fix upstream regressions: MoE refactor, DeepSeek V4 router, KV offload HMA #1403
Conversation
Pull request overview
Fixes multiple vLLM upstream API/regression changes (MoE runner refactor, DeepSeek V4 router args, KV offload scheduler refactor) to restore vllm-gaudi unit test compatibility at vLLM commit 5b39b268f506150dbab38f6f6c04b7c843e37c07.
Changes:
- Update the HPU MoE integration for upstream runner/router API changes (including `hash_indices_table` and router selection behavior).
- Migrate MoE-related unit tests from the removed `_forward_dispatch` to `runner.forward()` with a real `ForwardContext`.
- Sync KV offload test utilities with upstream multi-group/async scheduling and per-job completion tracking changes.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `vllm_gaudi/ops/hpu_lora.py` | Adapts LoRA registration/patching for the upstream `_all_lora_classes` container change. |
| `vllm_gaudi/ops/hpu_fused_moe.py` | Updates MoE runner imports and extends the router factory to accept `hash_indices_table` for DeepSeek V4. |
| `vllm_gaudi/ops/hpu_compressed_tensors.py` | Loosens the `apply_monolithic` signature to accept new upstream kwargs (e.g., `input_ids`). |
| `tests/unit_tests/ops/utils.py` | Updates the `create_fused_moe()` helper to pass new MoE ctor args. |
| `tests/unit_tests/ops/test_hpu_fused_moe.py` | Updates the MoE test to use `ForwardContext` + `runner.forward()`. |
| `tests/unit_tests/ops/test_hpu_compressed_tensors.py` | Updates the MoE test similarly to use `ForwardContext` + `runner.forward()`. |
| `tests/unit_tests/kv_offload/utils.py` | Extends the test helper output to include `kv_connector_worker_meta`. |
| `tests/unit_tests/kv_offload/offloading_connector/utils.py` | Syncs the offloading connector test harness with upstream async scheduling and keying changes. |
```diff
@@ -111,7 +120,8 @@ def __init__(self, vllm_config: VllmConfig, kv_cache_config: KVCacheConfig):
     self.manager = MagicMock(spec=OffloadingManager)
     self.manager.lookup.return_value = 0
```
MockOffloadingSpec sets manager.lookup.return_value twice (first to 0, then to False). This looks accidental and makes it unclear what the mocked contract is supposed to be. Please remove the redundant assignment and keep a single return type/value consistent with OffloadingManager.lookup's expected behavior.
```python
self.manager.lookup.return_value = 0
```
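To illustrate the reviewer's point, here is a minimal hedged sketch of keeping a single, consistent return contract on the mock; the `OffloadingManager` stand-in below is a simplified assumption, not the real vllm type.

```python
# Hedged sketch: the mocked lookup gets exactly one return value with a
# single type (int), instead of being assigned twice (0, then False).
from unittest.mock import MagicMock


class OffloadingManager:  # minimal stand-in for illustration only
    def lookup(self, block_hashes):
        raise NotImplementedError


manager = MagicMock(spec=OffloadingManager)
manager.lookup.return_value = 0  # one assignment, one contract

assert manager.lookup(["h1", "h2"]) == 0
```

Using `spec=OffloadingManager` also makes the mock reject attribute names the real class does not have, which catches contract drift like this earlier.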
```python
vllm.lora.utils._all_lora_classes = tuple(
    HPUVocabParallelEmbeddingWithLoRA if cls is VocabParallelEmbeddingWithLoRA else cls for cls in _all_lora_classes)
```
Switching vllm.lora.utils._all_lora_classes to a tuple here makes it immutable, but other code in this repo still treats it as a mutable set (e.g., vllm_gaudi/lora/layers/hpu_row_parallel_linear.py:64-69 calls discard/add). When VLLM_ROW_PARALLEL_CHUNKS > 1, register_hpu_lora_layers() will raise AttributeError. Consider updating all call sites to work with the tuple-based API (e.g., rebuild a new tuple with replacements/removals) or provide a small helper here that performs tuple-safe updates and is reused everywhere.
```diff
-vllm.lora.utils._all_lora_classes = tuple(
-    HPUVocabParallelEmbeddingWithLoRA if cls is VocabParallelEmbeddingWithLoRA else cls for cls in _all_lora_classes)
+def _replace_lora_class(classes, old_cls, new_cls):
+    updated_classes = [
+        new_cls if cls is old_cls else cls for cls in classes
+    ]
+    if isinstance(classes, tuple):
+        return tuple(updated_classes)
+    if isinstance(classes, set):
+        return set(updated_classes)
+    return type(classes)(updated_classes)
+
+vllm.lora.utils._all_lora_classes = _replace_lora_class(
+    _all_lora_classes,
+    VocabParallelEmbeddingWithLoRA,
+    HPUVocabParallelEmbeddingWithLoRA,
+)
```
```python
assert scoring_func in ["sigmoid", "softmax", "sqrtsoftplus"]

if e_score_correction_bias is not None or hash_indices_table is not None:
    return FusedTopKBiasRouter(
```
The docstring and priority-order comments are now out of sync with the implementation: scoring_func is asserted to allow "sqrtsoftplus", and FusedTopKBiasRouter is selected when hash_indices_table is provided (even if e_score_correction_bias is None). Please update the selection-order bullets and the scoring_func description in the docstring to match the new behavior.
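The selection behavior the comment describes can be sketched as follows; `FusedTopKBiasRouter` and the `scoring_func` values come from the snippet above, while the fallback router name and the function signature are assumptions for illustration.

```python
# Hedged sketch of the router selection order under review: the bias
# router is chosen whenever EITHER argument is supplied, so providing
# hash_indices_table alone (the DeepSeek V4 case) also selects it.
def pick_router(scoring_func, e_score_correction_bias=None, hash_indices_table=None):
    assert scoring_func in ["sigmoid", "softmax", "sqrtsoftplus"]
    if e_score_correction_bias is not None or hash_indices_table is not None:
        return "FusedTopKBiasRouter"
    return "FusedTopKRouter"  # assumed fallback name


assert pick_router("softmax", hash_indices_table=object()) == "FusedTopKBiasRouter"
assert pick_router("sigmoid") == "FusedTopKRouter"
```

This is exactly why docstring bullets written around "bias present implies bias router" are now out of date: the condition is a disjunction, not a single-flag check.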
✅ CI Passed
All checks passed successfully against the following vllm commit:
…A classes tuple (upstream #40560, #35077) Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…k V4 router API (upstream #40860) Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…ream per-job store completion (upstream #39186, #39403) Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…_fused_moe_router (upstream #40860) Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…h → forward (upstream #40560) Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
… per-job API (upstream #39186) Upstream vLLM PR#39186 changed OffloadingConnectorMetadata from reqs_to_store/reqs_to_load (dict[ReqId, TransferSpec]) to store_jobs/load_jobs (dict[int, TransferJob]) with req_id inside TransferJob. Update _get_prompts_and_decodes to extract req_ids from the new structure. Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…in HPU punica wrapper (upstream #35077) Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…MoE runner refactor (upstream #40560) - Handle is_internal_router being a read-only @property on INC-wrapped FusedMoE modules (PatchedMixtralMoE) by checking for a property descriptor before attempting attribute assignment. - Accept the new input_ids parameter in patched_fused_moe_forward to match the upstream MoERunner.forward signature change. Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…ejection_sample (upstream #40662) Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…orward calls (upstream #40860) Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
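The per-job metadata migration described in the commit messages above can be sketched as follows; `TransferJob` is reduced to a minimal stand-in and the helper name `req_ids_from_jobs` is hypothetical.

```python
# Hypothetical sketch: req_id moved from the dict key (reqs_to_store /
# reqs_to_load were dict[ReqId, TransferSpec]) into the TransferJob value
# (store_jobs / load_jobs are dict[int, TransferJob]), so req_ids must
# now be collected from the job objects themselves.
from dataclasses import dataclass


@dataclass
class TransferJob:  # simplified stand-in for the upstream type
    job_id: int
    req_id: str


def req_ids_from_jobs(store_jobs, load_jobs):
    # old API equivalent was roughly: set(reqs_to_store) | set(reqs_to_load)
    return ({job.req_id for job in store_jobs.values()}
            | {job.req_id for job in load_jobs.values()})


store_jobs = {0: TransferJob(0, "req-a")}
load_jobs = {1: TransferJob(1, "req-b")}
assert req_ids_from_jobs(store_jobs, load_jobs) == {"req-a", "req-b"}
```

Code still iterating the old dict keys would silently treat integer job ids as request ids, which matches the `EngineDeadError` failure mode described in the PR.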
🚧 CI Blocked
The main CI workflow was not started for the following reason:
✅ CI Passed
All checks passed successfully against the following vllm commit:
Fix multiple upstream vLLM regressions breaking vllm-gaudi unit tests at vllm commit `5b39b268f506150dbab38f6f6c04b7c843e37c07`.

Fixes

1. MoE runner import rename (upstream #40560)

`moe_runner_base.py` was removed and split into `moe_runner.py` + `moe_runner_interface.py`. The `MoERunnerBase` class was renamed to `MoERunner`.

Changes:
- `vllm_gaudi/ops/hpu_fused_moe.py`: Import `MoERunner as MoERunnerBase` from the `moe_runner` module; update the `get_layer_from_name` import path
- `vllm_gaudi/ops/hpu_lora.py`: Change `_all_lora_classes` from `set` to `tuple` (upstream #35077)

2. DeepSeek V4 router API — `hash_indices_table` (upstream #40860)

`FusedMoE.__init__` now passes `hash_indices_table` and `zero_expert_type` to `create_fused_moe_router()`, and an `input_ids` kwarg to `apply_monolithic()`.

Changes:
- `vllm_gaudi/ops/hpu_fused_moe.py`: Add a `hash_indices_table` parameter to the HPU `create_fused_moe_router()` override; pass it to `FusedTopKBiasRouter`; add a `scoring_func` assertion
- `vllm_gaudi/ops/hpu_compressed_tensors.py`: Add `**kwargs` to `HPUCompressedTensorsWNA16MoEMethod.apply_monolithic()` for the `input_ids` kwarg
- `tests/unit_tests/ops/utils.py`: Add `zero_expert_type` and `hash_indices_table` to the `create_fused_moe()` test helper

3. MoE test `_forward_dispatch` removal (upstream #40560)

`MoERunnerBase._forward_dispatch()` was removed. Tests must use `runner.forward()` with a proper `ForwardContext`.

Changes:
- `tests/unit_tests/ops/test_hpu_fused_moe.py`: Replace the `_forward_dispatch` call with `runner.forward()`; use a real `ForwardContext` with `no_compile_layers`
- `tests/unit_tests/ops/test_hpu_compressed_tensors.py`: Same migration

4. KV offload scheduler — HMA multi-group + per-job store completion (upstream #39186, #39403, #38453, #39401, #39402)

The offloading scheduler was refactored for multi-group KV support and per-job store completion tracking.

Changes:
- `tests/unit_tests/kv_offload/offloading_connector/utils.py`: Sync with upstream — `OffloadKey`/`OffloadingWorkerMetadata` types, async scheduling support, `TransferJobStatus` tracking, `build_connector_worker_meta()` integration
- `tests/unit_tests/kv_offload/utils.py`: Add a `kv_connector_worker_meta` parameter to `create_model_runner_output()`

5. OffloadingConnectorMetadata per-job API in model runner (upstream #39186)

`OffloadingConnectorMetadata` fields were renamed from `reqs_to_store`/`reqs_to_load` (`dict[ReqId, TransferSpec]`) to `store_jobs`/`load_jobs` (`dict[int, TransferJob]`), with `req_id` now inside `TransferJob`. This caused `EngineDeadError` crashes in all tests that use KV offloading or LoRA (3 CI failures).

Changes:
- `vllm_gaudi/v1/worker/hpu_model_runner.py`: Update `_get_prompts_and_decodes()` to extract `req_id` from `TransferJob` objects in `store_jobs`/`load_jobs` instead of iterating over the removed `reqs_to_store`/`reqs_to_load` — both for the direct `OffloadingConnectorMetadata` and nested `MultiKVConnectorMetadata` cases

6. LoRA punica wrapper — add_shrink / add_expand (upstream #35077)

Upstream refactored `PunicaWrapperBase` to add `add_shrink()` and `add_expand()` methods. The HPU punica wrapper was missing these, causing `AttributeError`.

Changes:
- `vllm_gaudi/lora/hpu_punica_wrapper.py`: Implement `add_shrink()` and `add_expand()` in the HPU punica wrapper

7. Rejection sampler — synthetic_mode kwarg (upstream #40662)

Upstream added a `synthetic_mode` parameter to `rejection_sample()`. The HPU override was missing it.

Changes:
- `vllm_gaudi/v1/worker/hpu_model_runner.py`: Accept the `synthetic_mode` kwarg in the HPU `rejection_sample` override

8. MoE forward — pass input_ids to custom op (upstream #40860)

DeepSeek V4 support added an `input_ids` parameter at position 3 of the `_moe_forward`/`_moe_forward_shared` custom ops. `patched_fused_moe_forward` was not passing it through to `_forward_impl` and `_forward_entry`, causing `RuntimeError: Expected Optional[Tensor] but found str` in `hpu_dp_tests`.

Changes:
- `vllm_gaudi/ops/hpu_fused_moe.py`: Pass `input_ids` through both `_forward_impl` and `_forward_entry` calls in `patched_fused_moe_forward()`
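The fix-8 pattern can be sketched as below; the function names mirror the PR description, but the signatures and stand-in bodies are assumptions for illustration.

```python
# Hedged sketch: the patched forward threads the new third argument
# (input_ids) through to the wrapped implementation instead of dropping
# it, which would otherwise misalign the custom op's positional args.
def _forward_impl(hidden_states, router_logits, input_ids=None):
    # stand-in for the real custom-op implementation
    return ("impl", hidden_states, router_logits, input_ids)


def patched_fused_moe_forward(hidden_states, router_logits, input_ids=None):
    # before the fix, input_ids was not forwarded here, so a later
    # positional argument could land in its slot (a str where a Tensor
    # was expected, per the RuntimeError in hpu_dp_tests)
    return _forward_impl(hidden_states, router_logits, input_ids)


assert patched_fused_moe_forward("h", "r", input_ids="ids")[-1] == "ids"
```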