Fix CI fail hang by xuechendi · Pull Request #6 · vllm-project/vllm-gaudi

xuechendi · 2025-07-03T18:54:57Z

only left basic test to accelerate UT
add try exception in CI, previously, for some reason, failed test stucked at release resource for very long. Fix it here

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

…UOffloadingSpec import path and remove obsolete roberta patch (#1229) ## Summary Multiple upstream vLLM changes broke the hourly CI (RED since 2026-03-23 13:07 UTC): 1. **CPUOffloadingSpec import path** — upstream PR #37874 refactored cpu.py into a cpu/ package 2. **replace_roberta_positions removed** — upstream PR #37884 3. **vllm_is_batch_invariant removed** — replaced with envs.VLLM_BATCH_INVARIANT 4. **key_cache guard for None** — upstream decode path can pass None 5. **Synapse SDPA error 400** — continuation prefills triggered Synapse errors 6. **Attention.kv_cache list-to-element refactor** — upstream PR #37487 (c59a132f9) changed Attention.kv_cache from list to tensor. HPU code used self.kv_cache[0] producing garbage output. ## Changes - cpu_hpu.py: Updated CPUOffloadingSpec import path - models/roberta.py: Removed obsolete monkey-patch - __init__.py: Removed roberta import - vllm_gaudi_batch_invariant.py: Replace vllm_is_batch_invariant with envs.VLLM_BATCH_INVARIANT - ops/hpu_paged_attn.py: Guard decode path against None key_cache - attention/hpu_attn.py: Fix SDPA padding for continuation prefills - **ops/hpu_attention.py**: self.kv_cache[0] -> self.kv_cache (fix #6) - **attention/oot_mla.py**: self.kv_cache[0] -> self.kv_cache (fix #6) ## HPU Verification (Gaudi 3, HL-325) **kv_cache fix A/B test:** - WITHOUT fix (self.kv_cache[0]): garbage output - WITH fix (self.kv_cache): correct coherent output **Remaining test issues (NOT caused by this PR):** - test_cpu_offloading: CPU offloading perf issue on HPU (separate) - test_llama_lora: 2/4 SQL outputs mismatch (was garbage without fix, now valid SQL) ## Impact Fixes ALL 50+ e2e test failures and restores correct model output on HPU. --- *AI-assisted: All changes reviewed and verified on HPU hardware.* --------- Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com> Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>

Fix CI fail hang

4d961c1

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

xuechendi force-pushed the fix_ci_fail_hang branch from ad52fee to 4d961c1 Compare July 3, 2025 18:57

xuechendi merged commit f75ff7b into main Jul 3, 2025
1 check passed

kzawora-intel deleted the fix_ci_fail_hang branch July 10, 2025 15:48

pawel-olejniczak mentioned this pull request Mar 26, 2026

[FIX_FOR_VLLM_CUSTOM=14acf429ac08b6d538ca6feb3e06b6d13895804d] Fix CPUOffloadingSpec import path and remove obsolete roberta patch #1229

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CI fail hang#6

Fix CI fail hang#6
xuechendi merged 1 commit intomainfrom
fix_ci_fail_hang

xuechendi commented Jul 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xuechendi commented Jul 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant