
[FIX_FOR_VLLM_CUSTOM=dcacdf9a8860a86401127d1c8f93ebf3cfbfd026] Fix MultiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring#1436

Merged
iboiko-habana merged 10 commits into
vllm-project:mainfrom
pawel-olejniczak:fix/vllm-hourly-ci-fixes
May 19, 2026


Collaborator

@pawel-olejniczak pawel-olejniczak commented May 11, 2026

Fix upstream regressions affecting hourly CI:

  1. MultiModelEngineClient: Added the notify_kv_transfer_request_rejected implementation required by a new abstract method on EngineClient (upstream PR [Bugfix][KV Transfer][NIXL] Notify P node on pre-admission rejection to free stranded KV blocks vllm#41269)
  2. Qwen3.5 test harness: Updated test_common.py to read enforce_eager from model card config (with env var override), enabling per-model compilation control
  3. EPLB refactoring: Removed EMPTY_EPLB_STATE import and enable_eplb parameter from patched_create_fused_moe_router after upstream MoE refactor (upstream PR [MoE Refactor] EPLB refactoring for FusedMoE vllm#41055)

Note: The enforce_eager: true workaround for Qwen3.5 compilation has been removed. The root cause (a mamba_type str-vs-Enum comparison in hybrid cache allocation) is properly fixed by #1449, which should merge first.

Verified on HPU: unit tests pass on Gaudi 3 (MoE, FP8, compressed tensors).
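
To illustrate the class of bug named in the note above, here is a minimal sketch of how a str-vs-Enum comparison can silently evaluate to False in Python. The `MambaType` enum, its members, and both helper functions are hypothetical stand-ins, not the upstream definitions, and which side of the real comparison held the string is not stated in this PR.

```python
from enum import Enum


class MambaType(Enum):
    """Hypothetical stand-in for the upstream enum; member names are illustrative."""
    MAMBA2 = "mamba2"
    LINEAR_ATTENTION = "linear_attention"


def is_mamba2_buggy(mamba_type) -> bool:
    # A plain Enum member never compares equal to its string value,
    # so this check is silently False when an Enum is passed in.
    return mamba_type == "mamba2"


def is_mamba2_fixed(mamba_type) -> bool:
    # Normalise to the Enum first; MambaType(...) accepts either the
    # string value or an existing member and returns the member.
    return MambaType(mamba_type) is MambaType.MAMBA2


if __name__ == "__main__":
    print(is_mamba2_buggy(MambaType.MAMBA2))
    print(is_mamba2_fixed(MambaType.MAMBA2))
```

Normalising the input before comparing is one way to make such a check robust to either representation; the actual fix in #1449 may take a different approach.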

@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

Contributor

Copilot AI left a comment


Pull request overview

This PR aims to restore compatibility with upstream vLLM changes and to avoid an HPU torch.compile failure for the Qwen3.5-35B-A3B evaluation path by forcing eager execution.

Changes:

  • Implement EngineClient.notify_kv_transfer_request_rejected() in MultiModelEngineClient by delegating to the underlying engine.
  • Update the Qwen3.5-35B-A3B full-test model card to set enforce_eager: true (intended to bypass graph compilation).
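
The delegation pattern in the first bullet can be sketched as follows. This is an illustration under assumptions, not the code from this PR: `EngineClientLike`, `RecordingEngine`, `MultiModelClientSketch`, and `IncompleteClient` are hypothetical names, and the method is assumed to be async with arguments forwarded unchanged because the real upstream signature is not shown here.

```python
import asyncio
from abc import ABC, abstractmethod


class EngineClientLike(ABC):
    """Hypothetical stand-in for the upstream EngineClient interface."""

    @abstractmethod
    async def notify_kv_transfer_request_rejected(self, *args, **kwargs):
        ...


class RecordingEngine:
    """Fake underlying engine that records what it was asked to reject."""

    def __init__(self):
        self.rejected = []

    async def notify_kv_transfer_request_rejected(self, *args, **kwargs):
        self.rejected.append((args, kwargs))


class MultiModelClientSketch(EngineClientLike):
    """Wrapper that satisfies the abstract API by delegating to one engine."""

    def __init__(self, engine):
        self._engine = engine

    async def notify_kv_transfer_request_rejected(self, *args, **kwargs):
        # Forward the call untouched so the wrapper does not need to know
        # the exact upstream signature (an assumption of this sketch).
        return await self._engine.notify_kv_transfer_request_rejected(*args, **kwargs)


class IncompleteClient(EngineClientLike):
    """Subclass that omits the method; instantiating it raises TypeError."""


if __name__ == "__main__":
    engine = RecordingEngine()
    client = MultiModelClientSketch(engine)
    asyncio.run(client.notify_kv_transfer_request_rejected("req-1"))
    print(engine.rejected)
```

The `IncompleteClient` case shows why a new upstream abstract method breaks downstream subclasses: Python refuses to instantiate a class that leaves any abstract method unimplemented.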

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| vllm_gaudi/entrypoints/openai/multi_model_api_server.py | Adds delegation method to satisfy upstream EngineClient abstract API and includes minor formatting-only adjustments. |
| tests/full_tests/model_cards/qwen3.5-35b-a3b.yaml | Adds enforce_eager: true to attempt to disable compile/graph capture for this model’s full-test run. |

Comment thread tests/full_tests/model_cards/qwen3.5-35b-a3b.yaml
@pawel-olejniczak pawel-olejniczak force-pushed the fix/vllm-hourly-ci-fixes branch 3 times, most recently from 4a53ec6 to 81b2daa Compare May 12, 2026 13:50
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@pawel-olejniczak pawel-olejniczak changed the title [FIX_FOR_VLLM_CUSTOM=9efdddca283cc0eb7c37fa49c4a9d1c9bf59ec4e] Fix MultiModelEngineClient abstract method and Qwen3.5 compilation [FIX_FOR_VLLM_LATEST] Fix MultiModelEngineClient abstract method and Qwen3.5 compilation May 12, 2026
@pawel-olejniczak pawel-olejniczak changed the title [FIX_FOR_VLLM_LATEST] Fix MultiModelEngineClient abstract method and Qwen3.5 compilation [FIX_FOR_VLLM_CUSTOM=9efdddca283cc0eb7c37fa49c4a9d1c9bf59ec4e] Fix MultiModelEngineClient abstract method and Qwen3.5 compilation May 12, 2026
@pawel-olejniczak pawel-olejniczak force-pushed the fix/vllm-hourly-ci-fixes branch 2 times, most recently from c355fb7 to 6d45ba9 Compare May 13, 2026 09:02
…ltiModelEngineClient abstract method and Qwen3.5 compilation

- Add notify_kv_transfer_request_rejected() delegation to
  MultiModelEngineClient (upstream PR #41269 added new abstract method
  to EngineClient)
- Set enforce_eager=true for Qwen3.5-35B-A3B model card to work around
  aot_autograd view mutation assertion that fires during HPU graph
  compilation (upstream compilation changes between vLLM 8eb40113 and
  9efdddca trigger incompatibility with HPU's monkey-patched attention)

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>

The enforce_eager parameter was only read from the ENFORCE_EAGER
environment variable, ignoring the value set in model card YAML files.
This caused the Qwen3.5-35B-A3B test to fail with a
BackendCompilerFailed error on HPU because torch.compile was not
disabled despite enforce_eager: true being set in the model card.

Read enforce_eager from eval_config (model card) first, with env var
as override — consistent with how trust_remote_code, dtype, and
other model config fields are handled.

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
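
The precedence described in the commit message above (model card value first, environment variable as override) can be sketched roughly like this. The function names, the accepted truthy strings, and the default of False are assumptions made for illustration; the actual logic in test_common.py may differ.

```python
import os


def _parse_bool(text: str) -> bool:
    # Accepted truthy spellings are an assumption of this sketch.
    return text.strip().lower() in {"1", "true", "yes", "on"}


def resolve_enforce_eager(eval_config: dict, env=None) -> bool:
    """Return the effective enforce_eager setting.

    eval_config is the parsed model card; env defaults to os.environ
    and is injectable so the function is easy to test.
    """
    env = os.environ if env is None else env
    override = env.get("ENFORCE_EAGER")
    if override is not None and override.strip() != "":
        # The environment variable wins when it is set and non-empty.
        return _parse_bool(override)
    # Otherwise fall back to the model card, defaulting to False (assumed).
    return bool(eval_config.get("enforce_eager", False))


if __name__ == "__main__":
    card = {"enforce_eager": True}
    print(resolve_enforce_eager(card, env={}))
    print(resolve_enforce_eager(card, env={"ENFORCE_EAGER": "false"}))
```

Keeping the environment variable as the override lets a CI job flip the setting for a single run without editing the model card.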
@pawel-olejniczak pawel-olejniczak force-pushed the fix/vllm-hourly-ci-fixes branch from 6d45ba9 to 4488cec Compare May 13, 2026 16:59
@pawel-olejniczak pawel-olejniczak changed the title [FIX_FOR_VLLM_CUSTOM=9efdddca283cc0eb7c37fa49c4a9d1c9bf59ec4e] Fix MultiModelEngineClient abstract method and Qwen3.5 compilation [FIX_FOR_VLLM_LATEST] Fix MultiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring May 13, 2026
@pawel-olejniczak pawel-olejniczak changed the title [FIX_FOR_VLLM_LATEST] Fix MultiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring [FIX_FOR_VLLM_CUSTOM=dcacdf9a8860a86401127d1c8f93ebf3cfbfd026] Fix MultiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring May 13, 2026
@pawel-olejniczak pawel-olejniczak force-pushed the fix/vllm-hourly-ci-fixes branch from 4488cec to 35b43be Compare May 13, 2026 19:00
… EMPTY_EPLB_STATE import and enable_eplb parameter after upstream EPLB refactor

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
@pawel-olejniczak pawel-olejniczak force-pushed the fix/vllm-hourly-ci-fixes branch from 35b43be to b8231a4 Compare May 14, 2026 08:13
Comment thread tests/full_tests/model_cards/qwen3.5-35b-a3b.yaml Outdated
pawel-olejniczak and others added 4 commits May 15, 2026 11:00
Proper fix for Qwen3.5 compilation (mamba_type Enum comparison)
is in PR vllm-project#1449. The enforce_eager workaround causes performance
degradation and is unnecessary once vllm-project#1449 merges.

Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
dcacdf9a8860a86401127d1c8f93ebf3cfbfd026

@iboiko-habana iboiko-habana merged commit c0a59cf into vllm-project:main May 19, 2026
2 checks passed