[FIX_FOR_VLLM_CUSTOM=dcacdf9a8860a86401127d1c8f93ebf3cfbfd026] Fix MultiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring #1436
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Pull request overview
This PR aims to restore compatibility with upstream vLLM changes and to avoid an HPU torch.compile failure for the Qwen3.5-35B-A3B evaluation path by forcing eager execution.
Changes:
- Implement `EngineClient.notify_kv_transfer_request_rejected()` in `MultiModelEngineClient` by delegating to the underlying engine.
- Update the Qwen3.5-35B-A3B full-test model card to set `enforce_eager: true` (intended to bypass graph compilation).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `vllm_gaudi/entrypoints/openai/multi_model_api_server.py` | Adds delegation method to satisfy upstream `EngineClient` abstract API and includes minor formatting-only adjustments. |
| `tests/full_tests/model_cards/qwen3.5-35b-a3b.yaml` | Adds `enforce_eager: true` to attempt to disable compile/graph capture for this model's full-test run. |
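The delegation change described in the review can be illustrated with a minimal, self-contained sketch. It does not import vLLM: `EngineClientSketch`, `RecordingEngine`, `MultiModelClientSketch`, and `StaleWrapper` are hypothetical stand-ins, and the synchronous `request_id` signature is an assumption rather than the upstream one. The sketch only shows why a wrapper stops being instantiable once its base class gains an abstract method, and how a one-line delegation restores it.

```python
from abc import ABC, abstractmethod


class EngineClientSketch(ABC):
    """Hypothetical, simplified stand-in for an abstract engine client."""

    @abstractmethod
    def notify_kv_transfer_request_rejected(self, request_id: str) -> None:
        """Signature is assumed for illustration only."""


class RecordingEngine(EngineClientSketch):
    """Concrete engine that records which requests were reported as rejected."""

    def __init__(self) -> None:
        self.rejected: list[str] = []

    def notify_kv_transfer_request_rejected(self, request_id: str) -> None:
        self.rejected.append(request_id)


class MultiModelClientSketch(EngineClientSketch):
    """Wrapper that forwards the call to the engine it currently wraps."""

    def __init__(self, engine: EngineClientSketch) -> None:
        self._engine = engine

    def notify_kv_transfer_request_rejected(self, request_id: str) -> None:
        # Pure delegation: the wrapper adds no behavior of its own.
        self._engine.notify_kv_transfer_request_rejected(request_id)


class StaleWrapper(EngineClientSketch):
    """Wrapper written before the abstract method existed (no override)."""

    def __init__(self, engine: EngineClientSketch) -> None:
        self._engine = engine


def can_instantiate(cls, *args) -> bool:
    """Return False when Python refuses to build an abstract class."""
    try:
        cls(*args)
        return True
    except TypeError:
        return False
```

In this sketch, `StaleWrapper` fails at construction time with a `TypeError`, which is the general failure mode a new upstream abstract method produces in downstream subclasses.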
4a53ec6 to 81b2daa
c355fb7 to 6d45ba9
…ltiModelEngineClient abstract method and Qwen3.5 compilation

- Add notify_kv_transfer_request_rejected() delegation to MultiModelEngineClient (upstream PR #41269 added new abstract method to EngineClient)
- Set enforce_eager=true for Qwen3.5-35B-A3B model card to work around aot_autograd view mutation assertion that fires during HPU graph compilation (upstream compilation changes between vLLM 8eb40113 and 9efdddca trigger incompatibility with HPU's monkey-patched attention)

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
The enforce_eager parameter was only read from the ENFORCE_EAGER environment variable, ignoring the value set in model card YAML files. This caused the Qwen3.5-35B-A3B test to fail with a BackendCompilerFailed error on HPU because torch.compile was not disabled despite enforce_eager: true being set in the model card.

Read enforce_eager from eval_config (model card) first, with the env var as an override, consistent with how trust_remote_code, dtype, and other model config fields are handled.

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
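The precedence described in this commit message (model card value first, environment variable as override) could look roughly like the sketch below. The function name `resolve_enforce_eager`, the default of `False`, and the set of strings treated as true are assumptions made for illustration; the actual code in `test_common.py` is not shown on this page.

```python
import os
from typing import Mapping, Optional

# Strings treated as "true" when parsing the env var (an assumption).
_TRUTHY = {"1", "true", "yes", "on"}


def resolve_enforce_eager(
    eval_config: Mapping[str, object],
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Resolve enforce_eager: model card first, ENFORCE_EAGER env var overrides."""
    env = os.environ if environ is None else environ

    # Start from the model card; a YAML loader is assumed to yield a real bool.
    value = bool(eval_config.get("enforce_eager", False))

    # If the env var is set at all, it wins over the model card value.
    raw = env.get("ENFORCE_EAGER")
    if raw is not None:
        value = raw.strip().lower() in _TRUTHY
    return value
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`; the bug described above corresponds to skipping the model card lookup entirely.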
6d45ba9 to 4488cec
4488cec to 35b43be
… EMPTY_EPLB_STATE import and enable_eplb parameter after upstream EPLB refactor

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
35b43be to b8231a4
Proper fix for Qwen3.5 compilation (mamba_type Enum comparison) is in PR vllm-project#1449. The enforce_eager workaround causes performance degradation and is unnecessary once vllm-project#1449 merges.

Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
✅ CI Passed: All checks passed successfully against the following vllm commit:
Fix upstream regressions affecting hourly CI:

- Add the `notify_kv_transfer_request_rejected` abstract method (upstream PR vllm#41269, "[Bugfix][KV Transfer][NIXL] Notify P node on pre-admission rejection to free stranded KV blocks")
- Update `test_common.py` to read `enforce_eager` from model card config (with env var override), enabling per-model compilation control
- Remove the `EMPTY_EPLB_STATE` import and `enable_eplb` parameter from `patched_create_fused_moe_router` after the upstream MoE refactor (upstream PR vllm#41055, "[MoE Refactor] EPLB refactoring for FusedMoE")

Note: The `enforce_eager: true` workaround for Qwen3.5 compilation has been removed; the root cause (`mamba_type` str-vs-Enum comparison in hybrid cache allocation) is properly fixed by #1449, which should merge first.

Verified on HPU: unit tests pass on Gaudi 3 (MoE, FP8, compressed tensors).
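The str-vs-Enum root cause named in the note above is a general Python pitfall, and it can be sketched in isolation. The enum `MambaTypeSketch`, its members, and both helper functions are hypothetical; the real `mamba_type` values and the actual fix live in #1449 and are not reproduced here. The sketch assumes the comparison mixed a plain `Enum` member with a string literal.

```python
from enum import Enum


class MambaTypeSketch(Enum):
    """Hypothetical enum; member names and values are illustrative only."""

    MAMBA2 = "mamba2"
    LINEAR_ATTENTION = "linear_attention"


def is_mamba2_buggy(mamba_type) -> bool:
    # A plain Enum member never equals its underlying string value,
    # so this silently returns False once callers pass Enum members.
    return mamba_type == "mamba2"


def is_mamba2_fixed(mamba_type) -> bool:
    # Normalize first: Enum(value) accepts either the raw value or a member.
    return MambaTypeSketch(mamba_type) is MambaTypeSketch.MAMBA2
```

Because the buggy comparison returns `False` instead of raising, a branch guarded by it is skipped silently, which is why this kind of mismatch tends to surface far from its source.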