fix(xpu): Re-compute compile ranges after platform-specific config updates #37523
ProExpertProg merged 7 commits into vllm-project:main
Conversation
Code Review
This pull request correctly addresses an AssertionError during model compilation warmup by filtering out warmup sizes that exceed the model runner's token capacity. The change ensures that _dummy_run is only called with valid sizes, preventing the crash. My feedback includes a suggestion to optimize the code by partitioning the warmup sizes in a single pass, which improves efficiency and readability.
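The single-pass partition suggested in this review could look like the following sketch. This is illustrative only: the names `partition_warmup_sizes`, `warmup_sizes`, and `max_num_tokens` are hypothetical, not actual vLLM identifiers.

```python
# Hypothetical sketch: split warmup sizes into valid/invalid in ONE pass,
# instead of filtering the list twice. Names are illustrative, not vLLM's.
def partition_warmup_sizes(warmup_sizes, max_num_tokens):
    valid, invalid = [], []
    for size in warmup_sizes:
        # Append each size to exactly one bucket based on the capacity check.
        (valid if size <= max_num_tokens else invalid).append(size)
    return valid, invalid

valid, skipped = partition_warmup_sizes([512, 2048, 4096, 8192], 4096)
# valid == [512, 2048, 4096], skipped == [8192]
```

The single pass avoids iterating the list twice and makes the "skipped" sizes available for a warning log if desired.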
what's your
The XPU compile warmup runs one or more forward passes using a set of “preset/enumerated” token counts to trigger graph compilation; however, under certain configurations, these token counts may exceed the actual token capacity allowed by the current ModelRunner, making them “invalid warmup sizes.”
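As a concrete illustration of the mismatch (numbers are the hypothetical "e.g." values from the PR description, not measured ones): endpoints computed from the default `max_num_batched_tokens` can exceed the capacity the runner ends up with after the platform lowers that value.

```python
# Illustrative only: endpoints computed BEFORE the platform update include
# sizes the runner can no longer execute after the cap is lowered.
default_max_num_batched_tokens = 8192   # value seen by _set_compile_ranges()
xpu_mla_cap = 4096                      # value after check_and_update_config()

warmup_sizes = [512, 2048, 4096, 8192]  # preset/enumerated token counts
invalid = [s for s in warmup_sizes if s > xpu_mla_cap]
print(invalid)  # [8192] -> would trip `assert num_tokens <= self.max_num_tokens`
```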
does CUDA have the same behavior? If not, there may be some XPU config setting error; we should fix that instead.
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Yuxiang Liang <yuliang@habana.ai>
…ation updates Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com>
Hi @Liangyx2, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
actually I am thinking we should remove this https://github.com/vllm-project/vllm/blob/main/vllm/platforms/xpu.py#L212-L221
and I plan to land this for MLA (#37143) soon.
ProExpertProg
left a comment
Looks good, please remove the old call though
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com>
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com>
Head branch was pushed to by a user without write access
…dates (vllm-project#37523) Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com> Signed-off-by: Yuxiang Liang <yuliang@habana.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
Main2main: upgrade the vLLM commit to 0320 17:00.
1. Fix: vLLM refactored `_moe_forward` to call `runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True. vLLM PR: "[MoE Refactor] DefaultMoERunner simplification" [#33049](vllm-project/vllm#33049)
2. Fix: vLLM moved the call to `self._set_compile_ranges()` in `VllmConfig.__post_init__` from **before** `check_and_update_config()` to **after** it (to allow platforms to lower `max_num_batched_tokens` first). vLLM PR: "fix(xpu): Re-compute compile ranges after platform-specific config updates" [#37523](vllm-project/vllm#37523)

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@8b63257

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Summary
Fix compile range computation order to respect platform-specific scheduler config updates.
This ensures torch.compile warmup uses sizes that are valid for the actual max_num_batched_tokens,
particularly on the XPU backend, where MLA models have constraints that reduce this value.
Issue
When using torch.compile mode on the XPU backend with MLA-enabled models:
1. `_set_compile_ranges()` computes endpoints based on the default `max_num_batched_tokens` (e.g., 8192)
2. `check_and_update_config()` for XPU detects MLA and lowers `max_num_batched_tokens` (e.g., to 4096)
3. Warmup then calls `_dummy_run()` with the stale sizes, tripping `assert num_tokens <= self.max_num_tokens`

This is a configuration order bug: compile ranges should be finalized AFTER platform
config updates, not before.
Root Cause
In `VllmConfig.__post_init__()`:
1. `_set_compile_ranges()` called (uses original `max_num_batched_tokens`)
2. `check_and_update_config()` called (XPU may lower `max_num_batched_tokens`)

Fix
Move the second `_set_compile_ranges()` call to execute immediately after `check_and_update_config()` to ensure compile ranges reflect all platform-specific scheduler config updates.
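A minimal sketch of the corrected ordering, assuming a heavily simplified stand-in for `VllmConfig.__post_init__` (the class body, `FakePlatform`, and the range computation here are all hypothetical; the real vLLM code differs):

```python
# Hypothetical, simplified sketch of the ordering fix; not the real vLLM code.
class FakePlatform:
    @staticmethod
    def check_and_update_config(cfg):
        cfg.max_num_batched_tokens = 4096  # e.g. XPU lowers the cap for MLA

class VllmConfigSketch:
    def __init__(self):
        self.max_num_batched_tokens = 8192  # default value
        # Fixed order: platform updates run FIRST, then compile ranges.
        FakePlatform.check_and_update_config(self)
        self._set_compile_ranges()

    def _set_compile_ranges(self):
        # Ranges now reflect the post-update cap (4096), not the default.
        self.compile_ranges = [s for s in (512, 2048, 4096, 8192)
                               if s <= self.max_num_batched_tokens]

cfg = VllmConfigSketch()
print(cfg.compile_ranges)  # [512, 2048, 4096]
```

With the pre-fix order (ranges computed before `check_and_update_config`), the 8192 endpoint would survive and later fail the `_dummy_run` assertion.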
Testing
Changes
`vllm/config/vllm.py`: Re-invoke `_set_compile_ranges()` after `check_and_update_config()`

Notes
This is a root-cause fix addressing the configuration order issue rather than
working around it in the warmup phase. It applies universally and prevents similar
issues on other platforms with custom scheduler config logic.