
fix(xpu): Re-compute compile ranges after platform-specific config updates#37523

Merged
ProExpertProg merged 7 commits into vllm-project:main from Liangyx2:VLLMZ-905
Mar 20, 2026

Conversation

@Liangyx2
Contributor

@Liangyx2 Liangyx2 commented Mar 19, 2026

Summary

Fix compile range computation order to respect platform-specific scheduler config updates.
This ensures torch.compile warmup uses sizes that are valid for the actual max_num_batched_tokens,
particularly on XPU backend where MLA models have constraints that reduce this value.

Issue

When using torch.compile mode on XPU backend with MLA-enabled models:

  1. Initial _set_compile_ranges() computes endpoints based on default max_num_batched_tokens (e.g., 8192)
  2. check_and_update_config() for XPU detects MLA and lowers max_num_batched_tokens (e.g., to 4096)
  3. During warmup, compile attempts to execute with size 8192 but max_num_tokens is now 4096
  4. Assertion fails in _dummy_run(): assert num_tokens <= self.max_num_tokens

This is a configuration-order bug: compile ranges should be finalized AFTER platform
config updates, not before.
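The failure sequence above can be sketched as follows. This is a simplified, hypothetical reconstruction (the real logic lives in `_set_compile_ranges` in vllm/config/vllm.py and in the model runner's `_dummy_run`); the size schedule and values are illustrative.

```python
def set_compile_ranges(max_num_batched_tokens):
    # Warmup endpoints are derived from the current token budget,
    # here sketched as powers of two up to the cap.
    sizes = []
    n = 16
    while n <= max_num_batched_tokens:
        sizes.append(n)
        n *= 2
    return sizes

def dummy_run(num_tokens, max_num_tokens):
    # Mirrors the assertion that fires during warmup.
    assert num_tokens <= max_num_tokens, f"{num_tokens} > {max_num_tokens}"

# 1. Ranges are computed against the default budget of 8192.
compile_sizes = set_compile_ranges(8192)
# 2. The platform check (XPU + MLA) lowers the budget afterwards,
#    but the ranges are never recomputed.
max_num_tokens = 4096
# 3. Warmup now trips the assertion at the 8192 endpoint.
try:
    for size in compile_sizes:
        dummy_run(size, max_num_tokens)
except AssertionError as e:
    print(f"AssertionError: {e}")  # AssertionError: 8192 > 4096
```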

Root Cause

In VllmConfig.__post_init__():

  • Line 989: _set_compile_ranges() called (uses original max_num_batched_tokens)
  • Line 1023: check_and_update_config() called (XPU may lower max_num_batched_tokens)
  • Compile ranges never updated to reflect the new limit

Fix

Move the second _set_compile_ranges() call to execute immediately after
check_and_update_config() to ensure compile ranges reflect all platform-specific
scheduler config updates.
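The corrected ordering can be sketched with hypothetical simplified classes (the real implementation is `VllmConfig.__post_init__` in vllm/config/vllm.py; class shapes, the size list, and the 4096 cap here are illustrative assumptions):

```python
class SchedulerConfig:
    def __init__(self, max_num_batched_tokens=8192):
        self.max_num_batched_tokens = max_num_batched_tokens

class Config:
    def __init__(self, scheduler, platform_is_xpu_mla=False):
        self.scheduler = scheduler
        self.platform_is_xpu_mla = platform_is_xpu_mla
        # Platform updates run first, then compile ranges are computed,
        # so any platform that lowers max_num_batched_tokens is
        # reflected in the warmup sizes.
        self.check_and_update_config()
        self.set_compile_ranges()

    def check_and_update_config(self):
        # Stand-in for the XPU platform hook that lowers the budget
        # when MLA is enabled.
        if self.platform_is_xpu_mla:
            self.scheduler.max_num_batched_tokens = 4096

    def set_compile_ranges(self):
        cap = self.scheduler.max_num_batched_tokens
        self.compile_sizes = [
            n for n in (512, 1024, 2048, 4096, 8192) if n <= cap
        ]

cfg = Config(SchedulerConfig(), platform_is_xpu_mla=True)
print(cfg.compile_sizes)  # [512, 1024, 2048, 4096]
```

With the old ordering (`set_compile_ranges` before `check_and_update_config`), the 8192 endpoint would survive and later violate the lowered cap.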

Testing

  • Tested with Pinaster/GLM-5_4layer model (MLA enabled) on XPU
  • Compile mode now successfully initializes without AssertionError

Changes

  • vllm/config/vllm.py: Re-invoke _set_compile_ranges() after check_and_update_config()

Notes

This is a root-cause fix addressing the configuration order issue rather than
working around it in the warmup phase. It applies universally and prevents similar
issues on other platforms with custom scheduler config logic.

@Liangyx2 Liangyx2 requested a review from njhill as a code owner March 19, 2026 06:47
@mergify mergify bot added the v1 label Mar 19, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses an AssertionError during model compilation warmup by filtering out warmup sizes that exceed the model runner's token capacity. The change ensures that _dummy_run is only called with valid sizes, preventing the crash. My feedback includes a suggestion to optimize the code by partitioning the warmup sizes in a single pass, which improves efficiency and readability.
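The single-pass partition the bot suggests could look like this hypothetical helper (a sketch only; the PR ultimately fixed the ordering in the config instead of filtering at warmup):

```python
def partition_warmup_sizes(sizes, max_num_tokens):
    # One pass over the candidate sizes: split into those the model
    # runner can execute and those exceeding its token capacity.
    valid, invalid = [], []
    for size in sizes:
        (valid if size <= max_num_tokens else invalid).append(size)
    return valid, invalid

valid, invalid = partition_warmup_sizes([512, 4096, 8192], 4096)
print(valid)    # [512, 4096]
print(invalid)  # [8192]
```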

@jikunshang
Collaborator

What are your invalid warmup sizes and max_num_tokens? I think this should never happen.

@Liangyx2
Contributor Author

The XPU compile warmup runs one or more forward passes using a set of “preset/enumerated” token counts to trigger graph compilation; however, under certain configurations, these token counts may exceed the actual token capacity allowed by the current ModelRunner, making them “invalid warmup sizes.”

Skipping invalid compile warmup sizes [8192] because they exceed max_num_tokens=4096.

@jikunshang
Collaborator

The XPU compile warmup runs one or more forward passes using a set of “preset/enumerated” token counts to trigger graph compilation; however, under certain configurations, these token counts may exceed the actual token capacity allowed by the current ModelRunner, making them “invalid warmup sizes.”

Skipping invalid compile warmup sizes [8192] because they exceed max_num_tokens=4096.

Does CUDA have the same behavior? If not, there may be an XPU config setting error; we should fix that instead.

@Liangyx2 Liangyx2 changed the title [VLLMZ-905] fix(xpu): Clamp compile warmup sizes to model runner token capacity [VLLMZ-905] fix(xpu): Re-compute compile ranges after platform-specific config updates Mar 19, 2026
Liangyx2 and others added 4 commits March 19, 2026 16:13
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Yuxiang Liang <yuliang@habana.ai>
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com>
…ation updates

Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com>
@mergify

mergify bot commented Mar 19, 2026

Hi @Liangyx2, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@jikunshang
Collaborator

and I plan to land this for MLA #37143 recently.

Collaborator

@ProExpertProg ProExpertProg left a comment


Looks good, please remove the old call though

@Liangyx2 Liangyx2 changed the title [VLLMZ-905] fix(xpu): Re-compute compile ranges after platform-specific config updates fix(xpu): Re-compute compile ranges after platform-specific config updates Mar 19, 2026
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com>
@mergify

mergify bot commented Mar 19, 2026

Hi @Liangyx2, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Member

@yewentao256 yewentao256 left a comment


LGTM, thanks for the work! Please fix the pre-commit issue so that we can land this

@jikunshang jikunshang enabled auto-merge (squash) March 20, 2026 02:03
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com>
auto-merge was automatically disabled March 20, 2026 02:04

Head branch was pushed to by a user without write access

@ProExpertProg ProExpertProg enabled auto-merge (squash) March 20, 2026 02:05
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 20, 2026
@ProExpertProg ProExpertProg merged commit 638a872 into vllm-project:main Mar 20, 2026
51 checks passed
chooper26 pushed a commit to intellistream/vllm-hust that referenced this pull request Mar 21, 2026
…dates (vllm-project#37523)

Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com>
Signed-off-by: Yuxiang Liang <yuliang@habana.ai>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Mar 23, 2026
### What this PR does / why we need it?
Main2main Upgrade vllm commit to 0320 17:00

1. Fix for vllm's refactor of `_moe_forward` to call
`runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True.
vllm PR: "[MoE Refactor] DefaultMoERunner simplification
[#33049](vllm-project/vllm#33049)"

2. Fix for vllm moving the call to `self._set_compile_ranges()` in
`VllmConfig.__post_init__` from **before** `check_and_update_config()`
to **after** it (to allow platforms to lower `max_num_batched_tokens`
first). vllm PR: "fix(xpu): Re-compute compile ranges after
platform-specific config updates"
[#37523](vllm-project/vllm#37523)


### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
vllm-project/vllm@8b63257

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Mar 25, 2026
### What this PR does / why we need it?
Main2main Upgrade vllm commit to 0320 17:00

1. Fix for vllm's refactor of `_moe_forward` to call
`runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True.
vllm PR: "[MoE Refactor] DefaultMoERunner simplification
[#33049](vllm-project/vllm#33049)"

2. Fix for vllm moving the call to `self._set_compile_ranges()` in
`VllmConfig.__post_init__` from **before** `check_and_update_config()`
to **after** it (to allow platforms to lower `max_num_batched_tokens`
first). vllm PR: "fix(xpu): Re-compute compile ranges after
platform-specific config updates"
[#37523](vllm-project/vllm#37523)


### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
vllm-project/vllm@8b63257

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>