# [Bugfix][V1] Warm up slot mapping before JIT monitor (#61)
## Purpose
vllm-project#40137 added Triton JIT monitoring that activates after warmup finishes. However, the V1 warmup path (`_dummy_run()`) never calls `BlockTable.compute_slot_mapping()`, so when the first real request arrives, `_compute_slot_mapping_kernel` compiles while the JIT monitor is already active, and users see an unexpected compilation warning during normal inference.

The problem has two sides. First, V1 warmup simply does not exercise the slot mapping path. Second, `_compute_slot_mapping_kernel` was specialized on the `num_tokens` parameter, so even if warmup compiled it for one token count, a request of a different size could trigger recompilation; the sketch below illustrates this.
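As a minimal illustration (this toy kernel and its names are hypothetical, not the vLLM kernel): Triton specializes integer arguments, for example on equality to 1 and on divisibility by 16, so launches whose `num_tokens` falls into a different specialization bucket can compile a separate kernel variant.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fill_kernel(out_ptr, num_tokens, BLOCK: tl.constexpr):
    # Write indices for the first num_tokens positions.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    tl.store(out_ptr + offs, offs, mask=offs < num_tokens)

out = torch.empty(64, dtype=torch.int32, device="cuda")
for n in (1, 5, 16):  # ==1, non-multiple of 16, multiple of 16
    # Each value lands in a different integer-specialization bucket,
    # so each launch may trigger a fresh compilation.
    fill_kernel[(1,)](out, n, BLOCK=64)
```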
The fix also has two parts.

First, I add `do_not_specialize=["num_tokens"]` to the kernel so that one compilation covers all request sizes. `max_num_tokens` stays specialized: it is constant for the lifetime of the engine, and Triton can use it to optimize the padding loop. See the sketch after this paragraph.
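A minimal sketch of the decorator change; the kernel body and argument list here are placeholders, not the real upstream kernel:

```python
import triton
import triton.language as tl

# num_tokens is excluded from specialization, so one compiled variant
# serves every request size. max_num_tokens is left specialized because
# it is fixed for the engine lifetime.
@triton.jit(do_not_specialize=["num_tokens"])
def _slot_mapping_sketch(out_ptr, num_tokens, max_num_tokens,
                         BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    tl.store(out_ptr + offs, offs, mask=offs < num_tokens)
```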
Second, I add a small warmup in `warmup_v1_slot_mapping_kernel()` that calls `compute_slot_mapping()` directly, before the JIT monitor activates. It temporarily uses block id 1 (block 0 is the null block), then clears the entry in a `finally` block. The warmup runs on all PP ranks because every rank calls `compute_slot_mapping()` during input preparation. A sketch follows.
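A minimal sketch of the warmup, assuming a `BlockTable` with a `block_table` tensor and a `compute_slot_mapping(req_indices, positions)` method; the argument names, dtypes, and shapes here are assumptions, not the exact upstream signature:

```python
import torch

def warmup_v1_slot_mapping_kernel(block_table) -> None:
    # Compile the slot-mapping kernel once, before the JIT monitor
    # starts flagging compilations as unexpected.
    # Block 0 is the null block, so temporarily point request 0 at block 1.
    block_table.block_table[0, 0] = 1
    try:
        block_table.compute_slot_mapping(
            torch.zeros(1, dtype=torch.int32, device="cuda"),  # req_indices
            torch.zeros(1, dtype=torch.int64, device="cuda"),  # positions
        )
    finally:
        # Clear the temporary entry so real requests see a clean table.
        block_table.block_table[0, 0] = 0
```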
I did not add a synthetic `execute_model()` warmup: that would require model-specific dummy inputs and is not safe for all model types. This PR covers only the slot mapping kernel. The V2 warmup path is not touched, and the V1 sampler warmup is not touched.
I checked the open PRs; there is no existing PR for this issue.
## Test Plan

Run the GPU model runner unit tests, the lint and type checks, and a local smoke test against the V1 runner.
## Test Result
- `tests/v1/worker/test_gpu_model_runner.py`: 34 passed, 16 warnings.
- `ruff format` / `ruff check`: passed.
- `mypy` (Python 3.10): passed.
- `git diff --check`: passed.
- Local smoke test on the V1 runner with Qwen3-8B (text-only): HTTP 200, and no `_compute_slot_mapping_kernel` compilation warning on the first request. A sketch of this check appears below.
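For reference, a minimal sketch of the smoke check (the port and model name are assumptions): send one completion request to a running vLLM OpenAI-compatible server and confirm a 200 response, while the server log stays free of the kernel compilation warning.

```python
import requests

# Assumes a server started with something like:
#   vllm serve Qwen/Qwen3-8B
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "Qwen/Qwen3-8B", "prompt": "Hello", "max_tokens": 8},
)
assert resp.status_code == 200  # then grep the server log for the warning
```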
AI assistance was used (Codex, Claude).