[Bugfix][V1] Warm up slot mapping before JIT monitor #61

Draft

lesj0610 wants to merge 2 commits into main from lesj/v1-slot-mapping-jit-warmup-upstream-20260509

[Bugfix][V1] Warm up slot mapping before JIT monitor#61
lesj0610 wants to merge 2 commits into
mainfrom
lesj/v1-slot-mapping-jit-warmup-upstream-20260509

Conversation


@lesj0610 lesj0610 commented May 9, 2026

Purpose

vllm-project#40137 added Triton JIT monitoring that activates after warmup finishes. But the V1 warmup path (_dummy_run()) never calls BlockTable.compute_slot_mapping(). So when the first real request arrives, _compute_slot_mapping_kernel compiles while the JIT monitor is already active, and users see an unexpected compilation warning during normal inference.

The problem has two sides. First, the V1 warmup simply does not exercise the slot mapping path. Second, _compute_slot_mapping_kernel was specialized on the num_tokens parameter, so even if we warm up with one token count, a different request size triggers recompilation again.

The fix also has two parts.

I add do_not_specialize=["num_tokens"] to the kernel so one compilation covers all request sizes. max_num_tokens stays specialized: it is constant for the engine lifetime, and Triton can optimize the padding loop with it.
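To illustrate why this works, here is a toy model of a value-specializing JIT cache in plain Python. It is an analogy only, not Triton's actual implementation: ToyJITKernel and its cache are hypothetical names. The point is that excluding num_tokens from the cache key collapses all request sizes into one compilation, while max_num_tokens remains part of the key.

```python
# Toy model of JIT specialization (illustration only, not Triton's real
# internals). Each distinct value of a specialized argument produces a
# separate "compilation"; arguments in do_not_specialize are left out of
# the cache key.

class ToyJITKernel:
    def __init__(self, do_not_specialize=()):
        self.do_not_specialize = set(do_not_specialize)
        self.cache = {}        # cache key -> "compiled" kernel
        self.compile_count = 0

    def __call__(self, **kwargs):
        key = tuple(sorted(
            (name, value) for name, value in kwargs.items()
            if name not in self.do_not_specialize
        ))
        if key not in self.cache:
            self.compile_count += 1   # a real JIT would compile here
            self.cache[key] = key
        return self.cache[key]

# Specialized on num_tokens: every new request size recompiles.
specialized = ToyJITKernel()
for n in (1, 8, 17):
    specialized(num_tokens=n, max_num_tokens=2048)
assert specialized.compile_count == 3

# With do_not_specialize=["num_tokens"]: one compilation covers all
# request sizes; max_num_tokens (constant for the engine lifetime)
# stays in the cache key.
relaxed = ToyJITKernel(do_not_specialize=["num_tokens"])
for n in (1, 8, 17):
    relaxed(num_tokens=n, max_num_tokens=2048)
assert relaxed.compile_count == 1
```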

I also add a small warmup in warmup_v1_slot_mapping_kernel() that calls compute_slot_mapping() directly before the JIT monitor activates. It temporarily uses block id 1 (block 0 is the null block), then clears the state in a finally block. This runs on all PP ranks because every rank calls compute_slot_mapping() during input preparation.
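The warmup shape can be sketched as follows. This is a minimal stand-in, not vLLM's real BlockTable: FakeBlockTable, BLOCK_SIZE, and the slot-mapping arithmetic here are illustrative. The structure that matters is writing a temporary non-null block id, exercising the kernel path once, and restoring state in finally so no residue leaks into real requests.

```python
# Sketch of the warmup idea with a hypothetical stand-in for BlockTable.
import numpy as np

BLOCK_SIZE = 16
NULL_BLOCK_ID = 0  # block 0 is the null block, so warmup uses block id 1

class FakeBlockTable:
    def __init__(self, max_num_reqs, max_blocks):
        self.block_table = np.zeros((max_num_reqs, max_blocks), dtype=np.int32)
        self.slot_mapping = np.zeros(max_num_reqs * BLOCK_SIZE, dtype=np.int64)
        self.compiled = False

    def compute_slot_mapping(self, req_indices, positions):
        self.compiled = True  # first call stands in for the Triton JIT compile
        block_ids = self.block_table[req_indices, positions // BLOCK_SIZE]
        n = len(positions)
        self.slot_mapping[:n] = block_ids * BLOCK_SIZE + positions % BLOCK_SIZE

def warmup_slot_mapping(bt):
    try:
        bt.block_table[0, 0] = 1   # temporary: any non-null block id
        bt.compute_slot_mapping(np.zeros(1, dtype=np.int32),
                                np.zeros(1, dtype=np.int64))
    finally:
        bt.block_table[0, 0] = NULL_BLOCK_ID  # always clear temporary state
        bt.slot_mapping[:1] = 0

bt = FakeBlockTable(max_num_reqs=4, max_blocks=8)
warmup_slot_mapping(bt)
assert bt.compiled                # the kernel path was exercised once
assert bt.block_table[0, 0] == 0  # warmup left no residue behind
```

The finally block matters because a failed warmup must not leave a fake block id in the table for the first real request to pick up.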

I did not add a synthetic execute_model() warmup. That would need model-specific dummy inputs and is not safe for all model types. This PR covers only the slot mapping kernel.

The V2 warmup path is not touched, nor is the V1 sampler warmup.

I checked open PRs; there is no existing PR for this issue.

Test Plan

.venv/bin/python -m pytest tests/v1/worker/test_gpu_model_runner.py -v

pre-commit run ruff-format --files \
  vllm/v1/worker/block_table.py \
  vllm/v1/worker/gpu/warmup.py \
  vllm/v1/worker/gpu_worker.py \
  tests/v1/worker/test_gpu_model_runner.py

pre-commit run ruff-check --files \
  vllm/v1/worker/block_table.py \
  vllm/v1/worker/gpu/warmup.py \
  vllm/v1/worker/gpu_worker.py \
  tests/v1/worker/test_gpu_model_runner.py

pre-commit run mypy-3.10 --files \
  vllm/v1/worker/block_table.py \
  vllm/v1/worker/gpu/warmup.py \
  vllm/v1/worker/gpu_worker.py \
  tests/v1/worker/test_gpu_model_runner.py \
  --hook-stage manual

git diff --check

Test Result

tests/v1/worker/test_gpu_model_runner.py: 34 passed, 16 warnings.

ruff format / ruff check: passed.

mypy-3.10: passed.

git diff --check: passed.

Local smoke on V1 runner with Qwen3-8B text-only: HTTP 200, no _compute_slot_mapping_kernel warning on first request.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR
  • The test plan, such as providing test command.
  • The test results

AI assistance was used (Codex, Claude).

… operand layout with WGMMA (vllm-project#42076)

Signed-off-by: kermit <ckeming@outlook.com>
@lesj0610 lesj0610 changed the title [V1] Warm up slot mapping before JIT monitor [Bugfix][V1] Warm up slot mapping before JIT monitor May 9, 2026
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
@lesj0610 lesj0610 force-pushed the lesj/v1-slot-mapping-jit-warmup-upstream-20260509 branch from cad2699 to 877e619 Compare May 9, 2026 13:24