[Bugfix][V1] Warm up slot mapping before JIT monitor #42165
Code Review
This pull request introduces a warmup mechanism for the V1 slot mapping kernel to ensure it is compiled before the JIT monitor is enabled. Key changes include the implementation of warmup_v1_slot_mapping_kernel, its integration into the GPUWorker warmup sequence, and a Triton JIT optimization to prevent specialization on num_tokens. New unit tests verify the warmup process and its error handling. I have no feedback to provide.
@ZJY0516 @qiching @tdoublep @vadiklyutiy Hi, this is a follow-up fix for #40137. I found that the V1 path still triggers a JIT warning on the first real request. The fix is small. You all reviewed #40137, so your feedback would be very helpful. Thanks.
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Purpose
#40137 added Triton JIT monitoring that activates after warmup finishes. But the V1 warmup path (`_dummy_run()`) never calls `BlockTable.compute_slot_mapping()`. So when the first real request comes in, `_compute_slot_mapping_kernel` compiles while the JIT monitor is already active, and users see an unexpected compilation warning during normal inference.
The problem has two sides. First, V1 warmup simply does not exercise the slot mapping path. Second, `_compute_slot_mapping_kernel` was specialized on the `num_tokens` parameter, meaning that even if we warm up with one token count, a different request size triggers recompilation again.
The fix also has two parts.
I add `do_not_specialize=["num_tokens"]` to the kernel so one compilation covers all request sizes. `max_num_tokens` stays specialized: it is constant for the engine lifetime, and Triton can optimize the padding loop with it.
I also add a small warmup in `warmup_v1_slot_mapping_kernel()` that calls `compute_slot_mapping()` directly before the JIT monitor activates. It temporarily uses block id 1 (block 0 is the null block), then clears it in a `finally` block. This runs on all PP ranks because every rank calls `compute_slot_mapping()` during input preparation.
I did not add a synthetic `execute_model()` warmup. That needs model-specific dummy inputs and is not safe for all model types. This PR only covers the slot mapping kernel.
The V2 warmup path is not touched. The V1 sampler warmup is not touched.
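For reviewers less familiar with specialization, the effect of `do_not_specialize` can be illustrated with a toy cache in plain Python. This is only a sketch, not Triton's real cache logic: `ToyJit`, `padding_len`, and `count_compiles` are hypothetical names, and the cache key assumes Triton's documented habit of specializing integer arguments on properties such as "equals 1" and "divisible by 16".

```python
def _int_signature(value: int) -> tuple:
    # Assumed specialization properties for an integer argument.
    return (value == 1, value % 16 == 0)


class ToyJit:
    """Hypothetical stand-in for a specializing JIT; counts 'compilations'."""

    def __init__(self, fn, do_not_specialize=()):
        self.fn = fn
        self.do_not_specialize = frozenset(do_not_specialize)
        self.cache = {}
        self.compile_count = 0

    def __call__(self, **int_args):
        # Specialized arguments contribute their properties to the key;
        # non-specialized ones contribute only their type.
        key = tuple(
            (name, "int" if name in self.do_not_specialize else _int_signature(value))
            for name, value in sorted(int_args.items())
        )
        if key not in self.cache:
            self.compile_count += 1  # stands in for a JIT compile
            self.cache[key] = self.fn
        return self.cache[key](**int_args)


def padding_len(num_tokens: int, max_num_tokens: int) -> int:
    # Placeholder body; the real kernel computes slot mappings.
    return max_num_tokens - num_tokens


def count_compiles(do_not_specialize) -> int:
    kernel = ToyJit(padding_len, do_not_specialize)
    for n in (1, 7, 16, 33):  # varying request sizes
        kernel(num_tokens=n, max_num_tokens=8192)
    return kernel.compile_count
```

In this toy model, `count_compiles(())` compiles once per distinct property combination of `num_tokens`, while `count_compiles(["num_tokens"])` compiles once in total. `max_num_tokens` never adds variants because its value does not change.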
Checked open PRs, no existing PR for this issue.
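The warmup pattern described above (use a temporary non-null block id, call the slot mapping once, restore state in `finally`) can be sketched roughly as follows. This is not the code in this PR: `ToyBlockTable`, `warmup_slot_mapping`, and the `monitor_active` flag are hypothetical, and the slot formula is the usual paged-KV-cache layout, assumed here for illustration.

```python
class ToyBlockTable:
    """Hypothetical block table; the first compute call stands in for a JIT compile."""

    def __init__(self, block_size: int, max_blocks: int):
        self.block_size = block_size
        self.row = [0] * max_blocks  # block id 0 plays the null block
        self.kernel_compiled = False
        self.monitor_active = False
        self.warnings = []

    def compute_slot_mapping(self, positions):
        if not self.kernel_compiled:
            if self.monitor_active:
                self.warnings.append("unexpected JIT compile")
            self.kernel_compiled = True
        # Assumed layout: slot = block_id * block_size + offset within block.
        return [
            self.row[p // self.block_size] * self.block_size + p % self.block_size
            for p in positions
        ]


def warmup_slot_mapping(table: ToyBlockTable) -> None:
    saved = table.row[0]
    table.row[0] = 1  # temporary non-null block id
    try:
        table.compute_slot_mapping([0])
    finally:
        table.row[0] = saved  # always restore, even if the call raises


def first_request_warnings(do_warmup: bool) -> list:
    table = ToyBlockTable(block_size=16, max_blocks=4)
    if do_warmup:
        warmup_slot_mapping(table)
    table.monitor_active = True  # monitor turns on after warmup
    table.row[0] = 3             # a real request gets a real block
    table.compute_slot_mapping([0, 5])
    return table.warnings
```

With warmup, the first request in this toy model produces no warning; without it, the compile happens while the monitor is active and one warning is recorded.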
Test Plan
Test Result
tests/v1/worker/test_gpu_model_runner.py: 34 passed, 16 warnings.
ruff format / ruff check: passed.
mypy-3.10: passed.
git diff --check: passed.
Local smoke on V1 runner with Qwen3-8B text-only: HTTP 200, no `_compute_slot_mapping_kernel` warning on first request.
Essential Elements of an Effective PR Description Checklist
AI assistance was used (Codex, Claude).