[V1] Warm up slot mapping before JIT monitor #60
Closed
lesj0610 wants to merge 1 commit into
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Purpose
The JIT monitoring change added Triton JIT monitoring after warmup. In V1, the current warmup path still does not cover the real request input preparation path for slot mapping.
So the first real request can compile `_compute_slot_mapping_kernel` after the JIT monitor is already active, and users see a warning during inference.
The root cause has two parts:
- `_dummy_run()` does not call `BlockTable.compute_slot_mapping()`.
- `_compute_slot_mapping_kernel` was specialized on `num_tokens`, so compiling for one token count does not cover another request size.
This PR fixes that V1 slot mapping path.
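To make the second root cause concrete, here is a toy model of a specializing JIT cache. It is not Triton's real cache: `ToyJit`, `specialization_key`, and `simulate` are hypothetical names, and the key assumes integer arguments are specialized on being equal to 1 and on divisibility by 16. It only illustrates why a warmup with one token count can miss later request sizes.

```python
# Toy model only: NOT Triton's real cache, just the idea behind it.
import warnings


def specialization_key(value: int) -> tuple[bool, bool]:
    """Integer properties a specializing JIT may bake into a compiled kernel."""
    return (value == 1, value % 16 == 0)


class ToyJit:
    def __init__(self, do_not_specialize: frozenset[str] = frozenset()):
        self.do_not_specialize = do_not_specialize
        self.cache: set[tuple] = set()
        self.monitor_active = False
        self.compile_count = 0

    def launch(self, **int_args: int) -> None:
        # Arguments listed in do_not_specialize contribute nothing to the key.
        key = tuple(
            (name, None if name in self.do_not_specialize else specialization_key(v))
            for name, v in sorted(int_args.items())
        )
        if key not in self.cache:
            if self.monitor_active:
                warnings.warn(f"JIT compile during inference: {key}")
            self.cache.add(key)
            self.compile_count += 1


def simulate(do_not_specialize=()) -> tuple[int, int]:
    """Return (number of compiles, number of warnings after warmup)."""
    jit = ToyJit(frozenset(do_not_specialize))
    jit.launch(num_tokens=16, max_num_tokens=8192)  # warmup with one token count
    jit.monitor_active = True  # the monitor is enabled only after warmup
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        for n in (1, 7, 32):  # real requests with different sizes
            jit.launch(num_tokens=n, max_num_tokens=8192)
    return jit.compile_count, len(caught)
```

In this model, `simulate()` compiles again for request sizes 1 and 7 after the monitor is on, while `simulate({"num_tokens"})` compiles once during warmup and never warns.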
I added a small V1 warmup that calls `BlockTable.compute_slot_mapping()` directly before enabling the JIT monitor. It uses block id 1 because block 0 is the null block, then clears the temporary block table row in `finally`.
I also mark `num_tokens` as `do_not_specialize`, so different request token counts reuse the same compiled kernel. `max_num_tokens` stays specialized because it is fixed for the engine.
This does not run a synthetic `execute_model()` warmup. That path is model-specific and can require model-specific dummy inputs, so this PR keeps warmup limited to slot mapping only.
V2 warmup is not changed. The existing V1 sampler warmup is not changed.
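The warmup pattern can be sketched as follows. This is a minimal illustration, not the code in this PR: `ToyBlockTable`, `warm_up_slot_mapping`, and their signatures are hypothetical stand-ins, and the slot formula (block id times block size plus the offset inside the block) is the usual paged-KV-cache convention, assumed here.

```python
# Hypothetical stand-in for the runner's block table; names and signatures are
# illustrative and do not match vLLM's real classes.
NULL_BLOCK_ID = 0


class ToyBlockTable:
    def __init__(self, max_reqs: int, max_blocks_per_req: int, block_size: int):
        self.block_size = block_size
        self.rows = [[NULL_BLOCK_ID] * max_blocks_per_req for _ in range(max_reqs)]
        self.calls = 0

    def compute_slot_mapping(self, req_index: int, positions: list[int]) -> list[int]:
        # slot = physical block id * block_size + offset inside the block
        self.calls += 1
        row = self.rows[req_index]
        return [
            row[pos // self.block_size] * self.block_size + pos % self.block_size
            for pos in positions
        ]


def warm_up_slot_mapping(table: ToyBlockTable) -> list[int]:
    """Run the slot mapping path once with a temporary non-null block."""
    req_index = 0
    table.rows[req_index][0] = 1  # block 0 is the null block, so use block 1
    try:
        return table.compute_slot_mapping(req_index, positions=[0])
    finally:
        # Always clear the temporary row, even if the call above raises.
        table.rows[req_index][0] = NULL_BLOCK_ID
```

Putting the cleanup in `finally` means a failed warmup cannot leave a stale block id in the table for the first real request.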
I checked open PRs: there is no existing PR for the V1 slot mapping JIT warning after the JIT monitoring change.
Test Plan
Test Result
tests/v1/worker/test_gpu_model_runner.py: 42 passed, 16 warnings.
ruff format / ruff check: passed.
mypy-3.10: passed.
git diff --check: passed.
Local OAI smoke on V1 runner: `_compute_slot_mapping_kernel` warning on first request.
AI assistance was used (Codex, Claude).