[Feature] Add Triton kernel JIT compilation monitor for inference #40137
Conversation
Code Review
This pull request introduces a kernel JIT monitor designed to detect and log unexpected Triton JIT compilations and autotuning events during inference, which can cause latency spikes. The monitor is integrated into the GPU worker to activate after model warmup. Review feedback identifies a critical issue where the JIT post-compile hook must return the compiled kernel object by calling the provided compile closure to ensure compatibility with Triton's API and prevent execution failures. Corresponding updates to the unit tests are also required to verify this return behavior.
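To make that review point concrete, here is a minimal sketch of the requested pattern; the hook name, its arguments, and the logger are illustrative assumptions rather than Triton's or the PR's actual signatures. The key point from the feedback is that the hook must invoke the provided compile closure and return its result instead of returning None.

```python
import logging

logger = logging.getLogger("vllm.kernel_jit_monitor")

def jit_post_compile_hook(kernel_name: str, compile_fn):
    # Illustrative hook body: record the unexpected compilation, but still
    # perform the real compilation and hand the compiled kernel object back
    # to the caller, as the review feedback requires.
    logger.warning("Unexpected Triton JIT compilation after warmup: %s",
                   kernel_name)
    return compile_fn()
```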
@claude review
Hi @arpera, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Fixes the check-torch-cuda-call pre-commit hook failure in tests/compile/test_kernel_jit_monitor.py. Per RFC vllm-project#30679, use the torch.accelerator API instead of torch.cuda. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> Made-with: Cursor
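For context on that fix, here is a hedged sketch of the kind of substitution involved; the exact call used in the test is an assumption, and torch.accelerator is the device-agnostic API available in recent PyTorch releases.

```python
import torch

# Before: a CUDA-specific call of the kind flagged by the
# check-torch-cuda-call hook (shown here only as a comment).
# torch.cuda.synchronize()

# After: the device-agnostic equivalent via the torch.accelerator API.
if torch.accelerator.is_available():
    torch.accelerator.synchronize()
```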
Head branch was pushed to by a user without write access
tdoublep left a comment
I like this idea but isn't JIT compilation during inference quite hard to avoid in some cases? I worry that enabling this by default may end up printing a lot of warnings.
Probably, but if we care about high performance (and we definitely do), we at least need visibility into such issues so that the already significant performance gap between cold and warm vLLM starts does not grow even further.
Yes, that is a real concern. As a mitigation, I could log only one message per distinct kernel by keeping a set of kernels that have already been reported, as sketched below. Would that be better from your point of view?
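A minimal sketch of that deduplication idea (the class and method names are hypothetical, not the PR's actual code): keep a set of kernel names that have already been reported and emit the warning only on the first sighting.

```python
import logging

logger = logging.getLogger("vllm.kernel_jit_monitor")

class OncePerKernelWarner:
    """Log at most one warning per distinct kernel name."""

    def __init__(self) -> None:
        self._reported: set[str] = set()

    def warn_once(self, kernel_name: str) -> None:
        # Subsequent JIT-compile events for an already-reported kernel are
        # silently ignored to avoid flooding the log.
        if kernel_name in self._reported:
            return
        self._reported.add(kernel_name)
        logger.warning("Unexpected Triton JIT compilation: %s", kernel_name)
```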
@arpera hey, FYI, vllm has …
Purpose
Enables warnings for Triton JIT compilation and autotuning events during inference, on by default. After warmup completes, any such event is logged as a WARNING, letting developers quickly spot warmup/inference path divergences that cause latency spikes.
Recent cases like #37338 and #39169 were found only after time-consuming investigation of 1st-vs-2nd benchmark performance gaps. This monitor makes such issues immediately visible in server logs, saving developer time.
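As a rough illustration of the intended flow (the class name, the arm() method, and the worker call site are assumptions, and the actual registration of a Triton compilation hook is not shown): the monitor stays silent until the worker arms it after warmup, and from then on every compilation event it observes is logged as a WARNING.

```python
import logging

logger = logging.getLogger("vllm.kernel_jit_monitor")

class KernelJITMonitor:
    """Sketch of a monitor that only reports events observed after warmup."""

    def __init__(self) -> None:
        self._armed = False

    def arm(self) -> None:
        # Called once model warmup (and any graph capture) has finished;
        # anything compiled from this point on is considered unexpected.
        self._armed = True

    def on_compile_event(self, kernel_name: str) -> None:
        # Invoked from a Triton compilation/autotuning hook (not shown here).
        if not self._armed:
            return
        logger.warning(
            "Triton kernel %s was JIT-compiled during inference; "
            "this may cause a latency spike.", kernel_name)

# Hypothetical worker-side usage after warmup completes:
monitor = KernelJITMonitor()
# ... run model warmup ...
monitor.arm()
```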
Test Result
1) Unit tests — 13 passed:
2) E2E test — server + benchmark on 8×B200:
Server:
Benchmark (run twice):
First benchmark — monitor detected 4 Triton kernels compiled during inference:
- _zero_kv_blocks_kernel (KV cache block zeroing)
- _compute_slot_mapping_kernel (KV cache slot mapping)
- _copy_page_indices_kernel (FlashInfer page index copy)
- _causal_conv1d_fwd_kernel (Mamba causal conv1d)
Second benchmark — no warnings (kernels already cached in memory).
This confirms a real warmup/inference path divergence that should be addressed in a follow-up PR.