[Feature] Add Triton kernel JIT compilation monitor for inference #40137
Conversation
Code Review
This pull request introduces a kernel JIT monitor designed to detect and log unexpected Triton JIT compilations and autotuning events during inference, which can cause latency spikes. The monitor is integrated into the GPU worker to activate after model warmup. Review feedback identifies a critical issue where the JIT post-compile hook must return the compiled kernel object by calling the provided compile closure to ensure compatibility with Triton's API and prevent execution failures. Corresponding updates to the unit tests are also required to verify this return behavior.
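To make that review point concrete, here is a minimal sketch of the requested pattern; the hook name, its arguments, and the logger are illustrative assumptions rather than Triton's or the PR's actual signatures. The key point from the feedback is that the hook must invoke the provided compile closure and return its result instead of returning None.

```python
import logging

logger = logging.getLogger("vllm.kernel_jit_monitor")

def jit_post_compile_hook(kernel_name: str, compile_fn):
    # Illustrative hook body: record the unexpected compilation, but still
    # perform the real compilation and hand the compiled kernel object back
    # to the caller, as the review feedback requires.
    logger.warning("Unexpected Triton JIT compilation after warmup: %s",
                   kernel_name)
    return compile_fn()
```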
@claude review
Hi @arpera, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Fixes the check-torch-cuda-call pre-commit hook failure in tests/compile/test_kernel_jit_monitor.py. Per RFC vllm-project#30679, use the torch.accelerator API instead of torch.cuda. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> Made-with: Cursor
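For context on that fix, here is a hedged sketch of the kind of substitution involved; the exact call used in the test is an assumption, and torch.accelerator is the device-agnostic API available in recent PyTorch releases.

```python
import torch

# Before: a CUDA-specific call of the kind flagged by the
# check-torch-cuda-call hook (shown here only as a comment).
# torch.cuda.synchronize()

# After: the device-agnostic equivalent via the torch.accelerator API.
if torch.accelerator.is_available():
    torch.accelerator.synchronize()
```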
Head branch was pushed to by a user without write access
tdoublep left a comment
I like this idea but isn't JIT compilation during inference quite hard to avoid in some cases? I worry that enabling this by default may end up printing a lot of warnings.
Probably, but if we care about high performance (and we definitely do), we at least need visibility into such issues so that the already significant performance gap between cold and warm vLLM starts does not grow even further.
Yes, that is a real concern. As a mitigation, I could log only one message per distinct kernel by keeping a set of kernels that have already been reported, as sketched below. Would that be better from your point of view?
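A minimal sketch of that deduplication idea (the class and method names are hypothetical, not the PR's actual code): keep a set of kernel names that have already been reported and emit the warning only on the first sighting.

```python
import logging

logger = logging.getLogger("vllm.kernel_jit_monitor")

class OncePerKernelWarner:
    """Log at most one warning per distinct kernel name."""

    def __init__(self) -> None:
        self._reported: set[str] = set()

    def warn_once(self, kernel_name: str) -> None:
        # Subsequent JIT-compile events for an already-reported kernel are
        # silently ignored to avoid flooding the log.
        if kernel_name in self._reported:
            return
        self._reported.add(kernel_name)
        logger.warning("Unexpected Triton JIT compilation: %s", kernel_name)
```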
@arpera hey, FYI, vllm has …
Purpose
Enables warnings for Triton JIT compilation and autotuning events during inference, on by default. After warmup completes, any such event is logged as a WARNING, letting developers quickly spot warmup/inference path divergences that cause latency spikes.
Recent cases like #37338 and #39169 were found only after time-consuming investigation of 1st-vs-2nd benchmark performance gaps. This monitor makes such issues immediately visible in server logs, saving developer time.
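As a rough illustration of the intended flow (the class name, the arm() method, and the worker call site are assumptions, and the actual registration of a Triton compilation hook is not shown): the monitor stays silent until the worker arms it after warmup, and from then on every compilation event it observes is logged as a WARNING.

```python
import logging

logger = logging.getLogger("vllm.kernel_jit_monitor")

class KernelJITMonitor:
    """Sketch of a monitor that only reports events observed after warmup."""

    def __init__(self) -> None:
        self._armed = False

    def arm(self) -> None:
        # Called once model warmup (and any graph capture) has finished;
        # anything compiled from this point on is considered unexpected.
        self._armed = True

    def on_compile_event(self, kernel_name: str) -> None:
        # Invoked from a Triton compilation/autotuning hook (not shown here).
        if not self._armed:
            return
        logger.warning(
            "Triton kernel %s was JIT-compiled during inference; "
            "this may cause a latency spike.", kernel_name)

# Hypothetical worker-side usage after warmup completes:
monitor = KernelJITMonitor()
# ... run model warmup ...
monitor.arm()
```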
Test Result
1) Unit tests — 13 passed:
2) E2E test — server + benchmark on 8×B200:
Server:
Benchmark (run twice):
First benchmark — monitor detected 4 Triton kernels compiled during inference:
- _zero_kv_blocks_kernel (KV cache block zeroing)
- _compute_slot_mapping_kernel (KV cache slot mapping)
- _copy_page_indices_kernel (FlashInfer page index copy)
- _causal_conv1d_fwd_kernel (Mamba causal conv1d)
Second benchmark — no warnings (kernels already cached in memory).
This confirms a real warmup/inference path divergence that should be addressed in a follow-up PR.