
[Feature] Add Triton kernel JIT compilation monitor for inference#40137

Merged
vadiklyutiy merged 16 commits into
vllm-project:mainfrom
arpera:jit-monitor
May 5, 2026
Conversation

@arpera
Contributor

@arpera arpera commented Apr 17, 2026

Purpose

Enables monitoring of Triton JIT compilation and autotuning during inference by default. After warmup completes, any compilation or autotuning event is logged as a WARNING, letting developers quickly spot warmup/inference path divergences that cause latency spikes.

Recent cases like #37338 and #39169 were found only after time-consuming investigation of first-vs-second benchmark run performance gaps. This monitor makes such issues immediately visible in server logs, saving developer time.
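The mechanism can be sketched as follows. This is a minimal illustration under assumed names, not the actual `vllm/triton_utils/jit_monitor.py` implementation; in particular, how the callback is wired into Triton's JIT hooks is omitted here.

```python
import logging

logger = logging.getLogger("vllm.jit_monitor")


class KernelJITMonitor:
    """Log Triton JIT compile/autotune events that happen after warmup.

    Events seen before ``mark_warmup_complete`` is called are treated as
    expected warmup compilations and ignored; anything later indicates a
    warmup/inference path divergence.
    """

    def __init__(self) -> None:
        self._warmup_done = False

    def mark_warmup_complete(self) -> None:
        # Called by the worker once model warmup has finished.
        self._warmup_done = True

    def on_compile(self, kernel_name: str) -> None:
        # Invoked from a Triton JIT hook whenever a kernel is compiled.
        if self._warmup_done:
            logger.warning(
                "Triton kernel %s was JIT-compiled during inference; "
                "the warmup path likely diverges from the inference path.",
                kernel_name,
            )
```

A worker would call `mark_warmup_complete()` right after its warmup pass, so that only post-warmup compilations surface as warnings.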

Test Result

1) Unit tests — 13 passed:

python -m pytest tests/compile/test_kernel_jit_monitor.py -v

2) E2E test — server + benchmark on 8×B200:

Server:

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --port 8000 -tp 1 -pp 1 -dp 8 \
  --enable-expert-parallel \
  --language-model-only \
  --reasoning-parser qwen3 \
  --stream-interval 100

Benchmark (run twice):

vllm bench serve --backend vllm \
  --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --port 8000 --endpoint /v1/completions \
  --dataset-name random --random-input-len 8192 \
  --random-output-len 1 --max-concurrency 128 \
  --num-prompts 1024 --ignore-eos --temperature 0.0

First benchmark — monitor detected 4 Triton kernels compiled during inference:

  • _zero_kv_blocks_kernel (KV cache block zeroing)
  • _compute_slot_mapping_kernel (KV cache slot mapping)
  • _copy_page_indices_kernel (FlashInfer page index copy)
  • _causal_conv1d_fwd_kernel (Mamba causal conv1d)

Second benchmark — no warnings (kernels already cached in memory).

This confirms a real warmup/inference path divergence that should be addressed in a follow-up PR.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork, so automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the v1 label Apr 17, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a kernel JIT monitor designed to detect and log unexpected Triton JIT compilations and autotuning events during inference, which can cause latency spikes. The monitor is integrated into the GPU worker to activate after model warmup. Review feedback identifies a critical issue where the JIT post-compile hook must return the compiled kernel object by calling the provided compile closure to ensure compatibility with Triton's API and prevent execution failures. Corresponding updates to the unit tests are also required to verify this return behavior.
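The point about the post-compile hook can be illustrated with a sketch. The hook signature below is an assumption for illustration, not the actual Triton API: if a monitoring hook wraps the compile closure, it must call that closure and return its result, otherwise callers receive None instead of a compiled kernel.

```python
def make_monitoring_hook(record, inner_compile):
    """Wrap a compile closure so each compilation is recorded.

    ``inner_compile`` stands in for the closure Triton hands to the
    post-compile hook; the wrapper must call it and return its result
    so the caller still receives the compiled kernel object.
    """

    def hook(*args, **kwargs):
        # Record the kernel name (hypothetical keyword) for the monitor.
        record.append(kwargs.get("name", "<unknown>"))
        # Crucial: return the compiled kernel, not None.
        return inner_compile(*args, **kwargs)

    return hook
```

The unit tests can then assert both that the event was recorded and that the hook's return value is the compiled kernel object.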

Comment thread vllm/triton_utils/jit_monitor.py
Comment thread tests/test_jit_monitor.py
@arpera
Contributor Author

arpera commented Apr 17, 2026

@ZJY0516, this is a solution for tracing the warmup-vs-inference path divergence of Triton autotuning discussed in a recent thread. Please have a look!

Also, this patch has already identified a path divergence for Qwen3.5; I will look into it.

@arpera
Contributor Author

arpera commented Apr 17, 2026

@claude review

@mergify
Contributor

mergify Bot commented Apr 17, 2026

Hi @arpera, the pre-commit checks have failed. Please run:

uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

arpera added 2 commits April 20, 2026 16:52
Fixes the check-torch-cuda-call pre-commit hook failure in
tests/compile/test_kernel_jit_monitor.py. Per RFC vllm-project#30679, use the
torch.accelerator API instead of torch.cuda.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Made-with: Cursor
Comment thread vllm/triton_utils/jit_monitor.py
Comment thread vllm/triton_utils/jit_monitor.py Outdated
…on_utils

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
@mergify mergify Bot added the ci/build label Apr 27, 2026
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Comment thread vllm/v1/worker/gpu_worker.py Outdated
Comment thread vllm/triton_utils/jit_monitor.py
Member

@ZJY0516 ZJY0516 left a comment


LGTM

Contributor

@qiching qiching left a comment


LGTM

@vadiklyutiy vadiklyutiy added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 29, 2026
@vadiklyutiy vadiklyutiy enabled auto-merge (squash) April 29, 2026 15:42
auto-merge was automatically disabled April 30, 2026 09:46

Head branch was pushed to by a user without write access

@mergify
Contributor

mergify Bot commented Apr 30, 2026

Hi @arpera, the pre-commit checks have failed. Please run:

uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Member

@tdoublep tdoublep left a comment


I like this idea but isn't JIT compilation during inference quite hard to avoid in some cases? I worry that enabling this by default may end up printing a lot of warnings.

Comment thread vllm/triton_utils/jit_monitor.py Outdated
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
@arpera
Contributor Author

arpera commented May 1, 2026

I like this idea but isn't JIT compilation during inference quite hard to avoid in some cases?

Probably, but still, if we care about high performance (and we definitely do), we at the very least need to be able to see such issues so we don’t allow the already significant performance gap between cold and warm vLLM starts to grow even further.

I worry that enabling this by default may end up printing a lot of warnings.

Yes, that is a real concern. At the very least, I could log only one message per distinct kernel by keeping an additional set of already-reported kernels. Would that be better from your point of view?
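The deduplication idea can be sketched as follows (hypothetical names, shown only to make the suggestion concrete):

```python
import logging

logger = logging.getLogger("vllm.jit_monitor")


class DedupJITWarnings:
    """Emit at most one warning per distinct kernel name."""

    def __init__(self) -> None:
        # Kernel names that have already been reported.
        self._reported: set[str] = set()

    def warn(self, kernel_name: str) -> None:
        if kernel_name in self._reported:
            return
        self._reported.add(kernel_name)
        logger.warning(
            "Triton kernel %s was JIT-compiled during inference.",
            kernel_name,
        )
```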

@ZJY0516
Member

ZJY0516 commented May 1, 2026

@arpera hey, FYI, vllm has warning_once

Comment thread vllm/triton_utils/jit_monitor.py Outdated
arpera and others added 3 commits May 1, 2026 23:19
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
@vadiklyutiy vadiklyutiy merged commit 8b9ea2f into vllm-project:main May 5, 2026
62 checks passed
chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request May 6, 2026
…lm-project#40137)

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request May 7, 2026
…lm-project#40137)

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
ikaadil pushed a commit to ikaadil/vllm that referenced this pull request May 7, 2026
…lm-project#40137)

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
libinta pushed a commit to libinta/vllm that referenced this pull request May 8, 2026
…lm-project#40137)

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Signed-off-by: Libin Tang <libin.tang@intel.com>

Labels

ci/build ready ONLY add when PR is ready to merge/full CI is needed v1
