[Bugfix][V1][MoE] Warm up WNA16 MoE Triton kernels #42193
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Code Review
This pull request introduces a warmup mechanism for WNA16 MoE Triton kernels to ensure they are reliably exercised during initialization. It adds a new fused_moe_warmup.py module that calculates appropriate M values for dummy runs and integrates this into the kernel_warmup process. A review comment pointed out that the expert mapping logic should account for ROCm-specific binary masks to avoid including non-local experts during warmup.
@ZJY0516 @qiching @tdoublep @vadiklyutiy Hi again, me from #42165. Same area but different kernel this time: WNA16 fused MoE. Would appreciate your eyes on this since you know the warmup/monitor context from #40137. Thanks.
What this fixes
In V1, WNA16 fused MoE warmup has a gap. The startup dummy run uses a small batch, so `should_moe_wna16_use_cuda` picks the CUDA path and the Triton kernel never gets compiled. When the first real request arrives with a larger token count, `fused_moe_kernel_gptq_awq` compiles on the fly during inference.

What I changed

First, `fused_moe_kernel_gptq_awq` now has `do_not_specialize` for `EM` and `num_valid_tokens`. Otherwise every different token count can trigger a new compilation even after warmup.

Second part is the actual warmup. The new module `fused_moe_warmup.py` scans the model for `FusedMoE` layers that use `MoeWNA16Method`, figures out which M values will hit the Triton path based on the CUDA/Triton dispatch threshold, and calls `quant_method.apply()` with synthetic inputs.

One thing I had to be careful about: WNA16 dispatches two GEMMs with different `top_k`. Gate/up uses the model's `top_k`, but the down projection uses `top_k=1`, so the dispatch threshold is different for each. Both need separate warmup M values.

For expert parallelism, only local expert IDs from `expert_map` are used. Layers with the same weight shape and quant config get deduped so we don't repeat the same compilation.

No full model forward pass, just direct kernel-level warmup.
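The per-GEMM M selection can be sketched roughly as follows. This is a simplified stand-in, not vLLM's actual code: the real dispatch predicate is `should_moe_wna16_use_cuda` and takes more parameters, and `THRESHOLD` and `uses_cuda_path` here are illustrative assumptions.

```python
# Hypothetical sketch of picking warmup M values for the two WNA16 GEMMs.
# THRESHOLD and uses_cuda_path() are illustrative stand-ins for the real
# CUDA/Triton dispatch decision (should_moe_wna16_use_cuda in vLLM).

THRESHOLD = 256  # illustrative: CUDA path only handles small workloads


def uses_cuda_path(num_tokens: int, top_k: int) -> bool:
    """Stand-in for the CUDA/Triton dispatch decision."""
    return num_tokens * top_k <= THRESHOLD


def warmup_m_values(top_k: int) -> list[int]:
    """Smallest M per GEMM that falls through to the Triton kernel.

    The gate/up projection runs with the model's top_k, the down
    projection with top_k=1, so each GEMM needs its own warmup M.
    """
    ms = []
    for k in (top_k, 1):  # gate/up GEMM, then down GEMM
        m = 1
        while uses_cuda_path(m, k):
            m *= 2
        ms.append(m)
    return ms


print(warmup_m_values(top_k=8))  # [64, 512] with the stand-in threshold
```

With the stand-in threshold, the down projection (`top_k=1`) needs a much larger M than gate/up before it leaves the CUDA path, which is why a single warmup batch size can't cover both GEMMs.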
Checked open PRs, didn't find existing one for this.
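The expert-parallel filtering and dedup described above can be sketched like this. Names (`local_expert_ids`, `warmup_key`) and the `-1` sentinel convention are illustrative assumptions, not vLLM's exact API; note the review comment above points out that on ROCm `expert_map` can instead be a binary mask, which this sketch does not handle.

```python
# Hypothetical sketch of EP filtering and layer dedup for warmup.
# Assumes expert_map[global_id] is the local index, or -1 for experts
# owned by another rank (vLLM stores this as a tensor; -1 sentinel
# assumed here, and ROCm binary masks are not covered).


def local_expert_ids(expert_map: list[int]) -> list[int]:
    """Global expert IDs owned by this rank."""
    return [g for g, local in enumerate(expert_map) if local != -1]


def warmup_key(weight_shape: tuple[int, ...], bits: int, group_size: int):
    """Layers sharing this key would compile the same kernel,
    so each key is warmed up only once."""
    return (weight_shape, bits, group_size)


# Example: a rank owning experts 4..7 of an 8-expert model.
emap = [-1, -1, -1, -1, 0, 1, 2, 3]
print(local_expert_ids(emap))  # [4, 5, 6, 7]
```

Deduping on `(weight shape, quant config)` rather than on layer identity is what keeps the warmup cheap for deep models, since most MoE layers in a model share one shape.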
Test Plan
Test Result
pytest: 6 passed, 16 warnings.
ruff-format, ruff-check, mypy: all passed.
Pre-commit hooks passed on commit.
AI assistance: Codex, Claude, Gemini.