[Perf] Gemma4: output int64 from fused routing kernel to avoid redundant dtype copy by yintong-lu · Pull Request #40565 · vllm-project/vllm

yintong-lu · 2026-04-22T02:17:19Z

Summary:
This PR is a further optimization on top of PR [https://github.com//pull/39083].
The Gemma4 fused Triton routing kernel (#39083) outputs topk_ids as int32, but all downstream consumers require int64:

remap_hidden_states C++ kernel uses int64 indexing
cpu_fused_moe.py: topk_ids.to(torch.int64) for scatter
gpt_oss_triton_kernels_moe.py: topk_ids_raw.to(torch.long)
compressed_tensors_moe: topk_ids.to(torch.long)

This mismatch causes ~30 unnecessary aten::copy_ calls per forward pass (one int32→int64 conversion per MoE layer), which becomes a measurable bottleneck on bandwidth-constrained platforms.
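A minimal sketch of why the int32 output costs an extra copy, assuming plain PyTorch on CPU. The tensor shape and values are made up for illustration and are not taken from the kernel. It relies only on the documented behavior that `Tensor.to` returns the same tensor when dtype and device already match.

```python
import torch

# Hypothetical routing output: 4 tokens, top-2 expert ids each (illustrative values).
topk_ids_i32 = torch.tensor([[0, 3], [1, 2], [5, 7], [4, 6]], dtype=torch.int32)

# int32 -> int64 must allocate a new tensor and copy every element into it.
copied = topk_ids_i32.to(torch.int64)
int32_needs_copy = copied.data_ptr() != topk_ids_i32.data_ptr()

# int64 -> int64 is a no-op: .to() hands back the very same tensor object.
topk_ids_i64 = copied
same = topk_ids_i64.to(torch.int64)
int64_is_noop = same is topk_ids_i64

print(f"int32 input copied: {int32_needs_copy}, int64 input reused: {int64_is_noop}")
```

If the routing kernel writes int64 directly, each downstream `.to(torch.int64)` or `.to(torch.long)` call falls into the second case.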

Performance (Intel XPU × 4, TP=4, Gemma4-26B-A4B-it)

Relative to PR #39083 baseline (commit 45232a454):

| Scenario | `copy_` calls | `copy_` XPU% | Throughput | TPOT |
|---|---|---|---|---|
| Prefill c=1 | 176 → 116 (−60) | 5.67% → 1.18% | −4.9% (noise) | n/a |
| Prefill c=16 | 174 → 114 (−60) | n/a → 1.19% | +4.3% | n/a |
| Decode c=1 | 1,706 → 1,106 (−600) | n/a → 1.99% | +1.2% | −1.6% |
| Decode c=16 | 1,011 → 651 (−360) | n/a → 1.38% | +5.1% | −5.8% |
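The `copy_` deltas in the table can be checked with a few lines of arithmetic. The sketch below transcribes the before/after counts and treats the "~30 per forward pass" figure from the summary as exactly 30, which is an assumption made here. The `removed_over_30` ratio is only an inference about how many forward passes each profile might have covered; the PR does not state it.

```python
# (scenario, copy_ calls before, copy_ calls after), transcribed from the table.
rows = [
    ("Prefill c=1", 176, 116),
    ("Prefill c=16", 174, 114),
    ("Decode c=1", 1706, 1106),
    ("Decode c=16", 1011, 651),
]
CONVERSIONS_PER_PASS = 30  # assumption: the summary's "~30" taken as exact

summary = {}
for name, before, after in rows:
    removed = before - after
    summary[name] = {
        "removed": removed,
        "reduction_pct": 100 * removed / before,
        "multiple_of_30": removed % CONVERSIONS_PER_PASS == 0,
        "removed_over_30": removed // CONVERSIONS_PER_PASS,
    }

for name, s in summary.items():
    print(f"{name}: -{s['removed']} calls ({s['reduction_pct']:.1f}% fewer), "
          f"multiple of 30: {s['multiple_of_30']}")
```

Every delta is an integer multiple of 30, which is consistent with one removed conversion per MoE layer per forward pass, and each scenario loses roughly a third of its `copy_` calls.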

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

yintong-lu · 2026-04-22T02:18:37Z

@claude review

gemini-code-assist

Code Review

This pull request modifies the routing logic in custom_routing_router.py to handle topk_ids dtypes and updates the Gemma4 model to use int64 for routing IDs. A review comment suggests refining the dtype conversion in custom_routing_router.py to ensure that the requested indices_type is strictly respected, which prevents redundant memory copies and potential type inconsistencies in downstream functions.

xinyu-intel · 2026-04-22T03:14:43Z

+        if topk_ids.dtype != target_dtype and topk_ids.dtype != torch.int64:
+            topk_ids = topk_ids.to(target_dtype)
+
+        return topk_weights.to(torch.float32), topk_ids


Is it suitable for other cases?

I think so. I will evaluate on CUDA and check for regressions.

Verified on CUDA: there is no improvement, because the dtype conversions are fused into captured graph nodes (CompiledFxGraph) rather than appearing as standalone aten::copy_ calls.
No regression is observed on CUDA either.
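For reference, the guard from the diff hunk in this thread can be exercised in isolation. This is a sketch only: `finalize_topk_ids` is a hypothetical helper name introduced here, not a function in vLLM, and the hunk it mirrors is marked outdated in the PR.

```python
import torch

def finalize_topk_ids(topk_ids: torch.Tensor, target_dtype: torch.dtype) -> torch.Tensor:
    """Hypothetical helper mirroring the dtype guard quoted in the review thread."""
    # Convert only when the ids are neither the requested dtype nor already int64.
    if topk_ids.dtype != target_dtype and topk_ids.dtype != torch.int64:
        topk_ids = topk_ids.to(target_dtype)
    return topk_ids

ids_i64 = torch.tensor([[0, 3], [1, 2]], dtype=torch.int64)
ids_i32 = ids_i64.to(torch.int32)
ids_i16 = ids_i64.to(torch.int16)

kept_i64 = finalize_topk_ids(ids_i64, torch.int32)   # int64 passes through untouched
kept_i32 = finalize_topk_ids(ids_i32, torch.int32)   # already the target dtype
cast_i16 = finalize_topk_ids(ids_i16, torch.int32)   # the only case that converts

print(kept_i64.dtype, kept_i32.dtype, cast_i16.dtype)
```

Note that with this guard an int64 input stays int64 even when `target_dtype` is int32, which is the behavior the gemini-code-assist summary flags when it asks for the requested indices_type to be strictly respected.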

yintong-lu · 2026-04-24T08:01:11Z

@claude review

xinyu-intel · 2026-04-24T23:20:26Z

@jikunshang can you review?

Signed-off-by: Yintong <yintongl@DUT7604BMGFRD.fm.intel.com> Made-with: Cursor Signed-off-by: yintong-lu <yintong.lu@intel.com>

Signed-off-by: yintong-lu <yintong.lu@intel.com>

yintong-lu requested review from mgoin and pavanimajety as code owners April 22, 2026 02:17

claude Bot reviewed Apr 22, 2026

gemini-code-assist Bot reviewed Apr 22, 2026

Comment thread vllm/model_executor/layers/fused_moe/router/custom_routing_router.py Outdated

xinyu-intel reviewed Apr 22, 2026

kailashbuki mentioned this pull request Apr 30, 2026

[Kernel] Gemma4 MoE decode GEMV optimization — up to 46% TPOT improvement at BS=1-8 #41379

Closed

7 tasks

yintong-lu added 2 commits May 11, 2026 06:31

v1

af4fe5a

Signed-off-by: Yintong <yintongl@DUT7604BMGFRD.fm.intel.com> Made-with: Cursor Signed-off-by: yintong-lu <yintong.lu@intel.com>

add corner case check

2901451

Signed-off-by: yintong-lu <yintong.lu@intel.com>

yintong-lu force-pushed the gemma4-routing-int64 branch from 2fa6dc5 to 2901451 Compare May 11, 2026 06:31

yintong-lu requested a review from zyongye as a code owner May 11, 2026 06:31

yintong-lu closed this May 18, 2026

yintong-lu wants to merge 2 commits into vllm-project:main from yintong-lu:gemma4-routing-int64

2 participants
