[Perf] Gemma4: output int64 from fused routing kernel to avoid redundant dtype copy#40565

Closed
yintong-lu wants to merge 2 commits into vllm-project:main from yintong-lu:gemma4-routing-int64

@yintong-lu
Contributor

Summary:
This PR is a further optimization based on PR [https://github.com//pull/39083].
The Gemma4 fused Triton routing kernel (#39083) outputs topk_ids as int32, but all downstream consumers require int64:

  • remap_hidden_states C++ kernel uses int64 indexing
  • cpu_fused_moe.py: topk_ids.to(torch.int64) for scatter
  • gpt_oss_triton_kernels_moe.py: topk_ids_raw.to(torch.long)
  • compressed_tensors_moe: topk_ids.to(torch.long)

This mismatch causes ~30 unnecessary aten::copy_ calls per forward pass (one int32→int64 conversion per MoE layer), which becomes a measurable bottleneck on bandwidth-constrained platforms.
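A minimal sketch of why the output dtype matters, using plain PyTorch on CPU rather than the vLLM kernels. `Tensor.to` returns the same tensor when the dtype already matches, but allocates and fills a new buffer when converting int32 to int64. The shapes and variable names below are illustrative assumptions, not taken from the PR.

```python
import torch

# Illustrative sizes only (assumed): 8 tokens, top-2 routing over 16 experts.
num_tokens, top_k, num_experts = 8, 2, 16

# Router output as int32, as the fused kernel produced before this PR.
topk_ids_i32 = torch.randint(
    0, num_experts, (num_tokens, top_k), dtype=torch.int32
)
# Router output as int64, as proposed in this PR.
topk_ids_i64 = topk_ids_i32.to(torch.int64)

# A downstream consumer that needs int64, mirroring the
# `.to(torch.int64)` / `.to(torch.long)` calls listed above.
converted_from_i32 = topk_ids_i32.to(torch.int64)
converted_from_i64 = topk_ids_i64.to(torch.int64)

# int32 input: a new buffer is allocated and filled (a real copy).
print(converted_from_i32.data_ptr() != topk_ids_i32.data_ptr())
# int64 input: `.to` is a no-op and returns the very same tensor object.
print(converted_from_i64 is topk_ids_i64)
```

Both checks print `True`: only the int32 path pays for a conversion, which is the per-layer cost this PR aims to remove.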

Performance (Intel XPU × 4, TP=4, Gemma4-26B-A4B-it)

Relative to PR #39083 baseline (commit 45232a454):

| Scenario | copy_ calls | copy_ XPU% | Throughput | TPOT |
|---|---|---|---|---|
| Prefill c=1 | 176 → 116 (−60) | 5.67% → 1.18% | −4.9% (noise) | |
| Prefill c=16 | 174 → 114 (−60) | — → 1.19% | +4.3% | |
| Decode c=1 | 1,706 → 1,106 (−600) | — → 1.99% | +1.2% | −1.6% |
| Decode c=16 | 1,011 → 651 (−360) | — → 1.38% | +5.1% | −5.8% |
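
The PR does not say how the `copy_` call counts were collected. One possible way to obtain comparable counts is `torch.profiler`, sketched here on CPU with a stand-in loop. The helper name `count_copy_calls` and the 30-iteration loop (echoing the "~30 per forward pass" figure) are assumptions for illustration, not the author's benchmark setup.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def count_copy_calls(dtype: torch.dtype, num_layers: int = 30) -> int:
    """Count aten::copy_ events when each stand-in layer converts ids to int64."""
    ids = torch.zeros(4, 2, dtype=dtype)  # created outside the profiled region
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for _ in range(num_layers):
            ids.to(torch.int64)  # no-op if ids is already int64
    return sum(
        evt.count for evt in prof.key_averages() if evt.key == "aten::copy_"
    )


copies_int32 = count_copy_calls(torch.int32)
copies_int64 = count_copy_calls(torch.int64)
print(f"int32 router output: {copies_int32} aten::copy_ calls")
print(f"int64 router output: {copies_int64} aten::copy_ calls")
```

On a real model the same filter over `prof.key_averages()` could be applied to a profiled forward pass; absolute numbers will depend on the backend and on whether graph capture fuses the conversions.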

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@yintong-lu
Contributor Author

@claude review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request modifies the routing logic in custom_routing_router.py to handle topk_ids dtypes and updates the Gemma4 model to use int64 for routing IDs. A review comment suggests refining the dtype conversion in custom_routing_router.py to ensure that the requested indices_type is strictly respected, which prevents redundant memory copies and potential type inconsistencies in downstream functions.
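
The difference between the guard quoted in the comment thread on custom_routing_router.py and a stricter reading of this suggestion can be sketched as follows. This is an illustration under assumptions: `convert_lenient` and `convert_strict` are hypothetical helper names, and the strict variant is only one interpretation of "strictly respected", not code from the PR or from the reviewer.

```python
import torch


def convert_lenient(topk_ids: torch.Tensor, target_dtype: torch.dtype) -> torch.Tensor:
    """Guard as quoted in the review thread: int64 ids pass through unchanged."""
    if topk_ids.dtype != target_dtype and topk_ids.dtype != torch.int64:
        topk_ids = topk_ids.to(target_dtype)
    return topk_ids


def convert_strict(topk_ids: torch.Tensor, target_dtype: torch.dtype) -> torch.Tensor:
    """Hypothetical strict variant: always return exactly the requested dtype."""
    if topk_ids.dtype != target_dtype:
        topk_ids = topk_ids.to(target_dtype)
    return topk_ids


ids_i64 = torch.tensor([[3, 7], [1, 4]], dtype=torch.int64)

# The caller asks for int32, but the kernel already produced int64.
lenient = convert_lenient(ids_i64, torch.int32)
strict = convert_strict(ids_i64, torch.int32)

print(lenient.dtype)  # stays int64: no copy, but not the requested dtype
print(strict.dtype)   # int32: requested dtype honored, at the cost of one copy
```

The lenient guard avoids the copy but can hand a caller a dtype it did not ask for; the strict guard never surprises the caller but reintroduces a conversion whenever the kernel output and the requested type differ.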

Comment thread vllm/model_executor/layers/fused_moe/router/custom_routing_router.py Outdated
if topk_ids.dtype != target_dtype and topk_ids.dtype != torch.int64:
    topk_ids = topk_ids.to(target_dtype)

return topk_weights.to(torch.float32), topk_ids
Contributor

Is it suitable for other cases?

Contributor Author

I think so. I will evaluate on CUDA to check for regressions.

Contributor Author

Verified on CUDA: there is no improvement, because the dtype conversions are fused into captured graph nodes (CompiledFxGraph) rather than appearing as standalone aten::copy_ calls.
No regression is observed on CUDA either.

@yintong-lu
Contributor Author

@claude review

@xinyu-intel
Contributor

@jikunshang can you review?

Signed-off-by: Yintong <yintongl@DUT7604BMGFRD.fm.intel.com>
Made-with: Cursor
Signed-off-by: yintong-lu <yintong.lu@intel.com>
Signed-off-by: yintong-lu <yintong.lu@intel.com>
@yintong-lu yintong-lu force-pushed the gemma4-routing-int64 branch from 2fa6dc5 to 2901451 Compare May 11, 2026 06:31
@yintong-lu yintong-lu requested a review from zyongye as a code owner May 11, 2026 06:31
@yintong-lu yintong-lu closed this May 18, 2026