optimized qmoe code path for 1 token by guschmue · Pull Request #27383 · microsoft/onnxruntime

guschmue · 2026-02-18T23:19:18Z

Avoids the GPU -> CPU copy in QMoE and removes 1 of the 6 shaders in QMoE.
This improves token generation on gpt-oss-20b by ~15%.

Copilot

Pull request overview

This PR optimizes the QMoE (Quantized Mixture of Experts) operator for WebGPU by introducing a fast path for single-token inference, which avoids GPU-to-CPU data copies and reduces the number of shader dispatches. The optimization achieves approximately 15% improvement in token generation performance for gpt-oss-20b.

Changes:

Added weight_index_indirect tensor parameter to enable indirect expert indexing on GPU
Introduced specialized shaders (gate_1token.wgsl.template, final_mix_1token.wgsl.template) for the 1-token case
Modified all matmul_nbits shader variants to support indirect weight indexing via has_weight_idx_indirect parameter
Increased max_tokens from 512 to 2048 for better batching performance
Cleaned up QMoEFinalMixProgram by removing redundant used_by uniform variable
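
The fast path described above can be illustrated with a minimal Python sketch; it is not the PR's WGSL code. It assumes the gate picks the k largest router logits and softmax-normalizes them (the PR text does not state the normalization), and it models weight_index_indirect as a plain list from which the matmul reads its expert id. The names gate_1token_sketch, matvec_indirect, and slot are hypothetical.

```python
import math


def gate_1token_sketch(router_logits, k):
    """Pick the k largest router logits for one token.

    Assumption: the selected logits are softmax-normalized; the PR
    description does not say which normalization the shader uses.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def matvec_indirect(x, expert_weights, weight_index_indirect, slot):
    """Multiply x by the expert matrix whose id is stored in a buffer.

    weight_index_indirect stands in for a GPU-resident buffer: the expert
    id is looked up here, inside the kernel, instead of being passed in
    as a value the host had to read back first.
    """
    expert = weight_index_indirect[slot]
    matrix = expert_weights[expert]
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]
```

Because the expert id is looked up inside the matmul, the host would not need to read the gate result back before dispatching the expert matmuls, which is the kind of GPU-to-CPU copy the PR reports avoiding.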

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.


File	Description
onnxruntime/contrib_ops/webgpu/moe/qmoe.cc	Added 1-token optimization path and updated multi-token path
onnxruntime/contrib_ops/webgpu/moe/gate_1token.wgsl.template	New shader for computing top-k experts for single token
onnxruntime/contrib_ops/webgpu/moe/final_mix_1token.wgsl.template	New shader for mixing expert outputs for single token
onnxruntime/contrib_ops/webgpu/moe/gate.wgsl.template	Renamed input from hidden_state to router_logits for clarity
onnxruntime/contrib_ops/webgpu/moe/final_mix.wgsl.template	Updated comment to reflect uniform variable changes
onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits_apple.wgsl.template	Added support for indirect weight indexing
onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_wide_tile.wgsl.template	Added support for indirect weight indexing
onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.wgsl.template	Added support for indirect weight indexing
onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_small_m.wgsl.template	Added support for indirect weight indexing
onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul.wgsl.template	Added support for indirect weight indexing
onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits.h/cc	Added has_weight_idx_indirect parameter and weight_index_indirect input
onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.h/cc	Added has_weight_idx_indirect parameter and weight_index_indirect input
onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_nbits.h/cc	Added has_weight_idx_indirect parameter and weight_index_indirect input
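
As a rough illustration of what "mixing expert outputs for a single token" could mean, the sketch below combines the outputs of the selected experts as a weighted sum. This is an assumption about the usual mixture-of-experts combination step, not a transcription of final_mix_1token.wgsl.template; the function name final_mix_1token_sketch and its parameters are hypothetical.

```python
def final_mix_1token_sketch(expert_outputs, gate_weights):
    """Weighted sum of per-expert output vectors for one token.

    expert_outputs: one output vector per selected expert (all same length).
    gate_weights:   one gate weight per selected expert.
    Assumption: the mix is a plain weighted sum with no extra scaling.
    """
    hidden = len(expert_outputs[0])
    mixed = [0.0] * hidden
    for output, weight in zip(expert_outputs, gate_weights):
        for j in range(hidden):
            mixed[j] += weight * output[j]
    return mixed
```

With a single token there is one output row, so a kernel of this shape would not need the per-token bookkeeping that a multi-token path requires.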


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

guschmue added 5 commits February 13, 2026 18:06

add weight_index_indirect to nbitmm

acabc51

optimized qmoe code pass for 1 token

41701c8

missing file

7e4b875

add missing file

aa177a1

Merge branch 'main' into gs/wgpu-qmoe-opt

327763e

guschmue added the ep:WebGPU ort-web webgpu provider label Feb 18, 2026

guschmue requested a review from Copilot February 18, 2026 23:23

Copilot started reviewing on behalf of guschmue February 18, 2026 23:24

Copilot AI reviewed Feb 18, 2026


guschmue and others added 2 commits February 19, 2026 08:21

Update onnxruntime/contrib_ops/webgpu/moe/gate_1token.wgsl.template

89c648a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update onnxruntime/contrib_ops/webgpu/moe/final_mix.wgsl.template

957e65f

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

guschmue marked this pull request as ready for review February 19, 2026 16:59

fs-eire approved these changes Feb 26, 2026


guschmue merged commit 5f087c4 into main Feb 26, 2026
90 checks passed

guschmue deleted the gs/wgpu-qmoe-opt branch February 26, 2026 00:59

BrewTestBot mentioned this pull request Apr 20, 2026

onnxruntime 1.25.0 Homebrew/homebrew-core#278543

Merged

guschmue merged 7 commits into main from gs/wgpu-qmoe-opt

3 participants
