optimized qmoe code path for 1 token #27383

Merged
guschmue merged 7 commits into main from gs/wgpu-qmoe-opt on Feb 26, 2026

@guschmue
Contributor

Avoids the GPU -> CPU copy in QMoE and removes 1 of the 6 shaders in QMoE.
This improves token generation on gpt-oss-20b by ~15%.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Feb 18, 2026
@guschmue guschmue requested a review from Copilot February 18, 2026 23:23
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes the QMoE (Quantized Mixture of Experts) operator for WebGPU by introducing a fast path for single-token inference, which avoids GPU-to-CPU data copies and reduces the number of shader dispatches. The optimization achieves approximately 15% improvement in token generation performance for gpt-oss-20b.

Changes:

  • Added weight_index_indirect tensor parameter to enable indirect expert indexing on GPU
  • Introduced specialized shaders (gate_1token.wgsl.template, final_mix_1token.wgsl.template) for the 1-token case
  • Modified all matmul_nbits shader variants to support indirect weight indexing via has_weight_idx_indirect parameter
  • Increased max_tokens from 512 to 2048 for better batching performance
  • Cleaned up QMoEFinalMixProgram by removing redundant used_by uniform variable
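
To illustrate the indirect expert indexing mentioned in the change list, here is a minimal CPU-side sketch in C++. It is not the PR's code: `MatVecIndirect`, the dense float weights, and the buffer layout are all assumptions made for illustration (the real kernels are WGSL and operate on quantized weights). The point it tries to show is that the kernel itself looks up the expert id in a buffer, so the host does not need to read the id back before dispatching.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU model of indirect expert indexing. Names and layout are
// illustrative only and are not taken from the PR's sources.
//
// expert_weights packs num_experts matrices of shape [rows x cols] back to back.
// weight_index_indirect stands in for a GPU-resident buffer whose element
// [slot] is the expert id written by an earlier gate step.
std::vector<float> MatVecIndirect(const std::vector<float>& expert_weights,
                                  const std::vector<uint32_t>& weight_index_indirect,
                                  std::size_t slot,
                                  const std::vector<float>& input,
                                  std::size_t rows,
                                  std::size_t cols) {
  assert(slot < weight_index_indirect.size());
  assert(input.size() == cols);

  // The expert id is read inside the "kernel", so the caller never inspects it.
  const std::size_t expert = weight_index_indirect[slot];
  const std::size_t base = expert * rows * cols;
  assert(base + rows * cols <= expert_weights.size());

  std::vector<float> out(rows, 0.0f);
  for (std::size_t r = 0; r < rows; ++r) {
    float acc = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
      acc += expert_weights[base + r * cols + c] * input[c];
    }
    out[r] = acc;
  }
  return out;
}
```

In this model the caller passes only a slot number (for example, "the first selected expert"); which expert that resolves to is decided by data that already lives next to the weights.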

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/webgpu/moe/qmoe.cc | Added 1-token optimization path and updated multi-token path |
| onnxruntime/contrib_ops/webgpu/moe/gate_1token.wgsl.template | New shader for computing top-k experts for single token |
| onnxruntime/contrib_ops/webgpu/moe/final_mix_1token.wgsl.template | New shader for mixing expert outputs for single token |
| onnxruntime/contrib_ops/webgpu/moe/gate.wgsl.template | Renamed input from hidden_state to router_logits for clarity |
| onnxruntime/contrib_ops/webgpu/moe/final_mix.wgsl.template | Updated comment to reflect uniform variable changes |
| onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits_apple.wgsl.template | Added support for indirect weight indexing |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_wide_tile.wgsl.template | Added support for indirect weight indexing |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.wgsl.template | Added support for indirect weight indexing |
| onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_small_m.wgsl.template | Added support for indirect weight indexing |
| onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul.wgsl.template | Added support for indirect weight indexing |
| onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits.h/cc | Added has_weight_idx_indirect parameter and weight_index_indirect input |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.h/cc | Added has_weight_idx_indirect parameter and weight_index_indirect input |
| onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_nbits.h/cc | Added has_weight_idx_indirect parameter and weight_index_indirect input |
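
For readers unfamiliar with what a "gate" step and a "final mix" step compute for one token, the following C++ sketch gives a rough CPU reference. It is an illustration under stated assumptions, not a port of the new shaders: `GateOneToken`, `FinalMixOneToken`, and `GateResult` are hypothetical names, and normalizing with a softmax over only the selected logits is an assumption that may differ from what the operator actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical result of the gate step for a single token.
struct GateResult {
  std::vector<uint32_t> expert_ids;  // k selected experts, highest logit first
  std::vector<float> weights;        // mixing weights, summing to 1
};

// Picks the k largest router logits and normalizes them.
// Assumption: softmax over the selected logits only.
GateResult GateOneToken(const std::vector<float>& router_logits, std::size_t k) {
  assert(k > 0 && k <= router_logits.size());
  std::vector<uint32_t> order(router_logits.size());
  std::iota(order.begin(), order.end(), 0u);
  const auto kth = order.begin() + static_cast<std::ptrdiff_t>(k);
  std::partial_sort(order.begin(), kth, order.end(),
                    [&](uint32_t a, uint32_t b) {
                      return router_logits[a] > router_logits[b];
                    });

  GateResult result;
  result.expert_ids.assign(order.begin(), kth);
  const float max_logit = router_logits[result.expert_ids[0]];
  float sum = 0.0f;
  for (uint32_t id : result.expert_ids) {
    const float e = std::exp(router_logits[id] - max_logit);  // shift for stability
    result.weights.push_back(e);
    sum += e;
  }
  for (float& w : result.weights) w /= sum;
  return result;
}

// Weighted sum of the k expert outputs for the single token.
std::vector<float> FinalMixOneToken(const std::vector<std::vector<float>>& expert_outputs,
                                    const std::vector<float>& weights) {
  assert(!expert_outputs.empty() && expert_outputs.size() == weights.size());
  std::vector<float> mixed(expert_outputs[0].size(), 0.0f);
  for (std::size_t i = 0; i < expert_outputs.size(); ++i) {
    assert(expert_outputs[i].size() == mixed.size());
    for (std::size_t j = 0; j < mixed.size(); ++j) {
      mixed[j] += weights[i] * expert_outputs[i][j];
    }
  }
  return mixed;
}
```

With a single token, `expert_ids` is just k integers. Keeping them in a GPU buffer and letting later kernels read them there is what makes the GPU-to-CPU copy unnecessary in this simplified picture.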

Comment thread onnxruntime/contrib_ops/webgpu/moe/gate_1token.wgsl.template Outdated
Comment thread onnxruntime/contrib_ops/webgpu/moe/qmoe.cc
Comment thread onnxruntime/contrib_ops/webgpu/moe/final_mix_1token.wgsl.template
Comment thread onnxruntime/contrib_ops/webgpu/moe/final_mix.wgsl.template Outdated
Comment thread onnxruntime/contrib_ops/webgpu/moe/qmoe.cc
guschmue and others added 2 commits February 19, 2026 08:21
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@guschmue guschmue marked this pull request as ready for review February 19, 2026 16:59
@guschmue guschmue merged commit 5f087c4 into main Feb 26, 2026
90 checks passed
@guschmue guschmue deleted the gs/wgpu-qmoe-opt branch February 26, 2026 00:59
Labels

ep:WebGPU ort-web webgpu provider

3 participants