Fix QMoE CPU Operator by tianleiwu · Pull Request #27360 · microsoft/onnxruntime

tianleiwu · 2026-02-16T19:32:04Z

This PR addresses several issues in the QMoE CPU implementation and improves MLAS documentation.

Changes

1. QMoE CPU Operator Fixes

Corrected Bias Handling: Renamed fc2_bias_handled_by_q4_gemm to fc2_bias_added_by_mlas and updated the logic to consistently track whether FC2 bias has been applied. This ensures that bias is not double-counted or missed when using DirectQ4Gemm.
SwiGLU Attribute Update: Switched from swiglu_interleaved to swiglu_fusion in both the C++ operator and the Python test infrastructure to align with the latest QMoE implementation standards.
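
The bias-tracking idea described above can be sketched in a few lines. This is only an illustration in NumPy, not the operator's C++ code: the function `fc2_with_bias` and its `use_direct_q4_gemm` parameter are hypothetical, and the "fused" branch merely stands in for a GEMM path that adds the bias itself. Only the flag name `fc2_bias_added_by_mlas` comes from the PR.

```python
import numpy as np

def fc2_with_bias(x, w, bias, use_direct_q4_gemm):
    """Hypothetical FC2 step: x is [M, K], w is [K, N], bias is [N] or None."""
    fc2_bias_added_by_mlas = False
    if use_direct_q4_gemm:
        # Stand-in for a fused GEMM path that adds the bias itself.
        out = x @ w
        if bias is not None:
            out = out + bias
            fc2_bias_added_by_mlas = True
    else:
        # Fallback path: plain GEMM, no bias applied yet.
        out = x @ w
    # Apply the bias exactly once: only if the GEMM path has not already done so.
    if bias is not None and not fc2_bias_added_by_mlas:
        out = out + bias
    return out

x = np.ones((2, 3), dtype=np.float32)
w = np.ones((3, 4), dtype=np.float32)
bias = np.full(4, 0.5, dtype=np.float32)
fused = fc2_with_bias(x, w, bias, use_direct_q4_gemm=True)
fallback = fc2_with_bias(x, w, bias, use_direct_q4_gemm=False)
```

Both paths should agree, which is the property the flag is meant to protect: a single flag set at the point where the bias is added makes it hard to add it twice or skip it.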

2. MLAS Documentation

Clarified Buffer Shapes: Added explicit documentation to MlasQ4GemmPackB to specify that the input FpData buffer expects a shape of [K, N]. This helps prevent layout-related errors in future integrations.
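
As a rough illustration of the layout requirement, the sketch below converts a weight matrix stored as [N, K] into a contiguous [K, N] array. It uses NumPy only and does not call MLAS; the helper name `to_k_by_n` is hypothetical, and the assumption that the source weights are stored as [N, K] in row-major order is made for the example.

```python
import numpy as np

def to_k_by_n(weight_nk):
    """Convert an [N, K] weight matrix to a contiguous [K, N] array."""
    # .T alone only swaps strides; ascontiguousarray rewrites the memory
    # so the buffer is laid out row-major as [K, N].
    return np.ascontiguousarray(weight_nk.T)

N, K = 4, 6
w_nk = np.arange(N * K, dtype=np.float32).reshape(N, K)
w_kn = to_k_by_n(w_nk)
```

Passing an [N, K] buffer where [K, N] is expected does not fail loudly when the element counts match, which is why documenting the shape explicitly is useful.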

3. Test Updates

PyTorch Parity Fixes: Refactored onnxruntime/test/python/transformers/test_qmoe_cpu.py to use swiglu_fusion and improved the test structure for better parity checks with PyTorch.
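
To make the interleaving point concrete, here is a minimal NumPy sketch of a parity check between an interleaved and a concatenated gate/linear layout. It assumes that `swiglu_fusion=1` means gate and linear outputs alternate along the last axis, and it uses plain SiLU gating; the operator's actual activation may apply additional scaling or clamping. The helpers `swiglu` and `interleave_rows` are hypothetical and are not taken from `test_qmoe_cpu.py`.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu(fc1_out, interleaved):
    """Apply SiLU(gate) * linear to an FC1 output of width 2N."""
    if interleaved:
        # Layout g0, l0, g1, l1, ... along the last axis.
        gate, linear = fc1_out[..., 0::2], fc1_out[..., 1::2]
    else:
        # Layout g0..g(N-1), l0..l(N-1) along the last axis.
        half = fc1_out.shape[-1] // 2
        gate, linear = fc1_out[..., :half], fc1_out[..., half:]
    return silu(gate) * linear

def interleave_rows(w_gate, w_linear):
    """Interleave the rows of two [N, K] matrices into one [2N, K] matrix."""
    out = np.empty((2 * w_gate.shape[0], w_gate.shape[1]), dtype=w_gate.dtype)
    out[0::2] = w_gate
    out[1::2] = w_linear
    return out

rng = np.random.default_rng(0)
M, K, N = 3, 8, 5
x = rng.standard_normal((M, K))
w_gate = rng.standard_normal((N, K))
w_linear = rng.standard_normal((N, K))

y_interleaved = swiglu(x @ interleave_rows(w_gate, w_linear).T, interleaved=True)
y_concat = swiglu(x @ np.concatenate([w_gate, w_linear]).T, interleaved=False)
```

If the weights are interleaved but the activation splits the output in halves (or the reverse), the two results diverge, so a parity test of this shape catches a mismatch between the attribute and the weight layout.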

Verification

Verified by running test_qmoe_cpu.py to ensure all QMoE parity tests pass on CPU.

Copilot

Pull request overview

This PR fixes issues in the QMoE CPU operator implementation, specifically correcting bias handling logic and updating attribute naming to match the actual C++ implementation. The changes also improve MLAS documentation for better clarity on input buffer layout requirements.

Changes:

Fixed FC2 bias handling in QMoE CPU operator by tracking when MLAS DirectQ4Gemm adds bias
Added transpose logic to convert weight matrices from [N, K] to [K, N] layout required by MlasQ4GemmPackB
Updated Python tests to use swiglu_fusion attribute instead of incorrect swiglu_interleaved attribute
Enhanced MLAS documentation to clarify that MlasQ4GemmPackB expects FpData with shape [K, N]
Added proper bias collection and passing in Python test infrastructure

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
onnxruntime/core/mlas/inc/mlas_q4.h	Updated documentation for MlasQ4GemmPackB to clarify FpData shape [K, N] and parameter meanings
onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc	Added transpose logic for weight matrices, renamed fc2_bias_handled_by_q4_gemm to fc2_bias_added_by_mlas, removed unused fc1_used_direct_q4 flag
onnxruntime/test/python/transformers/test_qmoe_cpu.py	Migrated from swiglu_interleaved to swiglu_fusion attribute, added bias collection/passing logic, updated swiglu function signature, improved weight interleaving for swiglu_fusion=1

This cherry-picks the following commits for the release:

Commit ID	PR Number	Commit Title
decd177	#27090	Fix GatherND division by zero when batch dimensions mismatch
55f8234	#27360	Fix QMoE CPU Operator
df9146f	#27403	[MLAS] Adding DynamicQGemm function pointers and ukernel interface
0f93853	#27318	[js/web] Use embedded WASM module in Blob URL workers when wasmBinary is provided
b2a6e69	#27364	QMoE CPU Performance Update (Up to 4x on 4-bit)
f501e1d	#27413	Fix refcount bug in map input conversion that caused shutdown segfault
b32b205	#27421	Fix error where bytes is not assigned for dynamic qgemm pack b size
426b006	#27397	Fix DllImportResolver
0982844	#27412	MatmulNBits prepacking scales fix
9afb0d2	#27430	Fix validation for external data paths for models loaded from bytes
71d2cd0	#27401	Enable Python 3.14 CI and Upgrade Dependencies
79e0676	#27419	fix: out of bounds access for resize operation
82eb99c	#27459	Fix SkipLayerNorm fusion incorrectly applied when gamma/beta are not 1D
355278a	#27444	Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write
cf96123	#27411	[web] fix usage of wasmBinary together with a blob URL for .mjs
1131a86	#27399	[web] remove the unhelpful "Unknown CPU vendor" warning.
ffbbc4f	#27316	Build Windows ARM64X binaries as part of packaging pipeline

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Co-authored-by: patryk-kaiser-ARM <patryk.kaiser@arm.com>
Co-authored-by: don <70039285+0-don@users.noreply.github.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Lukas Folle <126877803+lukas-folle-snkeos@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Chaya <cha182350@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Erik <erscor@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

Fix QMoE CPU

ed0478a

tianleiwu requested review from apsonawane and Copilot February 16, 2026 19:32

Copilot started reviewing on behalf of tianleiwu February 16, 2026 19:32

Copilot AI reviewed Feb 16, 2026

tianleiwu added release:1.24.2 and removed release:1.24.2 labels Feb 17, 2026

apsonawane approved these changes Feb 18, 2026

tianleiwu merged commit 55f8234 into main Feb 18, 2026
96 checks passed

tianleiwu deleted the tlwu/20260216/fix_qmoe_cpu branch February 18, 2026 18:40

tianleiwu added the release:1.24.3 label Feb 27, 2026

tianleiwu mentioned this pull request Feb 27, 2026

ORT 1.24.3 release cherry pick round 1 #27476

Merged

tianleiwu removed the release:1.24.3 label Feb 28, 2026

This was referenced Mar 9, 2026

Bump Microsoft.ML.OnnxRuntime.Gpu from 1.23.2 to 1.24.3 yuniko-software/bge-m3-onnx#66

Closed

deps(nuget): Bump the microsoft-packages group with 2 updates Ellerbach/azure-ai-search-simulator#73

Closed

tianleiwu merged 1 commit into main from tlwu/20260216/fix_qmoe_cpu
