
Fix QMoE CPU Operator #27360

Merged

tianleiwu merged 1 commit into main from tlwu/20260216/fix_qmoe_cpu on Feb 18, 2026

Conversation

@tianleiwu
Contributor

This PR addresses several issues in the QMoE CPU implementation and improves the MLAS documentation.

Changes

1. QMoE CPU Operator Fixes

  • Corrected Bias Handling: Renamed fc2_bias_handled_by_q4_gemm to fc2_bias_added_by_mlas and updated the logic to consistently track whether FC2 bias has been applied. This ensures that bias is not double-counted or missed when using DirectQ4Gemm.
  • SwiGLU Attribute Update: Switched from swiglu_interleaved to swiglu_fusion in both the C++ operator and the Python test infrastructure to align with the latest QMoE implementation standards.
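The corrected bias flow can be sketched in Python (a minimal parity-style sketch, not the actual C++ code; `fc2_with_optional_fused_bias` and `use_direct_q4_gemm` are hypothetical names that illustrate the invariant the `fc2_bias_added_by_mlas` flag enforces):

```python
import numpy as np

def fc2_with_optional_fused_bias(x, w, bias, use_direct_q4_gemm):
    # Sketch of the control flow around fc2_bias_added_by_mlas: the MLAS
    # DirectQ4Gemm path can fold the FC2 bias into the GEMM itself, so a flag
    # records whether the bias has already been applied.
    fc2_bias_added_by_mlas = False
    if use_direct_q4_gemm:
        y = x @ w + bias              # stand-in for MlasQ4Gemm with fused bias
        fc2_bias_added_by_mlas = True
    else:
        y = x @ w                     # fallback path: bias not applied yet
    if not fc2_bias_added_by_mlas:
        y = y + bias                  # apply the bias exactly once
    return y
```

Both branches produce the same result; without such a flag, the fast path could add the bias twice, or the fallback could drop it.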

2. MLAS Documentation

  • Clarified Buffer Shapes: Added explicit documentation to MlasQ4GemmPackB to specify that the input FpData buffer expects a shape of [K, N]. This helps prevent layout-related errors in future integrations.
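As a concrete illustration of that layout contract (numpy sketch with arbitrary example sizes): a weight stored as `[N, K]` must be transposed into a contiguous `[K, N]` buffer before it can serve as `FpData` for `MlasQ4GemmPackB`.

```python
import numpy as np

# ORT-style weights are stored [N, K] (out_features x in_features), while
# MlasQ4GemmPackB documents FpData as [K, N]. A contiguous transpose fixes
# the layout before packing; N and K below are arbitrary example sizes.
N, K = 8, 16
w_nk = np.arange(N * K, dtype=np.float32).reshape(N, K)
w_kn = np.ascontiguousarray(w_nk.T)  # row-major [K, N] buffer for packing

assert w_kn.shape == (K, N)
assert w_kn[3, 5] == w_nk[5, 3]      # element (k, n) == element (n, k)
```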

3. Test Updates

  • PyTorch Parity Fixes: Refactored onnxruntime/test/python/transformers/test_qmoe_cpu.py to use swiglu_fusion and improved the test structure for better parity checks with PyTorch.
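For context, a simplified SwiGLU reference in the spirit of such a parity helper (`swiglu_ref` is a hypothetical name; the actual test may include alpha/clamping details not shown here, and the interleaved layout is assumed to be pairwise gate/linear):

```python
import numpy as np

def swiglu_ref(x, interleaved):
    # x carries 2*H features per row: a gate half and a linear half. How the
    # halves are laid out depends on how the FC1 weights were fused.
    if interleaved:
        # Pairwise interleaving (assumed convention): g0, l0, g1, l1, ...
        gate, linear = x[..., 0::2], x[..., 1::2]
    else:
        # Plain concatenation: [gate | linear]
        gate, linear = np.split(x, 2, axis=-1)
    return gate * (1.0 / (1.0 + np.exp(-gate))) * linear  # SiLU(gate) * linear
```

The two layouts are permutations of each other, so both branches must agree once the inputs are permuted consistently, which is exactly the kind of equivalence a parity test checks.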

Verification

  • Verified by running test_qmoe_cpu.py to ensure all QMoE parity tests pass on CPU.

Copilot AI left a comment

Pull request overview

This PR fixes issues in the QMoE CPU operator implementation, specifically correcting bias handling logic and updating attribute naming to match the actual C++ implementation. The changes also improve MLAS documentation for better clarity on input buffer layout requirements.

Changes:

  • Fixed FC2 bias handling in QMoE CPU operator by tracking when MLAS DirectQ4Gemm adds bias
  • Added transpose logic to convert weight matrices from [N, K] to [K, N] layout required by MlasQ4GemmPackB
  • Updated Python tests to use swiglu_fusion attribute instead of incorrect swiglu_interleaved attribute
  • Enhanced MLAS documentation to clarify that MlasQ4GemmPackB expects FpData with shape [K, N]
  • Added proper bias collection and passing in Python test infrastructure

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| onnxruntime/core/mlas/inc/mlas_q4.h | Updated documentation for `MlasQ4GemmPackB` to clarify `FpData` shape `[K, N]` and parameter meanings |
| onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc | Added transpose logic for weight matrices, renamed `fc2_bias_handled_by_q4_gemm` to `fc2_bias_added_by_mlas`, removed unused `fc1_used_direct_q4` flag |
| onnxruntime/test/python/transformers/test_qmoe_cpu.py | Migrated from `swiglu_interleaved` to `swiglu_fusion` attribute, added bias collection/passing logic, updated swiglu function signature, improved weight interleaving for `swiglu_fusion=1` |
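The weight interleaving mentioned for `test_qmoe_cpu.py` can be sketched as follows (hypothetical numpy sketch, assuming `swiglu_fusion=1` means the fused FC1 output alternates gate and linear rows):

```python
import numpy as np

# Build a fused FC1 weight whose output rows alternate gate/linear, so a
# downstream SwiGLU can read even outputs as gate and odd outputs as linear.
# H (intermediate size) and K (hidden size) are arbitrary example sizes.
H, K = 4, 6
w_gate = np.full((H, K), 2.0, dtype=np.float32)
w_linear = np.full((H, K), 3.0, dtype=np.float32)

w_fused = np.empty((2 * H, K), dtype=np.float32)
w_fused[0::2] = w_gate    # even rows -> gate
w_fused[1::2] = w_linear  # odd rows  -> linear
```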


@tianleiwu tianleiwu merged commit 55f8234 into main Feb 18, 2026
96 checks passed
@tianleiwu tianleiwu deleted the tlwu/20260216/fix_qmoe_cpu branch February 18, 2026 18:40
tianleiwu added a commit that referenced this pull request Feb 27, 2026
tianleiwu added a commit that referenced this pull request Feb 27, 2026
This cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|--------------|
| decd177 | #27090 | Fix GatherND division by zero when batch dimensions mismatch |
| 55f8234 | #27360 | Fix QMoE CPU Operator |
| df9146f | #27403 | [MLAS] Adding DynamicQGemm function pointers and ukernel interface |
| 0f93853 | #27318 | [js/web] Use embedded WASM module in Blob URL workers when wasmBinary is provided |
| b2a6e69 | #27364 | QMoE CPU Performance Update (Up to 4x on 4-bit) |
| f501e1d | #27413 | Fix refcount bug in map input conversion that caused shutdown segfault |
| b32b205 | #27421 | Fix error where bytes is not assigned for dynamic qgemm pack b size |
| 426b006 | #27397 | Fix DllImportResolver |
| 0982844 | #27412 | MatmulNBits prepacking scales fix |
| 9afb0d2 | #27430 | Fix validation for external data paths for models loaded from bytes |
| 71d2cd0 | #27401 | Enable Python 3.14 CI and Upgrade Dependencies |
| 79e0676 | #27419 | fix: out of bounds access for resize operation |
| 82eb99c | #27459 | Fix SkipLayerNorm fusion incorrectly applied when gamma/beta are not 1D |
| 355278a | #27444 | Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write |
| cf96123 | #27411 | [web] fix usage of wasmBinary together with a blob URL for .mjs |
| 1131a86 | #27399 | [web] remove the unhelpful "Unknown CPU vendor" warning. |
| ffbbc4f | #27316 | Build Windows ARM64X binaries as part of packaging pipeline |
---------

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Co-authored-by: patryk-kaiser-ARM <patryk.kaiser@arm.com>
Co-authored-by: don <70039285+0-don@users.noreply.github.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Lukas Folle <126877803+lukas-folle-snkeos@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Chaya <cha182350@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Erik <erscor@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
