QMoE CPU Performance Update (Up to 4x on 4-bit) by tianleiwu · Pull Request #27364 · microsoft/onnxruntime

tianleiwu · 2026-02-17T03:12:02Z

Summary

This change improves QMoE CPU performance by moving more work to prepack time and enabling the DirectQ4 GEMM fast path where appropriate, while preserving an env-var switch for performance/accuracy A/B testing.

This PR introduces:

- Prepack and cache infrastructure for QMoE expert weights.
- DirectQ4 packed-B cache built during prepack (instead of mutable runtime cache in `Compute()`).
- Fast-path support for block-wise cases (including block size 32 where supported by MLAS Q4 type).
- Runtime toggle via `ORT_USE_MLAS_Q4_GEMM_MOE`.
- Default fast-path policy refined to avoid known accuracy-loss scenarios unless explicitly overridden by env var.
- Test and benchmark refinements for QMoE CPU validation.

Key Implementation Changes

1. Prepack-time cache build

- Moves DirectQ4 packed-B cache construction to prepack stage.
- Removes mutable runtime cache maintenance from `Compute()`.
- Reduces per-inference overhead and avoids mutable shared cache complexity.
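
The prepack-versus-runtime distinction can be illustrated with a minimal Python sketch. The names below (`ToyQMoEKernel`, `prepack`, `compute`, `pack_weights`) are hypothetical and only mimic the control flow described above; they are not the ORT kernel API, and `pack_weights` is a stand-in for the real packing routine.

```python
from types import MappingProxyType


def pack_weights(weights):
    """Stand-in for an expensive packing step (hypothetical, not MLAS)."""
    return tuple(weights)


class ToyQMoEKernel:
    """Toy kernel: the packed cache is built once in prepack(), then only read."""

    def __init__(self):
        self._packed = MappingProxyType({})  # read-only view, empty until prepack
        self.pack_calls = 0

    def prepack(self, expert_weights):
        # Build the whole cache up front, then freeze it behind a read-only view.
        cache = {}
        for expert_id, weights in expert_weights.items():
            cache[expert_id] = pack_weights(weights)
            self.pack_calls += 1
        self._packed = MappingProxyType(cache)

    def compute(self, expert_id, x):
        # No packing and no cache mutation here: a lookup plus the math.
        packed = self._packed[expert_id]
        return sum(w * xi for w, xi in zip(packed, x))


kernel = ToyQMoEKernel()
kernel.prepack({0: [1.0, 2.0], 1: [0.5, 0.5]})
outputs = [kernel.compute(0, [3.0, 4.0]) for _ in range(3)]
```

Because `compute` never writes to the cache, repeated or concurrent calls need no locking around it in this toy model, which is the kind of simplification the list above refers to.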

2. Fast path vs fallback

- Keeps two execution modes:
  - DirectQ4 GEMM fast path (`MlasQ4GemmPackB` + `DirectQ4Gemm` cache usage).
  - Fallback path (`DequantizePrePacked` + `MlasGemm`).
- Allows controlled fallback for accuracy-sensitive configurations.
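
As a rough illustration of how the two modes relate, the sketch below computes one dot product over block-wise 4-bit weights in two ways: dequantize first and then multiply (fallback style), or accumulate on the 4-bit codes and apply each block scale once (direct style). Everything here is an assumption made for illustration: the symmetric scheme with an implied zero point of 8, the function names, and the data layout are not taken from MLAS or from this PR.

```python
import random

BLOCK = 32  # block size along K, matching the block-wise case mentioned above


def quantize_blockwise(column):
    """Symmetric 4-bit quantization of one weight column, one scale per block."""
    codes, scales = [], []
    for start in range(0, len(column), BLOCK):
        block = column[start:start + BLOCK]
        scale = max(abs(v) for v in block) / 7.0 or 1.0  # avoid a zero scale
        scales.append(scale)
        # Codes are unsigned nibbles 0..15 with an implied zero point of 8.
        codes.extend(max(0, min(15, round(v / scale) + 8)) for v in block)
    return codes, scales


def fallback_dot(a, codes, scales):
    """Fallback style: dequantize the whole column, then a plain dot product."""
    dequant = [(q - 8) * scales[i // BLOCK] for i, q in enumerate(codes)]
    return sum(x * w for x, w in zip(a, dequant))


def direct_dot(a, codes, scales):
    """Direct style: accumulate on the codes per block, apply the scale once."""
    total = 0.0
    for b, scale in enumerate(scales):
        lo = b * BLOCK  # assumes len(codes) is a multiple of BLOCK
        partial = sum(a[i] * (codes[i] - 8) for i in range(lo, lo + BLOCK))
        total += scale * partial
    return total


rng = random.Random(0)
K = 64
weights = [rng.uniform(-1, 1) for _ in range(K)]
activations = [rng.uniform(-1, 1) for _ in range(K)]
codes, scales = quantize_blockwise(weights)
diff = abs(fallback_dot(activations, codes, scales)
           - direct_dot(activations, codes, scales))
```

The two orderings are mathematically identical, so `diff` is at most a floating-point rounding residue. The real fast path may differ from the fallback in other ways that the PR description does not spell out, so this sketch should not be read as an explanation of the reported accuracy differences.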

3. Environment variable behavior

- `ORT_USE_MLAS_Q4_GEMM_MOE=1`: force fast path when supported.
- `ORT_USE_MLAS_Q4_GEMM_MOE=0`: force fallback path.
- Unset: use default policy that enables fast path unless a known accuracy-loss pattern is detected.
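
The three cases above amount to a small decision table. The following Python sketch restates it; `select_path` and its boolean parameters are hypothetical, and treating any value other than `0` or `1` as "unset" is an assumption, since the description does not say how other values are handled.

```python
import os

ENV_VAR = "ORT_USE_MLAS_Q4_GEMM_MOE"


def select_path(fast_path_supported, known_accuracy_risk, environ=None):
    """Return "fast" or "fallback" following the policy described above."""
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_VAR)

    if value == "0":
        return "fallback"  # explicitly forced fallback
    if not fast_path_supported:
        return "fallback"  # "=1" only forces the fast path when supported
    if value == "1":
        return "fast"  # explicit override, even for risky configurations
    # Unset (or, by assumption, any other value): default policy.
    return "fallback" if known_accuracy_risk else "fast"


decisions = {
    "forced_on_risky": select_path(True, True, {ENV_VAR: "1"}),
    "forced_off": select_path(True, False, {ENV_VAR: "0"}),
    "default_risky": select_path(True, True, {}),
    "default_safe": select_path(True, False, {}),
    "forced_on_unsupported": select_path(False, False, {ENV_VAR: "1"}),
}
```

Passing `environ` explicitly keeps the sketch easy to exercise without touching the real process environment.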

4. Test updates

- QMoE CPU tests were refined to validate env-var on/off behavior and no-env behavior.
- Coverage includes parity checks for symmetric/asymmetric, row-wise/block-wise settings.
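
One way to exercise the three modes from Python is a small context manager that sets or clears the variable and restores the previous state afterwards. This is a sketch only: `q4_gemm_env` is a hypothetical helper, not code from `test_qmoe_cpu.py`, and it assumes the variable is read while the session (and its kernels) is created, so a real test would build the session inside the `with` block.

```python
import os
from contextlib import contextmanager

ENV_VAR = "ORT_USE_MLAS_Q4_GEMM_MOE"


@contextmanager
def q4_gemm_env(value):
    """Temporarily set the variable to "0"/"1", or unset it when value is None."""
    previous = os.environ.get(ENV_VAR)
    try:
        if value is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = value
        yield
    finally:
        # Restore whatever was there before, including "not set".
        if previous is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous


seen = []
for mode in ("0", "1", None):
    with q4_gemm_env(mode):
        # A real test would create the session and compare outputs here.
        seen.append(os.environ.get(ENV_VAR))
```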

Benchmark Results (1000 inferences, `benchmark_qmoe.py`)

Note: PyTorch latency fluctuates across runs and is excluded from conclusions below.

ORT results comparison

| Config | Baseline ORT Time (ms) | Baseline ORT tok/s | New ORT Time (env=0) (ms) | New ORT tok/s (env=0) | New ORT Time (env=1) (ms) | New ORT tok/s (env=1) |
|---|---:|---:|---:|---:|---:|---:|
| Medium-4bit | 748.594 | 1.3 | 237.219 | 4.2 | 178.943 | 5.6 |
| Medium-8bit | 209.277 | 4.8 | 212.074 | 4.7 | 203.882 | 4.9 |

ORT speedup vs baseline

| Config | env=0 speedup vs baseline (time) | env=1 speedup vs baseline (time) |
|---|---:|---:|
| Medium-4bit | 3.16x faster | 4.18x faster |
| Medium-8bit | 0.99x (about flat) | 1.03x faster |
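
The speedup figures follow directly from the reported times as baseline time divided by new time. The short Python sketch below reproduces them. The tok/s helper assumes one token per inference with the time column read as milliseconds per inference; this is an inference from the numbers (it matches every reported tok/s value), not something the description states.

```python
# Times in milliseconds, copied from the comparison table above.
TIMES_MS = {
    "Medium-4bit": {"baseline": 748.594, "env=0": 237.219, "env=1": 178.943},
    "Medium-8bit": {"baseline": 209.277, "env=0": 212.074, "env=1": 203.882},
}


def speedup(baseline_ms, new_ms):
    """Speedup as a time ratio: values above 1.0 mean the new build is faster."""
    return baseline_ms / new_ms


def tokens_per_second(time_ms):
    """Assumes one token per inference and time_ms measured per inference."""
    return 1000.0 / time_ms


speedups = {
    config: {
        mode: round(speedup(times["baseline"], times[mode]), 2)
        for mode in ("env=0", "env=1")
    }
    for config, times in TIMES_MS.items()
}
throughput = {
    config: {mode: round(tokens_per_second(ms), 1) for mode, ms in times.items()}
    for config, times in TIMES_MS.items()
}
```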

Accuracy Notes

- `env=1` (forced fast path) provides the best 4-bit performance but may show non-zero max diff in known cases.
- `env=0` (fallback) maintains parity behavior with zero observed max diff in the reported benchmark table.
- Default no-env policy is designed to avoid known accuracy-loss cases while still enabling fast path where safe.
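
For readers unfamiliar with the metric, "max diff" is read here as the largest absolute element-wise difference between a reference output and the ORT output. A minimal sketch of such a check follows; the helper names, sample values, and tolerance are illustrative assumptions and do not come from the PR's tests or benchmark.

```python
def max_abs_diff(reference, candidate):
    """Largest absolute element-wise difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def within_tolerance(reference, candidate, atol):
    """True when every element of candidate is within atol of reference."""
    return max_abs_diff(reference, candidate) <= atol


reference_out = [1.0, 2.0, 3.0]   # made-up reference values
fallback_out = [1.0, 2.0, 3.0]    # made-up output with exact parity
fast_out = [1.0, 2.5, 3.0]        # made-up output with a visible difference

fallback_diff = max_abs_diff(reference_out, fallback_out)
fast_diff = max_abs_diff(reference_out, fast_out)
```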

Copilot

Pull request overview

This PR introduces prepack-time optimization for QMoE (Quantized Mixture of Experts) CPU operations, achieving up to 4x performance improvement for 4-bit quantization by moving weight preprocessing and DirectQ4 GEMM cache building from runtime to initialization time. The implementation adds environment variable controls for A/B testing between fast-path (DirectQ4) and fallback (dequantize + MlasGemm) execution modes.

Changes:

- Implements PrePack and UseSharedPrePackedBuffers for weight unpacking and cache building at initialization
- Adds ORT_USE_MLAS_Q4_GEMM_MOE environment variable for runtime path selection with smart defaults
- Extends test coverage to validate all execution modes (env=0, env=1, no env) for 4-bit configurations
- Updates attribute naming from swiglu_interleaved to swiglu_fusion for consistency with operator conventions

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
|---|---|
| test_qmoe_cpu.py | Adds test expansion for env var modes, updates swiglu_fusion attribute, adds bias support, removes debug code |
| benchmark_qmoe.py | New benchmark file for measuring QMoE throughput across configurations |
| mlas_q4.h | Clarifies documentation for MlasQ4GemmPackB expected data layout ([K, N] format) |
| debug_node_inputs_outputs_utils.cc | Updates debug message to indicate pre-packed tensors may have missing type info |
| moe_quantization_cpu.h | Adds PrePack/UseSharedPrePackedBuffers methods, cache storage, and env var control flags |
| moe_quantization_cpu.cc | Implements weight prepacking, DirectQ4 cache building, environment variable handling, and dual execution paths |
| moe_helper.h | Adds nullable weight tensor handling for prepacked weights with fallback logic |

onnxruntime/contrib_ops/cpu/moe/moe_helper.h

onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h

onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc

onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

onnxruntime/test/python/transformers/benchmark_qmoe.py

onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc

This cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|-------------|
| decd177 | #27090 | Fix GatherND division by zero when batch dimensions mismatch |
| 55f8234 | #27360 | Fix QMoE CPU Operator |
| df9146f | #27403 | [MLAS] Adding DynamicQGemm function pointers and ukernel interface |
| 0f93853 | #27318 | [js/web] Use embedded WASM module in Blob URL workers when wasmBinary is provided |
| b2a6e69 | #27364 | QMoE CPU Performance Update (Up to 4x on 4-bit) |
| f501e1d | #27413 | Fix refcount bug in map input conversion that caused shutdown segfault |
| b32b205 | #27421 | Fix error where bytes is not assigned for dynamic qgemm pack b size |
| 426b006 | #27397 | Fix DllImportResolver |
| 0982844 | #27412 | MatmulNBits prepacking scales fix |
| 9afb0d2 | #27430 | Fix validation for external data paths for models loaded from bytes |
| 71d2cd0 | #27401 | Enable Python 3.14 CI and Upgrade Dependencies |
| 79e0676 | #27419 | fix: out of bounds access for resize operation |
| 82eb99c | #27459 | Fix SkipLayerNorm fusion incorrectly applied when gamma/beta are not 1D |
| 355278a | #27444 | Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write |
| cf96123 | #27411 | [web] fix usage of wasmBinary together with a blob URL for .mjs |
| 1131a86 | #27399 | [web] remove the unhelpful "Unknown CPU vendor" warning. |
| ffbbc4f | #27316 | Build Windows ARM64X binaries as part of packaging pipeline |

---------

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Co-authored-by: patryk-kaiser-ARM <patryk.kaiser@arm.com>
Co-authored-by: don <70039285+0-don@users.noreply.github.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Lukas Folle <126877803+lukas-folle-snkeos@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Chaya <cha182350@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Erik <erscor@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

tianleiwu added 5 commits February 16, 2026 19:30

Fix QMoE CPU

ed0478a

prepack and cache to improve QMoE perf

2c83829

move cache to prepack; allow q4gemm for block_size=32

2b25601

Turn on q4 gemm by default if there is no accuracy loss

e9dcd41

refine tests

dec8867

tianleiwu requested review from apsonawane and Copilot February 17, 2026 03:14

Copilot started reviewing on behalf of tianleiwu February 17, 2026 03:14

Copilot AI reviewed Feb 17, 2026

tianleiwu added release:1.24.2 and removed release:1.24.2 labels Feb 17, 2026

tianleiwu added 5 commits February 17, 2026 07:45

refactor CheckInputs

4af1315

fix build

23cdf2d

robust shape storage

6d351fe

Merge remote-tracking branch 'origin/main' into tlwu/20260216/qmoe_cp…

100738b

…u_q4_perf

refactoring

0b4c783

apsonawane reviewed Feb 20, 2026

onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc (outdated, resolved)

apsonawane reviewed Feb 20, 2026

onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc (outdated, resolved)

apsonawane reviewed Feb 20, 2026

onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h (resolved)

review feedback

abef62b

tianleiwu requested review from apsonawane and Copilot February 20, 2026 20:15

Copilot started reviewing on behalf of tianleiwu February 20, 2026 20:16

Copilot AI reviewed Feb 20, 2026

onnxruntime/test/python/transformers/benchmark_qmoe.py (resolved)

onnxruntime/test/python/transformers/benchmark_qmoe.py (resolved)

onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc (outdated, resolved)

address AI feedback

11d30ce

apsonawane approved these changes Feb 20, 2026

tianleiwu enabled auto-merge (squash) February 20, 2026 23:24

tianleiwu merged commit b2a6e69 into main Feb 21, 2026
89 of 90 checks passed

tianleiwu deleted the tlwu/20260216/qmoe_cpu_q4_perf branch February 21, 2026 08:46

tianleiwu added the release:1.24.3 label Feb 27, 2026

tianleiwu mentioned this pull request Feb 27, 2026

ORT 1.24.3 release cherry pick round 1 #27476

Merged

tianleiwu removed the release:1.24.3 label Feb 28, 2026

This was referenced Mar 9, 2026

Bump Microsoft.ML.OnnxRuntime.Gpu from 1.23.2 to 1.24.3 yuniko-software/bge-m3-onnx#66

Closed

deps(nuget): Bump the microsoft-packages group with 2 updates Ellerbach/azure-ai-search-simulator#73

Closed

tianleiwu merged 12 commits into main from tlwu/20260216/qmoe_cpu_q4_perf
