Improve Pre-Packing for 2-bit LUT kernels #27131
Conversation
Pull request overview
Improves SQNBit LUT GEMM pre-packing performance by routing weight/scales packing through new AVX2-optimized implementations, and adds microbenchmarks and expanded unit coverage (including M=1).
Changes:
- Add AVX2 dispatch entry points for quantized-B packing and scales/zero-point packing, and route `MlasLutGemmPack` through them.
- Introduce `onnxruntime_mlas_benchmark` cases for LUT GEMM pack and compute performance.
- Relax short-execute test gating and add `M=1` LUT GEMM test cases.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| onnxruntime/test/mlas/unittest/test_sqlutgemm.cpp | Allows M=1 short-execute runs by relaxing the gating condition and adds M=1 test cases. |
| onnxruntime/test/mlas/bench/bench_lutgemm.cpp | Adds benchmarks for LUT GEMM packing and compute paths with configurable args. |
| onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp | Adds AVX2 implementations for weight packing and scales/ZP packing and wires them into the AVX2 LUT dispatch. |
| onnxruntime/core/mlas/lib/qlutgemm.h | Extends the LUT dispatch struct and defines new function pointer signatures for packing entry points. |
| onnxruntime/core/mlas/lib/qlutgemm.cpp | Refactors scalar pack logic into dispatch calls and threads scales/ZP packing through the thread pool. |
… LUT GEMM functions
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
…ecessary tail bytes
…ons by processing tiles of input values.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
# Conflicts:
#	onnxruntime/core/mlas/lib/qlutgemm.cpp
#	onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp
Description
This PR improves the pre-packing performance for SQNBitGemm LUT (Lookup Table) GEMM operations by replacing scalar implementations with AVX2-optimized kernels, and adds benchmarking infrastructure to measure performance.
AVX2 Optimized Weight Packing
- `PackQuantBData_avx2()` - AVX2-optimized weight packing that performs bit-plane decomposition and multi-reshape/transpose operations using SIMD instructions
- `PackScalesAndZeroPoints_avx2()` - AVX2-optimized scales and zero-points packing with template specialization for `HasZeroPoint` cases
- Both wired into the `MlasLutGenKernelAvx2` dispatch structure
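To make the bit-plane decomposition step concrete, here is a hypothetical scalar reference for the transform that an AVX2 implementation like `PackQuantBData_avx2()` would vectorize with wide shifts and masks. The layout (four LSB-first 2-bit weights per byte, one bit per weight in each output plane) is an illustrative assumption, not the actual MLAS packing format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference for 2-bit bit-plane decomposition.
// Each input byte holds four 2-bit weights (LSB-first); we split them
// into a low-bit plane and a high-bit plane, one bit per weight,
// packed eight weights per output byte. An AVX2 version would apply
// the same shifts/masks to 32 bytes at a time.
struct BitPlanes {
    std::vector<uint8_t> low;   // bit 0 of every weight
    std::vector<uint8_t> high;  // bit 1 of every weight
};

BitPlanes DecomposeBitPlanes(const uint8_t* packed, size_t num_weights) {
    BitPlanes out;
    out.low.assign((num_weights + 7) / 8, 0);
    out.high.assign((num_weights + 7) / 8, 0);
    for (size_t i = 0; i < num_weights; ++i) {
        // Extract the i-th 2-bit weight from the packed stream.
        uint8_t w = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
        out.low[i / 8] |= static_cast<uint8_t>((w & 1) << (i % 8));
        out.high[i / 8] |= static_cast<uint8_t>(((w >> 1) & 1) << (i % 8));
    }
    return out;
}
```

Splitting each 2-bit weight into two one-bit planes is what lets a LUT kernel reduce the GEMM inner loop to table lookups per plane.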
Refactored Dispatch Architecture
- Refactored `qlutgemm.cpp` to a dispatch-based architecture
- Added the `MLAS_QNBIT_LUT_PACK_QUANTB_DATA` and `MLAS_QNBIT_LUT_PACK_SCALES_AND_ZP` function-pointer signatures
- Extended the `MLAS_QNBIT_LUT_GEMM_DISPATCH` structure with `PackQuantBData` and `PackScalesAndZeroPoints` function pointers
- Threaded scales/zero-point packing through the thread pool via `LutPackScalesAndZeroPoints()`
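The dispatch pattern described above can be sketched as follows. The names and signatures here are illustrative stand-ins, not the actual `MLAS_QNBIT_LUT_GEMM_DISPATCH` definition: the platform-neutral pack entry point calls through a function pointer that platform init code (e.g. the AVX2 path) fills in, with a scalar fallback otherwise.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a function-pointer dispatch structure in the
// style of MLAS_QNBIT_LUT_GEMM_DISPATCH. Real MLAS signatures differ.
typedef void (*PackQuantBDataFn)(const uint8_t* src, uint8_t* dst, size_t count);

struct LutGemmDispatch {
    PackQuantBDataFn PackQuantBData = nullptr;  // set by platform init (e.g. AVX2)
};

// Scalar fallback used when no SIMD implementation is registered.
static void PackQuantBData_scalar(const uint8_t* src, uint8_t* dst, size_t count) {
    for (size_t i = 0; i < count; ++i) dst[i] = src[i];
}

// Platform-neutral entry point: route through the dispatch pointer if
// one was registered, otherwise fall back to the scalar path.
void LutGemmPack(const LutGemmDispatch& d, const uint8_t* src, uint8_t* dst, size_t count) {
    (d.PackQuantBData ? d.PackQuantBData : PackQuantBData_scalar)(src, dst, count);
}
```

The benefit of this shape is that callers like `MlasLutGemmPack` stay ISA-agnostic; adding a new SIMD pack kernel only requires assigning one pointer in the platform dispatch table.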
Benchmarking
- Added a `LUTGEMM_PACK` benchmark for measuring weight packing performance
- Added a `LUTGEMM_COMPUTE` benchmark for measuring GEMM compute performance
- Benchmark arguments: `BlkLen`, `M`, `N`, `K`, `Threads`, `HasZeroPoint`
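For readers unfamiliar with `onnxruntime_mlas_benchmark`, a minimal standalone timing sketch of what a pack benchmark measures is shown below. This deliberately avoids the actual Google Benchmark harness and the real `MlasLutGemmPack` call; the buffer sizing (four 2-bit weights per byte) and the copy stand-in are assumptions for illustration only.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal standalone sketch of a pack-throughput measurement:
// repeatedly "pack" an N x K 2-bit weight matrix and report the mean
// per-iteration time in milliseconds. A real benchmark would invoke
// the MLAS pack API instead of the std::copy stand-in.
double TimePackMsPerIter(size_t N, size_t K, int iters) {
    std::vector<uint8_t> src(N * K / 4, 0x5A);  // 4 weights per byte
    std::vector<uint8_t> dst(src.size());
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        std::copy(src.begin(), src.end(), dst.begin());  // stand-in for the pack call
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}
```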
Test Updates
- Relaxed the short-execute gating condition from `M < BlkLen || N < BlkLen` to `N < BlkLen` to allow `M=1` cases
- Added `M=1` test configurations (`1x128x128`, `1x1024x1024`)
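The gating relaxation can be expressed as a small predicate. This is a hypothetical restatement of the condition change, not the actual test harness code; the helper name is invented for illustration.

```cpp
#include <cstddef>

// Hypothetical predicate capturing the relaxed short-execute gating:
// previously a shape ran only when both M >= BlkLen and N >= BlkLen
// (skip if `M < BlkLen || N < BlkLen`); now only N is gated, so
// GEMV-style M=1 shapes such as 1x128x128 are exercised.
bool ShouldRunShortExecute(size_t M, size_t N, size_t BlkLen) {
    (void)M;  // M no longer participates in the gate
    return N >= BlkLen;
}
```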