Improve Pre-Packing for 2-bit LUT kernels #27131
Conversation
Pull request overview
Improves SQNBit LUT GEMM pre-packing performance by routing weight/scales packing through new AVX2-optimized implementations, and adds microbenchmarks and expanded unit coverage (including M=1).
Changes:
- Add AVX2 dispatch entry points for quantized-B packing and scales/zero-point packing, and route `MlasLutGemmPack` through them.
- Introduce `onnxruntime_mlas_benchmark` cases for LUT GEMM pack and compute performance.
- Relax short-execute test gating and add `M=1` LUT GEMM test cases.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| onnxruntime/test/mlas/unittest/test_sqlutgemm.cpp | Allows M=1 short-execute runs by relaxing the gating condition and adds M=1 test cases. |
| onnxruntime/test/mlas/bench/bench_lutgemm.cpp | Adds benchmarks for LUT GEMM packing and compute paths with configurable args. |
| onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp | Adds AVX2 implementations for weight packing and scales/ZP packing and wires them into the AVX2 LUT dispatch. |
| onnxruntime/core/mlas/lib/qlutgemm.h | Extends the LUT dispatch struct and defines new function pointer signatures for packing entry points. |
| onnxruntime/core/mlas/lib/qlutgemm.cpp | Refactors scalar pack logic into dispatch calls and threads scales/ZP packing through the thread pool. |
… LUT GEMM functions
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
…ecessary tail bytes
…ons by processing tiles of input values.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
# Conflicts:
#	onnxruntime/core/mlas/lib/qlutgemm.cpp
#	onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp
Description
This PR improves the pre-packing performance for SQNBitGemm LUT (Lookup Table) GEMM operations by replacing scalar implementations with AVX2-optimized kernels, and adds benchmarking infrastructure to measure performance.
AVX2 Optimized Weight Packing
- `PackQuantBData_avx2()` - AVX2-optimized weight packing that performs bit-plane decomposition and multi-reshape/transpose operations using SIMD instructions
- `PackScalesAndZeroPoints_avx2()` - AVX2-optimized scales and zero-points packing with template specialization for `HasZeroPoint` cases
- Both wired into the `MlasLutGenKernelAvx2` dispatch structure
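To make the bit-plane decomposition step concrete, here is a hypothetical scalar reference for the transform that an AVX2 implementation like `PackQuantBData_avx2()` would vectorize with wide shifts and masks. The layout (four LSB-first 2-bit weights per byte, one bit per weight in each output plane) is an illustrative assumption, not the actual MLAS packing format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference for 2-bit bit-plane decomposition.
// Each input byte holds four 2-bit weights (LSB-first); we split them
// into a low-bit plane and a high-bit plane, one bit per weight,
// packed eight weights per output byte. An AVX2 version would apply
// the same shifts/masks to 32 bytes at a time.
struct BitPlanes {
    std::vector<uint8_t> low;   // bit 0 of every weight
    std::vector<uint8_t> high;  // bit 1 of every weight
};

BitPlanes DecomposeBitPlanes(const uint8_t* packed, size_t num_weights) {
    BitPlanes out;
    out.low.assign((num_weights + 7) / 8, 0);
    out.high.assign((num_weights + 7) / 8, 0);
    for (size_t i = 0; i < num_weights; ++i) {
        // Extract the i-th 2-bit weight from the packed stream.
        uint8_t w = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
        out.low[i / 8] |= static_cast<uint8_t>((w & 1) << (i % 8));
        out.high[i / 8] |= static_cast<uint8_t>(((w >> 1) & 1) << (i % 8));
    }
    return out;
}
```

Splitting each 2-bit weight into two one-bit planes is what lets a LUT kernel reduce the GEMM inner loop to table lookups per plane.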
Refactored Dispatch Architecture
- Refactored `qlutgemm.cpp` to a dispatch-based architecture
- Added the `MLAS_QNBIT_LUT_PACK_QUANTB_DATA` and `MLAS_QNBIT_LUT_PACK_SCALES_AND_ZP` function-pointer signatures
- Extended the `MLAS_QNBIT_LUT_GEMM_DISPATCH` structure with `PackQuantBData` and `PackScalesAndZeroPoints` function pointers
- Threaded scales/zero-point packing through the thread pool via `LutPackScalesAndZeroPoints()`
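The dispatch pattern described above can be sketched as follows. The names and signatures here are illustrative stand-ins, not the actual `MLAS_QNBIT_LUT_GEMM_DISPATCH` definition: the platform-neutral pack entry point calls through a function pointer that platform init code (e.g. the AVX2 path) fills in, with a scalar fallback otherwise.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a function-pointer dispatch structure in the
// style of MLAS_QNBIT_LUT_GEMM_DISPATCH. Real MLAS signatures differ.
typedef void (*PackQuantBDataFn)(const uint8_t* src, uint8_t* dst, size_t count);

struct LutGemmDispatch {
    PackQuantBDataFn PackQuantBData = nullptr;  // set by platform init (e.g. AVX2)
};

// Scalar fallback used when no SIMD implementation is registered.
static void PackQuantBData_scalar(const uint8_t* src, uint8_t* dst, size_t count) {
    for (size_t i = 0; i < count; ++i) dst[i] = src[i];
}

// Platform-neutral entry point: route through the dispatch pointer if
// one was registered, otherwise fall back to the scalar path.
void LutGemmPack(const LutGemmDispatch& d, const uint8_t* src, uint8_t* dst, size_t count) {
    (d.PackQuantBData ? d.PackQuantBData : PackQuantBData_scalar)(src, dst, count);
}
```

The benefit of this shape is that callers like `MlasLutGemmPack` stay ISA-agnostic; adding a new SIMD pack kernel only requires assigning one pointer in the platform dispatch table.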
Benchmarking
- Added a `LUTGEMM_PACK` benchmark for measuring weight packing performance
- Added a `LUTGEMM_COMPUTE` benchmark for measuring GEMM compute performance
- Benchmark arguments: `BlkLen`, `M`, `N`, `K`, `Threads`, `HasZeroPoint`
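For readers unfamiliar with `onnxruntime_mlas_benchmark`, a minimal standalone timing sketch of what a pack benchmark measures is shown below. This deliberately avoids the actual Google Benchmark harness and the real `MlasLutGemmPack` call; the buffer sizing (four 2-bit weights per byte) and the copy stand-in are assumptions for illustration only.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal standalone sketch of a pack-throughput measurement:
// repeatedly "pack" an N x K 2-bit weight matrix and report the mean
// per-iteration time in milliseconds. A real benchmark would invoke
// the MLAS pack API instead of the std::copy stand-in.
double TimePackMsPerIter(size_t N, size_t K, int iters) {
    std::vector<uint8_t> src(N * K / 4, 0x5A);  // 4 weights per byte
    std::vector<uint8_t> dst(src.size());
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        std::copy(src.begin(), src.end(), dst.begin());  // stand-in for the pack call
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}
```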
Test Updates
- Relaxed the short-execute gating condition from `M < BlkLen || N < BlkLen` to `N < BlkLen` to allow `M=1` cases
- Added `M=1` test configurations (`1x128x128`, `1x1024x1024`)
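The gating relaxation can be expressed as a small predicate. This is a hypothetical restatement of the condition change, not the actual test harness code; the helper name is invented for illustration.

```cpp
#include <cstddef>

// Hypothetical predicate capturing the relaxed short-execute gating:
// previously a shape ran only when both M >= BlkLen and N >= BlkLen
// (skip if `M < BlkLen || N < BlkLen`); now only N is gated, so
// GEMV-style M=1 shapes such as 1x128x128 are exercised.
bool ShouldRunShortExecute(size_t M, size_t N, size_t BlkLen) {
    (void)M;  // M no longer participates in the gate
    return N >= BlkLen;
}
```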