
Conversation

@chraac chraac commented Aug 25, 2025

Related to #34
Related to #51

Overview

  • Introduces a lookup-table (LUT) based dequantization path to reduce per-element arithmetic and speed up matmul on quantized weights.
  • Applies to low-bit quant formats where code values can be mapped via compact per-block LUTs (e.g., 4/5/8-bit).
  • Goal: higher tokens/s and lower latency, with no accuracy regression.

Key Changes

  • Added LUT generation per quant block using existing block metadata (e.g., scale/zero or equivalent).
  • Integrated a LUT-based dequant fast path for supported quant formats.
  • Safe fallback to legacy dequant when the format is unsupported or the LUT path is disabled.
  • Kept hot loops branch-light and memory-access contiguous to help vectorization and caching.
  • Added guards/tests to validate parity with the legacy path.
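The per-block LUT idea above can be sketched in scalar form. This is an illustrative sketch only (the PR's actual implementation uses HVX intrinsics): in a q4_0-style block each 4-bit code dequantizes to `scale * (code - 8)`, so a 16-entry table built once per block replaces a per-element subtract-and-multiply with a single table load.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical scalar sketch of LUT-based dequantization for a
// q4_0-style block. The 16-entry table is computed once per block
// from the block's scale; the hot loop then does one lookup per code.
std::vector<float> dequant_block_lut(float scale, const std::vector<uint8_t> & codes) {
    float lut[16];
    for (int i = 0; i < 16; ++i) {
        lut[i] = scale * static_cast<float>(i - 8);  // built once per block
    }
    std::vector<float> out;
    out.reserve(codes.size());
    for (uint8_t c : codes) {
        out.push_back(lut[c & 0x0F]);  // per element: one load, no arithmetic
    }
    return out;
}
```

The same shape vectorizes well: the loop body is branch-free and the table fits comfortably in registers or L1, which is what keeps the hot loop "branch-light and memory-access contiguous".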

Performance

Test setup: 8gen2
Test suite: test-backend-ops

| Matrix Type | Baseline (0979133) | Optimized (38935b6) | Improvement |
| --- | --- | --- | --- |
| q4_0 | 16.52 GFLOPS | 23.54 GFLOPS | +42.5% |
| q4_K | 0.196 GFLOPS | 2.38 GFLOPS | 12.2x |
  • q4_K

    | n | Baseline (GFLOPS) | Optimized (GFLOPS) | Speedup |
    | --- | --- | --- | --- |
    | 1 | 0.20 | 2.38 | 12.2x |
    | 2 | 0.39 | 4.61 | 11.8x |
    | 3 | 0.58 | 6.69 | 11.5x |
    | 4 | 0.78 | 8.65 | 11.1x |
    | 5 | 0.97 | 10.49 | 10.8x |
    | 8 | 1.53 | 15.39 | 10.1x |
    | 512 | 35.86 | 56.07 | 1.56x |
  • q4_0

    | n | Baseline (GFLOPS) | Optimized (GFLOPS) | Speedup |
    | --- | --- | --- | --- |
    | 1 | 16.52 | 23.54 | 1.43x |
    | 2 | 26.82 | 36.15 | 1.35x |
    | 3 | 33.65 | 43.65 | 1.30x |
    | 4 | 38.50 | 48.55 | 1.26x |
    | 5 | 42.16 | 52.00 | 1.23x |
    | 8 | 48.65 | 57.67 | 1.19x |
    | 512 | 55.85 | 58.62 | 1.05x |

Notes

Unit tests

Test setup: 8gen2
Test suite: test-backend-ops

```
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK
```

test-backend-ops-all.release.hexagon.38935b67e.7z

@chraac chraac requested a review from Copilot August 25, 2025 11:49
@chraac chraac self-assigned this Aug 25, 2025
@chraac chraac added enhancement New feature or request hexagon-npu labels Aug 25, 2025

Copilot AI left a comment
Pull Request Overview

This PR introduces significant optimizations to the QNN/NPU backend of llama.cpp, focusing on improving quantized tensor operations and overall system performance. The main purpose is to implement lookup table (LUT) based dequantization to reduce per-element arithmetic and accelerate matrix multiplication on quantized weights.

Key changes include:

  • Implementation of LUT-based dequantization for q4_0 and q4_K quantization formats
  • Power management optimizations and thread stack size configuration
  • Code refactoring and performance improvements across vector operations
  • Enhanced error handling and logging improvements
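The LUT fast path with its legacy fallback can be sketched as a simple dispatch. Names here are hypothetical, not the PR's actual API: the fast path is taken only when the format is supported and the LUT path is enabled, and everything else falls through to the legacy dequant.

```cpp
#include <string>

// Illustrative-only dispatch sketch for the LUT fast path with a safe
// legacy fallback. QuantType and the function names are assumptions
// for illustration, not identifiers from this PR.
enum class QuantType { Q4_0, Q4_K, Q8_0 };

bool lut_supported(QuantType t) {
    // Only the formats with a LUT implementation qualify.
    return t == QuantType::Q4_0 || t == QuantType::Q4_K;
}

std::string pick_dequant_path(QuantType t, bool lut_enabled) {
    if (lut_enabled && lut_supported(t)) {
        return "lut";     // fast path: per-block table lookup
    }
    return "legacy";      // safe fallback for unsupported formats
}
```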

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-qnn/npu/idl/hexagon_npu.idl | Adds thread stack size constant for NPU configuration |
| ggml/src/ggml-qnn/npu/host/util.hpp | Declares new function for setting FastRPC stack size |
| ggml/src/ggml-qnn/npu/host/util.cpp | Implements stack size configuration and improves code formatting |
| ggml/src/ggml-qnn/npu/host/tensor.hpp | Refactors include order and improves handle validation |
| ggml/src/ggml-qnn/npu/host/host_device.cpp | Adds stack size configuration during device initialization |
| ggml/src/ggml-qnn/npu/host/graph.cpp | Enhances debug logging for graph computation |
| ggml/src/ggml-qnn/npu/device/vec_ops.inl | Optimizes vector operations with better instruction scheduling |
| ggml/src/ggml-qnn/npu/device/vec_ops.hpp | Refactors vector type definitions and adds new math functions |
| ggml/src/ggml-qnn/npu/device/vec_math.inl | Implements infinity guard functions for mathematical operations |
| ggml/src/ggml-qnn/npu/device/util.hpp | Replaces FARF logging with custom logging functions |
| ggml/src/ggml-qnn/npu/device/type_traits.hpp | Updates dequantization function signatures for LUT support |
| ggml/src/ggml-qnn/npu/device/type_traits.cpp | Implements LUT-based dequantization for q4_0 and q4_K formats |
| ggml/src/ggml-qnn/npu/device/thread_pool.hpp | Updates thread creation with proper stack size and logging |
| ggml/src/ggml-qnn/npu/device/tensor.hpp | Improves logging format and error handling |
| ggml/src/ggml-qnn/npu/device/op_rope.cpp | Fixes format specifier for int64 values |
| ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp | Integrates LUT-based dequantization into matrix operations |
| ggml/src/ggml-qnn/npu/device/op_impl.cpp | Refactors GLU operations into separate module |
| ggml/src/ggml-qnn/npu/device/op_glu.hpp | New header for GLU operation declarations |
| ggml/src/ggml-qnn/npu/device/op_glu.cpp | New implementation of GLU operations with improved error handling |
| ggml/src/ggml-qnn/npu/device/op_flash_attn.cpp | Optimizes flash attention with template specialization for mask handling |
| ggml/src/ggml-qnn/npu/device/graph.cpp | Adds documentation comment for thread synchronization |
| ggml/src/ggml-qnn/npu/device/device.cpp | Integrates power management utilities and improves initialization |
| ggml/src/ggml-qnn/npu/CMakeLists.txt | Updates build configuration to use shared libraries |

Comment on lines +1176 to +1181
```cpp
HVX_VectorPred pred_inf = Q6_Q_vcmp_eq_VwVw(denom, inf);

// TODO: fix the inf in div
HVX_Vector out = qhmath_hvx_div_vf(num, denom, coeffs);

out = Q6_V_vmux_QVV(pred_inf, Q6_V_vzero(), out);
```
Copilot AI Aug 27, 2025
This TODO comment indicates a known issue with infinity handling in the division function. The guard function is implemented as a workaround, but the underlying division function should be fixed to properly handle infinity values.

Suggested change

```diff
-HVX_VectorPred pred_inf = Q6_Q_vcmp_eq_VwVw(denom, inf);
-// TODO: fix the inf in div
-HVX_Vector out = qhmath_hvx_div_vf(num, denom, coeffs);
-out = Q6_V_vmux_QVV(pred_inf, Q6_V_vzero(), out);
+// Define a NaN constant (quiet NaN for IEEE 754 float)
+const HVX_Vector nan = Q6_V_vsplat_R(0x7fc00000);
+HVX_VectorPred pred_num_inf = Q6_Q_vcmp_eq_VwVw(num, inf);
+HVX_VectorPred pred_denom_inf = Q6_Q_vcmp_eq_VwVw(denom, inf);
+HVX_Vector out = qhmath_hvx_div_vf(num, denom, coeffs);
+// If denom == inf and num != inf, result is 0
+HVX_VectorPred pred_denom_inf_num_not_inf = Q6_Q_vand_QQ(pred_denom_inf, Q6_Q_vnot_Q(pred_num_inf));
+out = Q6_V_vmux_QVV(pred_denom_inf_num_not_inf, Q6_V_vzero(), out);
+// If denom == inf and num == inf, result is NaN
+HVX_VectorPred pred_both_inf = Q6_Q_vand_QQ(pred_denom_inf, pred_num_inf);
+out = Q6_V_vmux_QVV(pred_both_inf, nan, out);
+// If denom != inf and num == inf, result is inf
+HVX_VectorPred pred_num_inf_denom_not_inf = Q6_Q_vand_QQ(pred_num_inf, Q6_Q_vnot_Q(pred_denom_inf));
+out = Q6_V_vmux_QVV(pred_num_inf_denom_not_inf, inf, out);
```

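The special cases the review comment is guarding against follow directly from IEEE-754 float division semantics. A scalar illustration (assuming standard IEEE-754 floats, no fast-math):

```cpp
#include <cmath>
#include <limits>

// Scalar illustration of the three special cases a vectorized
// infinity guard must reproduce for IEEE-754 float division:
// finite/inf -> 0, inf/inf -> NaN, inf/finite -> inf.
bool div_matches_ieee_special_cases() {
    const float inf = std::numeric_limits<float>::infinity();
    bool finite_over_inf = (1.0f / inf) == 0.0f;   // finite / inf == 0
    bool inf_over_inf    = std::isnan(inf / inf);  // inf / inf is NaN
    bool inf_over_finite = std::isinf(inf / 2.0f); // inf / finite stays inf
    return finite_over_inf && inf_over_inf && inf_over_finite;
}
```

The original guard in the diff only handles the first case (zeroing the result when `denom == inf`); the suggestion extends it to the NaN and infinity cases as well.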
@chraac chraac merged commit 5ef9b98 into dev-refactoring Aug 29, 2025
1 check failed