
Conversation

@chraac chraac commented Aug 25, 2025

Related to #34
Related to #51

Overview

  • Introduces a lookup-table (LUT) based dequantization path to reduce per-element arithmetic and speed up matmul on quantized weights.
  • Applies to low-bit quant formats where code values can be mapped via compact per-block LUTs (e.g., 4/5/8-bit).
  • Goal: higher tokens/s and lower latency, with no accuracy regression.

Key Changes

  • Added LUT generation per quant block using existing block metadata (e.g., scale/zero or equivalent).
  • Integrated a LUT-based dequant fast path for supported quant formats.
  • Safe fallback to legacy dequant when the format is unsupported or the LUT path is disabled.
  • Kept hot loops branch-light and memory-access contiguous to help vectorization and caching.
  • Added guards/tests to validate parity with the legacy path.
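The per-block LUT idea above can be sketched in scalar form. This is an illustrative sketch only (the PR's actual implementation uses HVX intrinsics): in a q4_0-style block each 4-bit code dequantizes to `scale * (code - 8)`, so a 16-entry table built once per block replaces a per-element subtract-and-multiply with a single table load.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical scalar sketch of LUT-based dequantization for a
// q4_0-style block. The 16-entry table is computed once per block
// from the block's scale; the hot loop then does one lookup per code.
std::vector<float> dequant_block_lut(float scale, const std::vector<uint8_t> & codes) {
    float lut[16];
    for (int i = 0; i < 16; ++i) {
        lut[i] = scale * static_cast<float>(i - 8);  // built once per block
    }
    std::vector<float> out;
    out.reserve(codes.size());
    for (uint8_t c : codes) {
        out.push_back(lut[c & 0x0F]);  // per element: one load, no arithmetic
    }
    return out;
}
```

The same shape vectorizes well: the loop body is branch-free and the table fits comfortably in registers or L1, which is what keeps the hot loop "branch-light and memory-access contiguous".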

Performance

Test setup: 8gen2
Test suite: test-backend-ops

| Matrix Type | Baseline (0979133) | Optimized (38935b6) | Improvement |
| --- | --- | --- | --- |
| q4_0 | 16.52 GFLOPS | 23.54 GFLOPS | +42.5% |
| q4_K | 0.196 GFLOPS | 2.38 GFLOPS | 12.2x |
  • q4_K

    | n | Baseline (GFLOPS) | Optimized (GFLOPS) | Speedup |
    | --- | --- | --- | --- |
    | 1 | 0.20 | 2.38 | 12.2x |
    | 2 | 0.39 | 4.61 | 11.8x |
    | 3 | 0.58 | 6.69 | 11.5x |
    | 4 | 0.78 | 8.65 | 11.1x |
    | 5 | 0.97 | 10.49 | 10.8x |
    | 8 | 1.53 | 15.39 | 10.1x |
    | 512 | 35.86 | 56.07 | 1.56x |
  • q4_0

    | n | Baseline (GFLOPS) | Optimized (GFLOPS) | Speedup |
    | --- | --- | --- | --- |
    | 1 | 16.52 | 23.54 | 1.43x |
    | 2 | 26.82 | 36.15 | 1.35x |
    | 3 | 33.65 | 43.65 | 1.30x |
    | 4 | 38.50 | 48.55 | 1.26x |
    | 5 | 42.16 | 52.00 | 1.23x |
    | 8 | 48.65 | 57.67 | 1.19x |
    | 512 | 55.85 | 58.62 | 1.05x |

Notes

Unit tests

Test setup: 8gen2
Test suite: test-backend-ops

```
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK
```

test-backend-ops-all.release.hexagon.38935b67e.7z

@chraac chraac requested a review from Copilot August 25, 2025 11:49
@chraac chraac self-assigned this Aug 25, 2025
@chraac chraac added enhancement New feature or request hexagon-npu labels Aug 25, 2025

Copilot AI left a comment
Pull Request Overview

This PR introduces significant optimizations to the QNN/NPU backend of llama.cpp, focusing on improving quantized tensor operations and overall system performance. The main purpose is to implement lookup table (LUT) based dequantization to reduce per-element arithmetic and accelerate matrix multiplication on quantized weights.

Key changes include:

  • Implementation of LUT-based dequantization for q4_0 and q4_K quantization formats
  • Power management optimizations and thread stack size configuration
  • Code refactoring and performance improvements across vector operations
  • Enhanced error handling and logging improvements
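The LUT fast path with its legacy fallback can be sketched as a simple dispatch. Names here are hypothetical, not the PR's actual API: the fast path is taken only when the format is supported and the LUT path is enabled, and everything else falls through to the legacy dequant.

```cpp
#include <string>

// Illustrative-only dispatch sketch for the LUT fast path with a safe
// legacy fallback. QuantType and the function names are assumptions
// for illustration, not identifiers from this PR.
enum class QuantType { Q4_0, Q4_K, Q8_0 };

bool lut_supported(QuantType t) {
    // Only the formats with a LUT implementation qualify.
    return t == QuantType::Q4_0 || t == QuantType::Q4_K;
}

std::string pick_dequant_path(QuantType t, bool lut_enabled) {
    if (lut_enabled && lut_supported(t)) {
        return "lut";     // fast path: per-block table lookup
    }
    return "legacy";      // safe fallback for unsupported formats
}
```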

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-qnn/npu/idl/hexagon_npu.idl | Adds thread stack size constant for NPU configuration |
| ggml/src/ggml-qnn/npu/host/util.hpp | Declares new function for setting FastRPC stack size |
| ggml/src/ggml-qnn/npu/host/util.cpp | Implements stack size configuration and improves code formatting |
| ggml/src/ggml-qnn/npu/host/tensor.hpp | Refactors include order and improves handle validation |
| ggml/src/ggml-qnn/npu/host/host_device.cpp | Adds stack size configuration during device initialization |
| ggml/src/ggml-qnn/npu/host/graph.cpp | Enhances debug logging for graph computation |
| ggml/src/ggml-qnn/npu/device/vec_ops.inl | Optimizes vector operations with better instruction scheduling |
| ggml/src/ggml-qnn/npu/device/vec_ops.hpp | Refactors vector type definitions and adds new math functions |
| ggml/src/ggml-qnn/npu/device/vec_math.inl | Implements infinity guard functions for mathematical operations |
| ggml/src/ggml-qnn/npu/device/util.hpp | Replaces FARF logging with custom logging functions |
| ggml/src/ggml-qnn/npu/device/type_traits.hpp | Updates dequantization function signatures for LUT support |
| ggml/src/ggml-qnn/npu/device/type_traits.cpp | Implements LUT-based dequantization for q4_0 and q4_K formats |
| ggml/src/ggml-qnn/npu/device/thread_pool.hpp | Updates thread creation with proper stack size and logging |
| ggml/src/ggml-qnn/npu/device/tensor.hpp | Improves logging format and error handling |
| ggml/src/ggml-qnn/npu/device/op_rope.cpp | Fixes format specifier for int64 values |
| ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp | Integrates LUT-based dequantization into matrix operations |
| ggml/src/ggml-qnn/npu/device/op_impl.cpp | Refactors GLU operations into separate module |
| ggml/src/ggml-qnn/npu/device/op_glu.hpp | New header for GLU operation declarations |
| ggml/src/ggml-qnn/npu/device/op_glu.cpp | New implementation of GLU operations with improved error handling |
| ggml/src/ggml-qnn/npu/device/op_flash_attn.cpp | Optimizes flash attention with template specialization for mask handling |
| ggml/src/ggml-qnn/npu/device/graph.cpp | Adds documentation comment for thread synchronization |
| ggml/src/ggml-qnn/npu/device/device.cpp | Integrates power management utilities and improves initialization |
| ggml/src/ggml-qnn/npu/CMakeLists.txt | Updates build configuration to use shared libraries |

Comment on lines +1176 to +1181
```cpp
HVX_VectorPred pred_inf = Q6_Q_vcmp_eq_VwVw(denom, inf);

// TODO: fix the inf in div
HVX_Vector out = qhmath_hvx_div_vf(num, denom, coeffs);

out = Q6_V_vmux_QVV(pred_inf, Q6_V_vzero(), out);
```
Copilot AI Aug 27, 2025
This TODO comment indicates a known issue with infinity handling in the division function. The guard function is implemented as a workaround, but the underlying division function should be fixed to properly handle infinity values.

Suggested change

```diff
-HVX_VectorPred pred_inf = Q6_Q_vcmp_eq_VwVw(denom, inf);
-// TODO: fix the inf in div
-HVX_Vector out = qhmath_hvx_div_vf(num, denom, coeffs);
-out = Q6_V_vmux_QVV(pred_inf, Q6_V_vzero(), out);
+// Define a NaN constant (quiet NaN for IEEE 754 float)
+const HVX_Vector nan = Q6_V_vsplat_R(0x7fc00000);
+HVX_VectorPred pred_num_inf = Q6_Q_vcmp_eq_VwVw(num, inf);
+HVX_VectorPred pred_denom_inf = Q6_Q_vcmp_eq_VwVw(denom, inf);
+HVX_Vector out = qhmath_hvx_div_vf(num, denom, coeffs);
+// If denom == inf and num != inf, result is 0
+HVX_VectorPred pred_denom_inf_num_not_inf = Q6_Q_vand_QQ(pred_denom_inf, Q6_Q_vnot_Q(pred_num_inf));
+out = Q6_V_vmux_QVV(pred_denom_inf_num_not_inf, Q6_V_vzero(), out);
+// If denom == inf and num == inf, result is NaN
+HVX_VectorPred pred_both_inf = Q6_Q_vand_QQ(pred_denom_inf, pred_num_inf);
+out = Q6_V_vmux_QVV(pred_both_inf, nan, out);
+// If denom != inf and num == inf, result is inf
+HVX_VectorPred pred_num_inf_denom_not_inf = Q6_Q_vand_QQ(pred_num_inf, Q6_Q_vnot_Q(pred_denom_inf));
+out = Q6_V_vmux_QVV(pred_num_inf_denom_not_inf, inf, out);
```

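The special cases the review comment is guarding against follow directly from IEEE-754 float division semantics. A scalar illustration (assuming standard IEEE-754 floats, no fast-math):

```cpp
#include <cmath>
#include <limits>

// Scalar illustration of the three special cases a vectorized
// infinity guard must reproduce for IEEE-754 float division:
// finite/inf -> 0, inf/inf -> NaN, inf/finite -> inf.
bool div_matches_ieee_special_cases() {
    const float inf = std::numeric_limits<float>::infinity();
    bool finite_over_inf = (1.0f / inf) == 0.0f;   // finite / inf == 0
    bool inf_over_inf    = std::isnan(inf / inf);  // inf / inf is NaN
    bool inf_over_finite = std::isinf(inf / 2.0f); // inf / finite stays inf
    return finite_over_inf && inf_over_inf && inf_over_finite;
}
```

The original guard in the diff only handles the first case (zeroing the result when `denom == inf`); the suggestion extends it to the NaN and infinity cases as well.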
@chraac chraac merged commit 5ef9b98 into dev-refactoring Aug 29, 2025
1 check failed