feat: dequant use lut #55
Conversation
Commits:
- … and updating signatures for improved performance
- …ze vector operations
- … a mask parameter for improved block handling
- … add HVX_VectorPred_x3 type alias
- …improved performance
- …ing block handling
- …proved vector loading
- …ng and updating lookup methods
Pull Request Overview
This PR introduces significant optimizations to the QNN/NPU backend of llama.cpp, focusing on improving quantized tensor operations and overall system performance. The main purpose is to implement lookup table (LUT) based dequantization to reduce per-element arithmetic and accelerate matrix multiplication on quantized weights.
Key changes include:
- Implementation of LUT-based dequantization for the q4_0 and q4_K quantization formats (see the sketch after this list)
- Power management optimizations and thread stack size configuration
- Code refactoring and performance improvements across vector operations
- Enhanced error handling and logging improvements
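To make the LUT idea concrete, here is a minimal scalar sketch for q4_0, assuming the standard ggml block layout (32 4-bit quants sharing one per-block scale, with `x = d * (q - 8)`). The struct and function names are illustrative only; the PR's actual implementation in `type_traits.cpp` is vectorized with HVX `vlut` intrinsics rather than a scalar loop:

```cpp
#include <cstdint>

constexpr int QK4_0 = 32;

// Illustrative block layout; real ggml block_q4_0 stores the scale as fp16.
struct block_q4_0_ex {
    float   d;             // per-block scale
    uint8_t qs[QK4_0 / 2]; // 32 4-bit quants, packed two per byte
};

// Dequantize one block. A 16-entry table of d*(q-8) is built once per
// block, so each element becomes a single table lookup instead of a
// subtract-and-multiply; on HVX that lookup maps onto vlut instructions.
void dequant_block_q4_0_lut(const block_q4_0_ex * b, float * dst) {
    float lut[16];
    for (int v = 0; v < 16; ++v) {
        lut[v] = b->d * (float) (v - 8); // all 16 possible output values
    }
    for (int j = 0; j < QK4_0 / 2; ++j) {
        dst[j]             = lut[b->qs[j] & 0x0F]; // low nibbles: elements 0..15
        dst[j + QK4_0 / 2] = lut[b->qs[j] >> 4];   // high nibbles: elements 16..31
    }
}
```

The table amortizes the per-element arithmetic across the whole block, which is where the speedup in the quantized matrix multiplication comes from.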
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| ggml/src/ggml-qnn/npu/idl/hexagon_npu.idl | Adds thread stack size constant for NPU configuration |
| ggml/src/ggml-qnn/npu/host/util.hpp | Declares new function for setting FastRPC stack size |
| ggml/src/ggml-qnn/npu/host/util.cpp | Implements stack size configuration and improves code formatting |
| ggml/src/ggml-qnn/npu/host/tensor.hpp | Refactors include order and improves handle validation |
| ggml/src/ggml-qnn/npu/host/host_device.cpp | Adds stack size configuration during device initialization |
| ggml/src/ggml-qnn/npu/host/graph.cpp | Enhances debug logging for graph computation |
| ggml/src/ggml-qnn/npu/device/vec_ops.inl | Optimizes vector operations with better instruction scheduling |
| ggml/src/ggml-qnn/npu/device/vec_ops.hpp | Refactors vector type definitions and adds new math functions |
| ggml/src/ggml-qnn/npu/device/vec_math.inl | Implements infinity guard functions for mathematical operations |
| ggml/src/ggml-qnn/npu/device/util.hpp | Replaces FARF logging with custom logging functions |
| ggml/src/ggml-qnn/npu/device/type_traits.hpp | Updates dequantization function signatures for LUT support |
| ggml/src/ggml-qnn/npu/device/type_traits.cpp | Implements LUT-based dequantization for q4_0 and q4_K formats |
| ggml/src/ggml-qnn/npu/device/thread_pool.hpp | Updates thread creation with proper stack size and logging |
| ggml/src/ggml-qnn/npu/device/tensor.hpp | Improves logging format and error handling |
| ggml/src/ggml-qnn/npu/device/op_rope.cpp | Fixes format specifier for int64 values |
| ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp | Integrates LUT-based dequantization into matrix operations |
| ggml/src/ggml-qnn/npu/device/op_impl.cpp | Refactors GLU operations into separate module |
| ggml/src/ggml-qnn/npu/device/op_glu.hpp | New header for GLU operation declarations |
| ggml/src/ggml-qnn/npu/device/op_glu.cpp | New implementation of GLU operations with improved error handling |
| ggml/src/ggml-qnn/npu/device/op_flash_attn.cpp | Optimizes flash attention with template specialization for mask handling (see the sketch after this table) |
| ggml/src/ggml-qnn/npu/device/graph.cpp | Adds documentation comment for thread synchronization |
| ggml/src/ggml-qnn/npu/device/device.cpp | Integrates power management utilities and improves initialization |
| ggml/src/ggml-qnn/npu/CMakeLists.txt | Updates build configuration to use shared libraries |
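As a rough illustration of the mask-specialization pattern mentioned for op_flash_attn.cpp (hypothetical names, not the PR's code): hoisting the mask check into a compile-time template parameter lets the compiler emit a branch-free inner loop for the no-mask case, instead of testing the mask pointer per element.

```cpp
// Mask presence decided at compile time; if constexpr removes the dead branch.
template <bool kHasMask>
void flash_attn_row(const float * scores, const float * mask, float * out, int n) {
    for (int i = 0; i < n; ++i) {
        float s = scores[i];
        if constexpr (kHasMask) {
            s += mask[i]; // additive attention mask, as in ggml
        }
        out[i] = s;
    }
}

// Dispatch once per call rather than branching inside the hot loop.
void flash_attn_row_dispatch(const float * scores, const float * mask, float * out, int n) {
    if (mask) {
        flash_attn_row<true>(scores, mask, out, n);
    } else {
        flash_attn_row<false>(scores, nullptr, out, n);
    }
}
```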
```cpp
HVX_VectorPred pred_inf = Q6_Q_vcmp_eq_VwVw(denom, inf);

// TODO: fix the inf in div
HVX_Vector out = qhmath_hvx_div_vf(num, denom, coeffs);

// Guard: force lanes where denom == inf to zero instead of trusting the div result.
out = Q6_V_vmux_QVV(pred_inf, Q6_V_vzero(), out);
```
Copilot AI commented on Aug 27, 2025
This TODO comment indicates a known issue with infinity handling in the division function. The guard function is implemented as a workaround, but the underlying division function should be fixed to properly handle infinity values.
Suggested change — original:

```cpp
HVX_VectorPred pred_inf = Q6_Q_vcmp_eq_VwVw(denom, inf);
// TODO: fix the inf in div
HVX_Vector out = qhmath_hvx_div_vf(num, denom, coeffs);
out = Q6_V_vmux_QVV(pred_inf, Q6_V_vzero(), out);
```

Proposed replacement:

```cpp
// Define a NaN constant (quiet NaN for IEEE 754 float)
const HVX_Vector nan = Q6_V_vsplat_R(0x7fc00000);
HVX_VectorPred pred_num_inf   = Q6_Q_vcmp_eq_VwVw(num, inf);
HVX_VectorPred pred_denom_inf = Q6_Q_vcmp_eq_VwVw(denom, inf);
HVX_Vector out = qhmath_hvx_div_vf(num, denom, coeffs);
// If denom == inf and num != inf, result is 0
HVX_VectorPred pred_denom_inf_num_not_inf = Q6_Q_vand_QQ(pred_denom_inf, Q6_Q_vnot_Q(pred_num_inf));
out = Q6_V_vmux_QVV(pred_denom_inf_num_not_inf, Q6_V_vzero(), out);
// If denom == inf and num == inf, result is NaN
HVX_VectorPred pred_both_inf = Q6_Q_vand_QQ(pred_denom_inf, pred_num_inf);
out = Q6_V_vmux_QVV(pred_both_inf, nan, out);
// If denom != inf and num == inf, result is inf
HVX_VectorPred pred_num_inf_denom_not_inf = Q6_Q_vand_QQ(pred_num_inf, Q6_Q_vnot_Q(pred_denom_inf));
out = Q6_V_vmux_QVV(pred_num_inf_denom_not_inf, inf, out);
```
Related to #34
Related to #51
Overview
Key Changes
Performance
Test setup: 8gen2
Test suite: test-backend-ops
- q4_K
- q4_0
Notes
Unit tests
Test setup: 8gen2
Test suite: test-backend-ops
Attachment: test-backend-ops-all.release.hexagon.38935b67e.7z