feat: perf opt dma phase2 #57
Commits (titles truncated by the page):
- … and updating signatures for improved performance
- …ze vector operations
- … a mask parameter for improved block handling
- … add HVX_VectorPred_x3 type alias
- … improved clarity and type safety
- … better memory efficiency
- Merge commit, conflicts resolved in: `ggml/src/ggml-qnn/npu/device/dma_transfer.cpp`, `ggml/src/ggml-qnn/npu/device/op/op_mul_mat.cpp`
- …ng and removing unused parameters
Pull Request Overview
This PR implements Phase 2 of DMA-driven performance optimizations for the Hexagon NPU backend, building on previous DMA work to further improve memory access patterns and computational overlap. The changes focus on expanding DMA pipeline coverage to additional hot paths and improving buffering strategies.
Key changes include:
- Added support for 6-block processing (`load_hexa_block_generic` function and `HVX_Vector_x5` type)
- Refactored quantization block loading to use the generalized `load_struct_into_vector` helper
- Improved DMA transfer initialization with a templated `init_dma_transfer` function
- Enhanced buffer management with better alignment and caching strategies for both quantized and non-quantized data paths
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `ggml/src/ggml-qnn/npu/device/vec_ops.hpp` | Added `HVX_Vector_x5` type definition for 5-element vector packs |
| `ggml/src/ggml-qnn/npu/device/type_traits.cpp` | Enhanced quantization block loading with 6-block support and refactored existing dual/quad block functions |
| `ggml/src/ggml-qnn/npu/device/op/op_mul_mat.cpp` | Refactored matrix multiplication implementation with improved DMA handling and unified buffer management |
| `ggml/src/ggml-qnn/npu/device/dma_transfer.cpp` | Changed DMA descriptor ordering from `DESC_ORDER_ORDER` to `DESC_ORDER_NOORDER` |
Related to #34
Related to #51
Overview
Key Changes
Performance
Test setup: 8gen2
Test suite: `test-backend-ops`
Compile: `llama-cpp-qnn-builder/docker/docker_compose_compile.sh -r --hexagon-npu-only --enable-dequant`
Summary
q4_0 (improved at small-to-medium n)
q8_0 (large gains at small-to-medium n)
Notes
Unit tests
Test setup: 8gen2
Test suite: `test-backend-ops`
Compile: `llama-cpp-qnn-builder/docker/docker_compose_compile.sh -r --hexagon-npu-only --enable-dequant`
Log: `test-backend-ops-all.release.hexagon.dddef701a`
Special thanks