Conversation

@chraac chraac commented Oct 1, 2025

Related to #34
Related to #51

Overview

  • Phase 2 of the DMA-driven performance work builds on PR #56 (feat: perf opt dma) to further reduce memory stalls and improve the overlap between data movement and compute.

Key Changes

  • Broadened DMA pipeline coverage to additional hot paths (dense and selected quantized/packing routines), with generalized descriptor helpers.
  • Buffering improvements:
    • Tunable double-/triple-buffering per kernel to increase overlap at small n.
    • More alignment-friendly tiling to reduce bank conflicts and improve scratch utilization.
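
The double-/triple-buffering mentioned above follows the standard ping-pong pattern: while the kernel consumes tile `t` from one scratch slot, tile `t+1` is staged into another slot. A minimal host-side C++ sketch of the idea (names, tile sizes, and the synchronous `memcpy` stand-in for the asynchronous DMA are illustrative, not the actual kernel code):

```cpp
#include <array>
#include <cstring>
#include <numeric>
#include <vector>

// Illustrative ping-pong buffering: on the real NPU the staging copy would be
// an asynchronous DMA that overlaps with compute; memcpy stands in here.
constexpr int kNumSlots  = 2;   // 2 = double-buffering, 3 = triple-buffering
constexpr int kTileElems = 4;   // elements per tile (illustrative)

long process_tiles(const std::vector<int> & src) {
    std::array<std::array<int, kTileElems>, kNumSlots> scratch{};
    const int num_tiles = static_cast<int>(src.size()) / kTileElems;

    // Prefetch tile 0 into slot 0 before the main loop (DMA kick-off).
    std::memcpy(scratch[0].data(), src.data(), kTileElems * sizeof(int));

    long sum = 0;
    for (int t = 0; t < num_tiles; ++t) {
        // Stage tile t+1 into the next slot; on hardware this overlaps
        // with the compute below instead of running serially.
        if (t + 1 < num_tiles) {
            std::memcpy(scratch[(t + 1) % kNumSlots].data(),
                        src.data() + (t + 1) * kTileElems,
                        kTileElems * sizeof(int));
        }
        // "Compute" on tile t from its slot (here: accumulate).
        for (int v : scratch[t % kNumSlots]) {
            sum += v;
        }
    }
    return sum;
}
```

With a third slot (triple-buffering) the staging copy for tile `t+1` no longer has to finish before compute on tile `t-1` releases its slot, which is what buys extra overlap at small `n`.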

Performance

Test setup: 8gen2
Test suite: test-backend-ops
Compile: llama-cpp-qnn-builder/docker/docker_compose_compile.sh -r --hexagon-npu-only --enable-dequant

Summary

  • Shape for tables below: m=4096, k=14336.

q4_0 (improved at small-to-medium n)

| n | Baseline (GFLOPS) | Phase 2 (GFLOPS) | Δ (%)  |
|---|-------------------|------------------|--------|
| 1 | 22.51             | 26.71            | +18.7% |
| 2 | 33.84             | 40.61            | +20.0% |
| 3 | 41.07             | 48.55            | +18.2% |
| 4 | 46.03             | 53.79            | +16.9% |
| 5 | 49.43             | 57.55            | +16.4% |
| 8 | 55.78             | 64.14            | +15.0% |

q8_0 (large gains at small-to-medium n)

| n | Baseline (GFLOPS) | Phase 2 (GFLOPS) | Δ (%)   |
|---|-------------------|------------------|---------|
| 1 | 3.85              | 7.93             | +106.0% |
| 2 | 7.21              | 14.47            | +101.0% |
| 3 | 10.29             | 19.92            | +93.6%  |
| 4 | 13.04             | 24.55            | +88.3%  |
| 5 | 15.46             | 28.52            | +84.5%  |
| 8 | 21.82             | 37.56            | +72.1%  |

Notes

Unit tests


14476/14476 tests passed
  Backend hexagon-npu: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

test-backend-ops-all.release.hexagon.dddef701a

Special thanks

@chraac chraac self-assigned this Oct 1, 2025
@chraac chraac added the enhancement New feature or request label Oct 1, 2025
@chraac chraac requested a review from Copilot October 1, 2025 03:45
@chraac chraac moved this to In progress in hexagon-npu backend Oct 1, 2025
Copilot AI left a comment

Pull Request Overview

This PR implements Phase 2 of DMA-driven performance optimizations for the Hexagon NPU backend, building on previous DMA work to further improve memory access patterns and computational overlap. The changes focus on expanding DMA pipeline coverage to additional hot paths and improving buffering strategies.

Key changes include:

  • Added support for 6-block processing (load_hexa_block_generic function and HVX_Vector_x5 type)
  • Refactored quantization block loading to use generalized load_struct_into_vector helper
  • Improved DMA transfer initialization with templated init_dma_transfer function
  • Enhanced buffer management with better alignment and caching strategies for both quantized and non-quantized data paths

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
ggml/src/ggml-qnn/npu/device/vec_ops.hpp Added HVX_Vector_x5 type definition for 5-element vector packs
ggml/src/ggml-qnn/npu/device/type_traits.cpp Enhanced quantization block loading with 6-block support and refactored existing dual/quad block functions
ggml/src/ggml-qnn/npu/device/op/op_mul_mat.cpp Refactored matrix multiplication implementation with improved DMA handling and unified buffer management
ggml/src/ggml-qnn/npu/device/dma_transfer.cpp Changed DMA descriptor ordering from DESC_ORDER_ORDER to DESC_ORDER_NOORDER
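
The shape of the templated `init_dma_transfer` change can be sketched conceptually: deriving the transfer size from the element type instead of passing raw byte counts. The descriptor struct, its fields, and the synchronous `run_transfer` below are illustrative stand-ins, not the Hexagon user-DMA API (whose descriptor layout and `DESC_ORDER_*` flags live in the SDK headers):

```cpp
#include <cstddef>
#include <cstring>

// Illustrative descriptor; the real Hexagon user-DMA descriptor differs.
struct FakeDmaDesc {
    const void * src;
    void *       dst;
    size_t       bytes;
    bool         ordered;  // false ~ DESC_ORDER_NOORDER: transfers in a chain
                           // may complete out of order, reducing serialization
};

// Templated initializer in the spirit of init_dma_transfer: the byte count
// is computed from the element type, avoiding hand-passed sizes.
template <typename T>
FakeDmaDesc init_dma_transfer(const T * src, T * dst, size_t count) {
    return FakeDmaDesc{src, dst, count * sizeof(T), /*ordered=*/false};
}

// Synchronous stand-in for submitting and draining the transfer.
inline void run_transfer(const FakeDmaDesc & d) {
    std::memcpy(d.dst, d.src, d.bytes);
}
```

Dropping the ordering constraint (the `DESC_ORDER_ORDER` → `DESC_ORDER_NOORDER` change in the table above) lets independent descriptors in a chain retire as soon as their data lands, rather than waiting on earlier descriptors.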

@chraac chraac merged commit e1727af into dev-refactoring Oct 5, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in hexagon-npu backend Oct 5, 2025
@chraac chraac mentioned this pull request Oct 12, 2025