Conversation

@chraac chraac commented Oct 1, 2025

Related to #34
Related to #51

Overview

  • Phase 2 of the DMA-driven performance work builds on PR #56 (feat: perf opt dma) to further reduce memory stalls and improve the overlap between data movement and compute.

Key Changes

  • Broadened DMA pipeline coverage to additional hot paths (dense and selected quantized/packing routines), with generalized descriptor helpers.
  • Buffering improvements:
    • Tunable double-/triple-buffering per kernel to increase overlap at small n.
    • More alignment-friendly tiling to reduce bank conflicts and improve scratch utilization.
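
The double-/triple-buffering mentioned above follows the standard ping-pong pattern: while the kernel consumes tile `t` from one scratch slot, tile `t+1` is staged into another slot. A minimal host-side C++ sketch of the idea (names, tile sizes, and the synchronous `memcpy` stand-in for the asynchronous DMA are illustrative, not the actual kernel code):

```cpp
#include <array>
#include <cstring>
#include <numeric>
#include <vector>

// Illustrative ping-pong buffering: on the real NPU the staging copy would be
// an asynchronous DMA that overlaps with compute; memcpy stands in here.
constexpr int kNumSlots  = 2;   // 2 = double-buffering, 3 = triple-buffering
constexpr int kTileElems = 4;   // elements per tile (illustrative)

long process_tiles(const std::vector<int> & src) {
    std::array<std::array<int, kTileElems>, kNumSlots> scratch{};
    const int num_tiles = static_cast<int>(src.size()) / kTileElems;

    // Prefetch tile 0 into slot 0 before the main loop (DMA kick-off).
    std::memcpy(scratch[0].data(), src.data(), kTileElems * sizeof(int));

    long sum = 0;
    for (int t = 0; t < num_tiles; ++t) {
        // Stage tile t+1 into the next slot; on hardware this overlaps
        // with the compute below instead of running serially.
        if (t + 1 < num_tiles) {
            std::memcpy(scratch[(t + 1) % kNumSlots].data(),
                        src.data() + (t + 1) * kTileElems,
                        kTileElems * sizeof(int));
        }
        // "Compute" on tile t from its slot (here: accumulate).
        for (int v : scratch[t % kNumSlots]) {
            sum += v;
        }
    }
    return sum;
}
```

With a third slot (triple-buffering) the staging copy for tile `t+1` no longer has to finish before compute on tile `t-1` releases its slot, which is what buys extra overlap at small `n`.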

Performance

Test setup: 8gen2
Test suite: test-backend-ops
Compile: llama-cpp-qnn-builder/docker/docker_compose_compile.sh -r --hexagon-npu-only --enable-dequant

Summary

  • Shape for tables below: m=4096, k=14336.

q4_0 (improved at small-to-medium n)

| n | Baseline (GFLOPS) | Phase 2 (GFLOPS) | Δ (%)  |
|---|-------------------|------------------|--------|
| 1 | 22.51             | 26.71            | +18.7% |
| 2 | 33.84             | 40.61            | +20.0% |
| 3 | 41.07             | 48.55            | +18.2% |
| 4 | 46.03             | 53.79            | +16.9% |
| 5 | 49.43             | 57.55            | +16.4% |
| 8 | 55.78             | 64.14            | +15.0% |

q8_0 (large gains at small-to-medium n)

| n | Baseline (GFLOPS) | Phase 2 (GFLOPS) | Δ (%)   |
|---|-------------------|------------------|---------|
| 1 | 3.85              | 7.93             | +106.0% |
| 2 | 7.21              | 14.47            | +101.0% |
| 3 | 10.29             | 19.92            | +93.6%  |
| 4 | 13.04             | 24.55            | +88.3%  |
| 5 | 15.46             | 28.52            | +84.5%  |
| 8 | 21.82             | 37.56            | +72.1%  |

Notes

Unit tests


14476/14476 tests passed
  Backend hexagon-npu: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

test-backend-ops-all.release.hexagon.dddef701a

Special thanks

@chraac chraac self-assigned this Oct 1, 2025
@chraac chraac added the enhancement New feature or request label Oct 1, 2025
@chraac chraac requested a review from Copilot October 1, 2025 03:45
@chraac chraac moved this to In progress in hexagon-npu backend Oct 1, 2025
Copilot AI left a comment

Pull Request Overview

This PR implements Phase 2 of DMA-driven performance optimizations for the Hexagon NPU backend, building on previous DMA work to further improve memory access patterns and computational overlap. The changes focus on expanding DMA pipeline coverage to additional hot paths and improving buffering strategies.

Key changes include:

  • Added support for 6-block processing (load_hexa_block_generic function and HVX_Vector_x5 type)
  • Refactored quantization block loading to use generalized load_struct_into_vector helper
  • Improved DMA transfer initialization with templated init_dma_transfer function
  • Enhanced buffer management with better alignment and caching strategies for both quantized and non-quantized data paths

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
ggml/src/ggml-qnn/npu/device/vec_ops.hpp Added HVX_Vector_x5 type definition for 5-element vector packs
ggml/src/ggml-qnn/npu/device/type_traits.cpp Enhanced quantization block loading with 6-block support and refactored existing dual/quad block functions
ggml/src/ggml-qnn/npu/device/op/op_mul_mat.cpp Refactored matrix multiplication implementation with improved DMA handling and unified buffer management
ggml/src/ggml-qnn/npu/device/dma_transfer.cpp Changed DMA descriptor ordering from DESC_ORDER_ORDER to DESC_ORDER_NOORDER
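
The shape of the templated `init_dma_transfer` change can be sketched conceptually: deriving the transfer size from the element type instead of passing raw byte counts. The descriptor struct, its fields, and the synchronous `run_transfer` below are illustrative stand-ins, not the Hexagon user-DMA API (whose descriptor layout and `DESC_ORDER_*` flags live in the SDK headers):

```cpp
#include <cstddef>
#include <cstring>

// Illustrative descriptor; the real Hexagon user-DMA descriptor differs.
struct FakeDmaDesc {
    const void * src;
    void *       dst;
    size_t       bytes;
    bool         ordered;  // false ~ DESC_ORDER_NOORDER: transfers in a chain
                           // may complete out of order, reducing serialization
};

// Templated initializer in the spirit of init_dma_transfer: the byte count
// is computed from the element type, avoiding hand-passed sizes.
template <typename T>
FakeDmaDesc init_dma_transfer(const T * src, T * dst, size_t count) {
    return FakeDmaDesc{src, dst, count * sizeof(T), /*ordered=*/false};
}

// Synchronous stand-in for submitting and draining the transfer.
inline void run_transfer(const FakeDmaDesc & d) {
    std::memcpy(d.dst, d.src, d.bytes);
}
```

Dropping the ordering constraint (the `DESC_ORDER_ORDER` → `DESC_ORDER_NOORDER` change in the table above) lets independent descriptors in a chain retire as soon as their data lands, rather than waiting on earlier descriptors.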

@chraac chraac merged commit e1727af into dev-refactoring Oct 5, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in hexagon-npu backend Oct 5, 2025
@chraac chraac mentioned this pull request Oct 12, 2025