Add experimental ggml-hexagon backend for the Hexagon NPU #16547
| 
 You can do backend-specific reorderings by implementing the graph_optimize function. | 
| 
 Yeah, I saw you guys added that function recently. Will check it out. | 
| Are the changes in  | 
| 
 Yep. Only for that. I'm going to play with the  | 
| You can try to reuse the  llama.cpp/ggml/src/ggml-metal/ggml-metal-common.h Lines 43 to 49 in 7049736 
 It reorders the nodes to improve concurrency, which effectively results in stacking the matrix multiplications together. | 
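The reordering idea described above can be sketched in isolation. This is a minimal illustration, not the Metal helper or the Hexagon optimizer: `Node`, `Op`, `reads_pending` and `stack_matmuls` are hypothetical names, and the only ggml convention assumed is that a MUL_MAT takes its weights as `src0` and its activation as `src1`. The function hoists later MUL_MATs that read the same activation so they run back to back, and it skips any candidate whose inputs have not been emitted yet.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node (real ggml graphs hold ggml_tensor pointers).
// As with ggml's MUL_MAT, src0 is the weight matrix and src1 is the activation.
enum class Op { MUL_MAT, RMS_NORM, OTHER };

struct Node {
    std::string  name;
    Op           op;
    const Node * src0;
    const Node * src1;
};

// True if `n` reads a node in nodes[first..last) that has not been emitted yet.
static bool reads_pending(const Node * n, const std::vector<const Node *> & nodes,
                          const std::vector<bool> & emitted, size_t first, size_t last) {
    for (size_t k = first; k < last; ++k) {
        if (!emitted[k] && (n->src0 == nodes[k] || n->src1 == nodes[k])) {
            return true;
        }
    }
    return false;
}

// Returns a new execution order in which MUL_MATs sharing the same activation (src1)
// are adjacent. Only the order changes; the nodes themselves are never modified.
std::vector<const Node *> stack_matmuls(const std::vector<const Node *> & nodes) {
    std::vector<const Node *> out;
    std::vector<bool>         emitted(nodes.size(), false);
    out.reserve(nodes.size());
    for (size_t i = 0; i < nodes.size(); ++i) {
        if (emitted[i]) continue;
        emitted[i] = true;
        out.push_back(nodes[i]);
        if (nodes[i]->op != Op::MUL_MAT) continue;
        // Hoist later MUL_MATs that read the same activation, if their inputs are ready.
        for (size_t j = i + 1; j < nodes.size(); ++j) {
            const Node * c = nodes[j];
            if (emitted[j] || c->op != Op::MUL_MAT || c->src1 != nodes[i]->src1) continue;
            if (reads_pending(c, nodes, emitted, i + 1, j)) continue;
            emitted[j] = true;
            out.push_back(c);
        }
    }
    assert(out.size() == nodes.size());
    return out;
}
```

For a layer ordered as norm, Q matmul, RoPE(Q), K matmul, RoPE(K), V matmul, the sketch yields norm, Q, K, V, RoPE(Q), RoPE(K). Returning a permuted list of `const Node *` keeps the nodes themselves untouched.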
    | 
 It would be preferable to avoid using extra buffer types unless absolutely necessary, because using them adds a significant amount of complexity to the user code. Repacking the weights does not need to be restricted to extra buffer types. If the backend can only work with repacked tensors, the logic can be baked into the backend's default buffer type. There was some discussion about the limits of the Qualcomm NPU memory bandwidth; can you confirm that's the case? If the weights were stored in a buffer type that's compatible with the CPU backend, it would be possible to use the CPU backend for generation and the hexagon backend for batch processing. Repacking will make that difficult, unless it uses the same repacking format as the CPU backend. Would it be possible to use the same repacking as the CPU backend? | 
    | 
 @ggerganov quick update. I can definitely use  | 
| 
 That's what I started with (i.e. no-repack buffers) but that makes it tricky to do partial offloads and introduces lots of copies because the scheduler thinks the buffers are not shareable/reusable. 
 In the previous SoC generations (Gen3, Gen4, X-Elite) the CPU does technically have more memory BW (a lot more in the X-Elite). In Gen5 and X2-Elite the BW is about the same. 
 Functionally, it's possible to use Q4_0, MXFP4, etc as is with HVX, unfortunately, it's about the worst layout for it :) It might be possible to share the same format if we were to redo ARM64 CPU_REPACK to use something like the above. It'd be absolutely awesome to have a common optimized format for CPU/GPU/NPU, ideally on-disk so that we mmap and don't repack at all like the original GGML, but it's very tricky in practice. | 
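To make the layout discussion concrete, here is a small sketch of what repacking a row of Q4_0 blocks can look like. The `BlockQ4_0` fields mirror ggml's `block_q4_0` (one fp16 scale plus 16 bytes holding 32 4-bit quants, with element `j` in the low nibble and element `j + 16` in the high nibble of byte `j`). The `RepackedRow` layout, which stores all quants contiguously followed by all scales, is purely an assumption for illustration and is not the actual HTP-REPACK or CPU_REPACK format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t QK4_0 = 32;  // elements per Q4_0 block

// Same field layout as ggml's block_q4_0; the fp16 scale is kept as raw bits here.
struct BlockQ4_0 {
    uint16_t d;
    uint8_t  qs[QK4_0 / 2];  // byte j: element j (low nibble), element j + 16 (high nibble)
};

// Hypothetical repacked row: quants and scales split into two contiguous arrays,
// so wide vector loads see only quant bytes without interleaved scales.
struct RepackedRow {
    std::vector<uint8_t>  qs;  // nblocks * 16 bytes
    std::vector<uint16_t> d;   // nblocks scales
};

RepackedRow repack_row(const std::vector<BlockQ4_0> & blocks) {
    RepackedRow row;
    row.qs.reserve(blocks.size() * (QK4_0 / 2));
    row.d.reserve(blocks.size());
    for (const BlockQ4_0 & b : blocks) {
        row.qs.insert(row.qs.end(), b.qs, b.qs + QK4_0 / 2);
        row.d.push_back(b.d);
    }
    return row;
}

// Reads element i of the row as a signed 4-bit value in [-8, 7] (scale not applied).
int get_quant(const RepackedRow & row, size_t i) {
    const size_t  block = i / QK4_0;
    const size_t  r     = i % QK4_0;
    const uint8_t byte  = row.qs[block * (QK4_0 / 2) + r % (QK4_0 / 2)];
    const int     nib   = r < QK4_0 / 2 ? (byte & 0x0F) : (byte >> 4);
    return nib - 8;
}
```

A tensor stored this way can no longer be read by code that expects the interleaved `block_q4_0` layout, which is why sharing one buffer between the CPU and NPU backends depends on both agreeing on the same repacked format.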
    | @slaren @bandoti @CISC btw That  | 
| 
 It is at least ripe for some refactoring. | 
    | 
 @ggerganov  @jeffbolznv Quick question for you guys. | 
| IMO the nodes should be treated as const. | 
| 
 Ok. Looks like Georgi agrees as well :) | 
| @slaren Let me know if you had any more thoughts on the REPACK buffers. We could revisit things and come up with a better solution in a follow-up. I keep thinking it'd be nice if we could just keep everything as a host buffer and somehow enforce  I made good progress on enabling Qualcomm Device Cloud (aka QDC) where we can run CI jobs on physical devices. | 
    | Hmm. Not sure why arm64 ubuntu builds are failing with missing GGML_F32_STEP. | 
| Don't worry about the build failing, it is unrelated. | 
…6547)

* model: add support for extra bufs for all devices
* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU

  This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

  Highlights:
  - Supports Hexagon versions: v73, v75, v79, and v81
  - Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
  - Supports Q4_0, Q8_0, MXFP4, and FP32 data types
  - Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

  **Note:** This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from llama.cpp/ggml developer and user community.

  Co-Authored-By: Rajdeep Ganguly <[email protected]>
  Co-Authored-By: Todor Boinovski <[email protected]>
* hexagon: fix format checker errors
* hexagon: update readme and cmake presets
* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions
* hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input
* hexagon: move ADB helper scripts into scripts/snapdragon/adb
* hexagon: replace all f/printfs with GGML_LOG_...
* readme: add hexagon to the list supported backends
* hexagon: stack malmuts with quantized inputs only
* hexagon: add TODO for fixing issues in hexagon_graph_optimize
* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC
* scripts: fix lint errors
* scripts: update qdc pytest script to make linter happy
* hexagon: add reduce sum in fp32
* hexagon: reduce number of vector stores in matmul output
* hexagon: remove the need for vdelta in reduce-multiply-x8
* hexagon: consistent use of reduce_sum_fp32 for row_sums
* hexagon: some more matmul optimizations and comments

  Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models). We've handled those cases already but at a higher overhead.
* hexagon: update cmake presets
* hexagon: add OPMASK support for run-bench.sh wrapper
* hexagon: update to use GGML_BACKEND_API
* hexagon: remove unused logic for setting tensor flags for the views
* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors

  Same asserts as the CPU backend.
* hexagon: use cpy_tensor slow path for non-host buffers
* hexagon: error checks in the buffer allocator
* cmake: move include(extProj) under ggml-hexagon
* hexagon: don't forget to delete the backend on free
* hexagon: set/get_tensor size assert apply only to quantized tensors
* hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now

  GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way. Ideally we need a bit more finer log levels.
* docs: typos in hexagon developer docs (libggm-...)
* hexagon: overhaul error handling in the session/device allocation

  this should handle all failure paths in the session allocation.
* hexagon: update cmake presets to enable fp16 vectors
* hexagon: remove unused time_usec function
* hexagon: don't forget to release buffer contexts
* hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure)
* hexagon: remove custom can_repeat function and use ggml_can_repeat

---------

Co-authored-by: Rajdeep Ganguly <[email protected]>
Co-authored-by: Todor Boinovski <[email protected]>
This PR introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.
Highlights:
Note: This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from llama.cpp/ggml developer and user community.
Please see (docs/backend/hexagon/README.md) for build and basic usage info.
Also (docs/backend/hexagon/developer.md) for some notes on things like buffer management, etc.
Tested with the following models:
Known-issues:
`HTP-REPACK` buffers (see some notes/questions below)
Future-work:
@slaren
It'd be good to make `extra_buffers_type` a bit better exposed/integrated. I started thinking that `device_get_buffer_type` should probably just return a list of buffer types. Currently only the CPU backend needed those extra bufs but now the Hexagon needs them too (i.e. all buffers are actually normal host buffers from the CPU perspective and the extras are needed only to force the REPACK). I added basic support for that in the model loader (separate commit here in the PR) and was going to sprinkle some more but we should discuss if there is perhaps a better way to handle this.

I included some wrapper scripts (under docs/backend/hexagon) to make running stuff over ADB easier. Let me know if we should put them in some other directory. They are a bit Snapdragon-specific because of the env vars needed to find Hexagon/HTP libraries (described in the developer.md).
@ggerganov
I included a commit that groups the Attention and FFN MATMUL Ops.
Like this:
This allows us to easily reuse dynamically quantized `attn_norm-27`. The Hexagon/HTP backend places quantized tensors in VTCM (basically a SW-managed cache) where subsequent Ops can reuse them.

This change doesn't seem to have any effect on other backends (I did some basic checks with CPU, OpenCL, CUDA, and Metal). Please let me know what you think. Perhaps you have suggestions for how to do this better.
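Why grouping helps can be shown with a toy model. The sketch below treats the scratch memory as a single-entry cache keyed by the activation's data pointer, which is a deliberate simplification and not how the backend manages VTCM. `ActivationCache`, `QuantizedActivation` and the per-tensor 8-bit quantization are all hypothetical. When matmuls that share an input run back to back, the activation is quantized once; when other inputs are interleaved, it is quantized again each time.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct QuantizedActivation {
    std::vector<int8_t> q;
    float               scale = 0.0f;
};

// Hypothetical single-entry cache standing in for a software-managed scratch area.
class ActivationCache {
public:
    int quantize_calls = 0;

    // Returns the quantized form of x, re-quantizing only when the input changes.
    const QuantizedActivation & get(const std::vector<float> & x) {
        if (!valid_ || key_ != x.data()) {
            quantize(x);
            key_   = x.data();
            valid_ = true;
            ++quantize_calls;
        }
        return cached_;
    }

private:
    // Symmetric 8-bit quantization with one scale for the whole vector.
    void quantize(const std::vector<float> & x) {
        float amax = 0.0f;
        for (float v : x) amax = std::fmax(amax, std::fabs(v));
        cached_.scale = amax / 127.0f;
        cached_.q.resize(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            cached_.q[i] = cached_.scale > 0.0f
                ? static_cast<int8_t>(std::lround(x[i] / cached_.scale))
                : 0;
        }
    }

    const float *       key_   = nullptr;
    bool                valid_ = false;
    QuantizedActivation cached_;
};
```

With inputs `a, a, a, b` the cache quantizes twice, while the interleaved sequence `a, b, a, b` quantizes four times.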
Marking as a Draft for now because I'm working on enabling the CI.
Otherwise all required bits and pieces are ready to go.
Some outputs of the Hexagon Backend in action on my Galaxy S25+