@max-krasnyansky
Collaborator

@max-krasnyansky max-krasnyansky commented Oct 13, 2025

This PR introduces a new experimental backend ggml-hexagon with support for the Hexagon NPU.

Highlights:

  • Supports Hexagon versions: v73, v75, v79, and v81
  • Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
  • Supports Q4_0, Q8_0, MXFP4, and FP32 data types
  • Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX
  • Minimal build dependencies (just needs Android NDK and Hexagon-SDK Community Edition)

Note: This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from the llama.cpp/ggml developer and user community.

Please see docs/backend/hexagon/README.md for build and basic usage info.
Also see docs/backend/hexagon/developer.md for some notes on things like buffer management, etc.

Tested with the following models:

  • Llama-3.2-1B-Instruct-Q4_0.gguf
  • Llama-3.2-3B-Instruct-Q4_0.gguf
  • Llama-3.1-8B-Instruct-Q4_0.gguf
  • Qwen3-4B-Q4_0.gguf
  • Qwen3-8B-128K-Q4_0.gguf
  • Qwen3-14B-128K-Q4_0.gguf
  • LFM2-1.2B-Q4_0.gguf
  • OLMoE-1B-7B-0125-Instruct-Q4_0.gguf
  • gpt-oss-20b-mxfp4.gguf & gpt-oss-20b.mxfp4-q4_0.gguf (requires latest devices with 16+ GB of DDR)

Known issues:

  • test-backend-ops failures for supported Ops. There are a few corner cases that we don't handle yet, none of which affect the models listed above. We will follow up with fixes for those.
  • Tensor-override option needs some updates to work with the HTP-REPACK buffers (see some notes/questions below)
  • Integration (buffer sharing, etc) with the OpenCL/Adreno backend needs work

Future work:

  • More optimizations (kernels, fusion, etc), more Ops (SHORTCONV, etc), more DataTypes
  • Better integration with the OpenCL/Adreno backend
  • Support for Windows on Snapdragon devices

@slaren
It'd be good to make extra_buffers_type a bit better exposed/integrated. I started thinking that device_get_buffer_type should probably just return a list of buffer types. Until now only the CPU backend needed those extra bufs, but now Hexagon needs them too (i.e. all buffers are actually normal host buffers from the CPU perspective, and the extras are needed only to force the REPACK). I added basic support for that in the model loader (separate commit here in the PR) and was going to sprinkle in some more, but we should discuss whether there is a better way to handle this.

I included some wrapper scripts (under docs/backend/hexagon) to make running stuff over ADB easier. Let me know if we should put them in some other directory. They are a bit Snapdragon-specific because of the env vars needed to find the Hexagon/HTP libraries (described in developer.md).


@ggerganov
I included a commit that groups the Attention and FFN MATMUL Ops.
Like this:

node #841 ( MUL_MAT):  Qcur-27 (   1M) [ HTP0 ] use=1: blk.27.attn_q.weight [ HTP0 ]  attn_norm-27 [ HTP0 ] 
node #842 ( MUL_MAT):  Kcur-27 ( 512K) [ HTP0 ] use=1: blk.27.attn_k.weight [ HTP0 ]  attn_norm-27 [ HTP0 ] 
node #843 ( MUL_MAT):  Vcur-27 ( 512K) [ HTP0 ] use=1: blk.27.attn_v.weight [ HTP0 ]  attn_norm-27 [ HTP0 ] 
node #845 (    ROPE):  Qcur-27 (   1M) [ HTP0 ] use=1:   Qcur-27 (reshaped) [ HTP0 ] HTP0#leaf_6#0 [ NULL ] 

This allows us to easily reuse the dynamically quantized attn_norm-27. The Hexagon/HTP backend places quantized tensors in VTCM (basically a SW-managed cache) where subsequent Ops can reuse them.
This change doesn't seem to have any effect on other backends (I did some basic checks with CPU, OpenCL, CUDA, and Metal). Please let me know what you think. Perhaps, you have suggestions for how to do this better.
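
To illustrate why the grouping helps, here is a minimal sketch (not the backend's actual code) that counts how many times an activation would have to be dynamically quantized. The `Node` struct and `count_quantizations` are hypothetical names, and the one-slot cache is a deliberate simplification of VTCM: it assumes only the most recently quantized src1 survives, and that any non-MUL_MAT op may overwrite it.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified graph node: an op name and the name of its
// activation input (src1). Not a ggml type.
struct Node {
    std::string op;
    std::string src1;
};

// Count dynamic quantization passes, assuming a one-slot cache that holds
// only the most recently quantized activation (a stand-in for VTCM).
inline int count_quantizations(const std::vector<Node> & nodes) {
    int passes = 0;
    std::string cached;  // name of the activation currently held in the slot
    for (const Node & n : nodes) {
        if (n.op != "MUL_MAT") {
            cached.clear();  // assumption: any other op may overwrite the slot
            continue;
        }
        if (n.src1 != cached) {
            ++passes;        // cache miss: src1 has to be quantized again
            cached = n.src1;
        }
    }
    return passes;
}
```

Under these assumptions, three back-to-back MUL_MATs on the same input need one quantization pass, while the same three MUL_MATs interleaved with ROPE nodes need three.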


Marking as a Draft for now because I'm working on enabling the CI.
Otherwise all required bits and pieces are ready to go.


Some outputs of the Hexagon Backend in action on my Galaxy S25+
~/src/llama.cpp-hexagon$ M=../gguf/Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 docs/backend/hexagon/run-cli.sh -no-cnv -p \"what is the most popular cookie in the world?\"
+ adb shell ...
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
...
ggml-hex: allocating new registry : ndev 1
ggml-hex: HTP arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007974ffcd90
build: 6733 (6a8cf8914) with Android (13324770, +pgo, +bolt, +lto, +mlgo, based on r530567d) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu
...
load_tensors: offloaded 17/17 layers to GPU
load_tensors:          CPU model buffer size =   225.49 MiB
load_tensors:         HTP0 model buffer size =     0.26 MiB
load_tensors:  HTP0-REPACK model buffer size =   504.00 MiB
...
llama_context: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.49 MiB
llama_kv_cache:       HTP0 KV buffer size =   136.00 MiB
llama_kv_cache: size =  136.00 MiB (  8192 cells,  16 layers,  1/1 seqs), K (q8_0):   68.00 MiB, V (q8_0):   68.00 MiB
llama_context:       HTP0 compute buffer size =    15.00 MiB
llama_context:        CPU compute buffer size =    62.62 MiB
llama_context: graph nodes  = 503
llama_context: graph splits = 41
...
 Chocolate chip cookies are the most popular cookie in the world. According to the International Association of Culinary Professionals, chocolate chip cookies are the most popular cookie in the world.
In the United States, chocolate chip cookies are a beloved favorite, and many bakeries and restaurants offer them as a classic treat. They are often associated with family gatherings, road trips, and comfort food.
In fact, according to a survey conducted by YouGov in 2020, chocolate chip cookies are the most popular cookie in the United States, with 27% of respondents naming them as their favorite. This makes chocolate chip cookies the top choice among Americans, and a close second in other countries. [end of text]


llama_perf_sampler_print:    sampling time =       7.12 ms /   147 runs   (    0.05 ms per token, 20634.48 tokens per second)
llama_perf_context_print:        load time =     623.27 ms
llama_perf_context_print: prompt eval time =      81.60 ms /    11 tokens (    7.42 ms per token,   134.81 tokens per second)
llama_perf_context_print:        eval time =    2428.18 ms /   135 runs   (   17.99 ms per token,    55.60 tokens per second)
llama_perf_context_print:       total time =    2630.13 ms /   146 tokens
llama_perf_context_print:    graphs reused =        134
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                  439 =   225 +     136 +      77                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                  504 =   504 +       0 +       0                |
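
The KV buffer size reported in the log above can be cross-checked with a little arithmetic. The sketch below assumes the q8_0 block layout (32 int8 quants plus one fp16 scale, 34 bytes per block) and a per-layer K (or V) width of 512 values per cell, i.e. 8 KV heads x 64 head dim for Llama-3.2-1B; that width is an assumption and does not appear in the log. `q8_0_cache_bytes` is a hypothetical helper name.

```cpp
#include <cstdint>

// Size in bytes of one q8_0-quantized cache (K or V).
// q8_0 stores each block of 32 values as 32 int8 quants plus one fp16 scale.
constexpr std::uint64_t q8_0_cache_bytes(std::uint64_t n_cells,
                                         std::uint64_t n_layers,
                                         std::uint64_t n_embd_kv) {
    constexpr std::uint64_t block_elems = 32;
    constexpr std::uint64_t block_bytes = 34;  // 32 quants + 2-byte scale
    return n_cells * n_layers * (n_embd_kv / block_elems) * block_bytes;
}
```

With 8192 cells, 16 layers and the assumed width of 512, this gives 71,303,168 bytes, i.e. 68 MiB for K and the same for V, matching the 136 MiB total in the log.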



~/src/llama.cpp-hexagon$ M=LFM2-1.2B-Q4_0.gguf D=HTP0 docs/backend/hexagon/run-cli.sh -no-cnv -p \"what is the most popular cookie in the world?\"
+ adb shell ...
...
load_tensors: offloaded 17/17 layers to GPU
load_tensors:          CPU model buffer size =   105.24 MiB
load_tensors:         HTP0 model buffer size =     0.25 MiB
load_tensors:  HTP0-REPACK model buffer size =   555.75 MiB
...
llama_context: n_ctx_per_seq (8192) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.25 MiB
llama_kv_cache:       HTP0 KV buffer size =    51.00 MiB
llama_kv_cache: size =   51.00 MiB (  8192 cells,   6 layers,  1/1 seqs), K (q8_0):   25.50 MiB, V (q8_0):   25.50 MiB
llama_memory_recurrent:       HTP0 RS buffer size =     0.16 MiB
llama_memory_recurrent: size =    0.16 MiB (     1 cells,  16 layers,  1 seqs), R (f32):    0.16 MiB, S (f32):    0.00 MiB
llama_context:       HTP0 compute buffer size =    14.00 MiB
llama_context:        CPU compute buffer size =    32.00 MiB
llama_context: graph nodes  = 549
llama_context: graph splits = 49
...
system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 | 

...
**(a) Chocolate chip** is the answer, but keep in mind that preferences can vary widely by region and individual taste.

For the most accurate and current data, you might want to look into recent market analyses or consumer surveys. However, chocolate chip remains a strong contender due to its universal appeal and cultural significance. [end of text]


llama_perf_sampler_print:    sampling time =       9.49 ms /   301 runs   (    0.03 ms per token, 31720.94 tokens per second)
llama_perf_context_print:        load time =     694.86 ms
llama_perf_context_print: prompt eval time =      76.28 ms /    11 tokens (    6.93 ms per token,   144.20 tokens per second)
llama_perf_context_print:        eval time =    4698.82 ms /   289 runs   (   16.26 ms per token,    61.50 tokens per second)
llama_perf_context_print:       total time =    4793.75 ms /   300 tokens
llama_perf_context_print:    graphs reused =          0
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                  202 =   105 +      51 +      46                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                  555 =   555 +       0 +       0                |

@github-actions github-actions bot added documentation Improvements or additions to documentation ggml changes relating to the ggml tensor library for machine learning labels Oct 13, 2025
@max-krasnyansky max-krasnyansky requested a review from lhez October 13, 2025 01:25
@jeffbolznv
Collaborator

> I included a commit that groups the Attention and FFN MATMUL Ops.
> ...
> Please let me know what you think. Perhaps, you have suggestions for how to do this better.

You can do backend-specific reorderings by implementing the graph_optimize function.

@max-krasnyansky
Collaborator Author

> > I included a commit that groups the Attention and FFN MATMUL Ops.
> > ...
> > Please let me know what you think. Perhaps, you have suggestions for how to do this better.
>
> You can do backend-specific reorderings by implementing the graph_optimize function.

Yeah, I saw you guys added that function recently. Will check it out.

@ggerganov
Member

Are the changes in llama-graph and llama-model (excluding the extra buffer type changes) only needed for the VTCM utilization? If yes, then it's better to avoid these changes and implement the necessary logic in graph_optimize as suggested.

@max-krasnyansky
Collaborator Author

> Are the changes in llama-graph and llama-model (excluding the extra buffer type changes) only needed for the VTCM utilization? If yes, then it's better to avoid these changes and implement the necessary logic in graph_optimize as suggested.

Yep. Only for that. I'm going to play with graph_optimize shortly. It's probably going to be a bit more involved/expensive, i.e. the optimizer will need to re-scan the nodes and figure out the dependencies, whereas the graph builder just needs to call build_forward after the MUL_MATs. But let me try.

@ggerganov
Member

You can try to reuse the ggml_graph_optimize from the Metal backend:

// reorder the nodes in the graph to improve concurrency, while respecting fusion
//
// note: this implementation is generic and not specific to metal
// if it proves to work well, we can start using it for other backends in the future
void ggml_graph_optimize(struct ggml_cgraph * gf);

It reorders the nodes to improve concurrency, which effectively results in stacking the matrix multiplications together.

@slaren
Member

slaren commented Oct 14, 2025

> It'd be good to make extra_buffers_type a bit better exposed/integrated. I started thinking that device_get_buffer_type should probably just return a list of buffer types.

It would be preferable to avoid using extra buffer types unless absolutely necessary, because using them adds a significant amount of complexity to the user code. Repacking the weights does not need to be restricted to extra buffer types. If the backend can only work with repacked tensors, the logic can be baked into the backend's default buffer type.

There was some discussion about the limits of the Qualcomm NPU memory bandwidth, can you confirm that's the case? If the weights were stored in a buffer type that's compatible with the CPU backend, it would be possible to use the CPU backend for generation, and the hexagon backend for batch processing. Repacking will make that difficult, unless it uses the same repacking format as the CPU backend. Would it be possible to use the same repacking as the CPU backend?

@max-krasnyansky
Collaborator Author

> You can try to reuse the ggml_graph_optimize from the Metal backend:
>
> // reorder the nodes in the graph to improve concurrency, while respecting fusion
> //
> // note: this implementation is generic and not specific to metal
> // if it proves to work well, we can start using it for other backends in the future
> void ggml_graph_optimize(struct ggml_cgraph * gf);
>
> It reorders the nodes to improve concurrency, which effectively results in stacking the matrix multiplications together.

@ggerganov quick update. I can definitely use graph_optimize to do the stacking of the matmuls. I was going to add some basic fusions anyway and that will require graph_optimize. So I went ahead and removed that commit for llama-model and llama-graph from this PR, everything is perfectly functional without it.
I'm working on the optimizer and will either include a simple version in this PR or in a follow up if it takes longer to implement/test.

@max-krasnyansky
Collaborator Author

max-krasnyansky commented Oct 14, 2025

@slaren

> > It'd be good to make extra_buffers_type a bit better exposed/integrated. I started thinking that device_get_buffer_type should probably just return a list of buffer types.
>
> It would be preferable to avoid using extra buffer types unless absolutely necessary, because using them adds a significant amount of complexity to the user code. Repacking the weights does not need to be restricted to extra buffer types. If the backend can only work with repacked tensors, the logic can be baked into the backend's default buffer type.

That's what I started with (i.e. no REPACK buffers), but that makes it tricky to do partial offloads and introduces lots of copies because the scheduler thinks the buffers are not shareable/reusable.
All FP32 and FP16 (once we add them) Ops can/do share the buffers right now. The repack is needed only for the quantized tensors (i.e. very similar to the CPU backend in that sense).

> There was some discussion about the limits of the Qualcomm NPU memory bandwidth, can you confirm that's the case?

In the previous SoC generations (Gen3, Gen4, X-Elite) the CPU does technically have more memory BW (a lot more in the X-Elite). In Gen5 and X2-Elite the BW is about the same.
For Gen3/4 it might still make sense to offload to the NPU for power saving and/or simply to free up the CPU for other tasks.
Also, the NPU has a really nice DMA engine (one per HW thread) that can bring in the chunks, optionally bypassing the L2, and do some transformations (alignment, etc.; we're using that in MUL_MAT ops). So it's more flexible/efficient than the CPU even if the raw BW is lower.

> If the weights were stored in a buffer type that's compatible with the CPU backend, it would be possible to use the CPU backend for generation, and the hexagon backend for batch processing. Repacking will make that difficult, unless it uses the same repacking format as the CPU backend. Would it be possible to use the same repacking as the CPU backend?

Functionally, it's possible to use Q4_0, MXFP4, etc. as is with HVX. Unfortunately, it's about the worst layout for it :)
i.e. nothing is aligned, so it's very expensive to load and unpack.
The repack I implemented right now is very HVX-friendly. Basically, it stores all the row quants first, followed by all the row block scales. The quants are repacked into 32x4x2 blocks (256 elements -> 2x HVX vectors). The DMA is used to properly align each row to 128 bytes as we bring the chunks into the VTCM, and then we just do nicely aligned loads of 128 bytes and expand them into 256 INT8 elements. This might have to change as we add more optimizations.
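
The "quants first, scales after" idea can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard Q4_0 block layout (a 2-byte fp16 scale followed by 16 bytes of packed 4-bit quants) and it does not reproduce the 32x4x2 interleaving of the quants or the 128-byte row alignment. `repack_row_q4_0` and the constants are hypothetical names, not the backend's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t Q4_0_ELEMS       = 32;              // elements per block
constexpr std::size_t Q4_0_QUANT_BYTES = Q4_0_ELEMS / 2;  // 16 bytes of 4-bit quants
constexpr std::size_t Q4_0_SCALE_BYTES = 2;               // one fp16 scale
constexpr std::size_t Q4_0_BLOCK_BYTES = Q4_0_SCALE_BYTES + Q4_0_QUANT_BYTES;

// Repack one row of Q4_0 blocks (scale, quants, scale, quants, ...) into
// [all quants of the row][all scales of the row].
inline std::vector<std::uint8_t> repack_row_q4_0(const std::uint8_t * row, std::size_t n_blocks) {
    assert(row != nullptr || n_blocks == 0);
    std::vector<std::uint8_t> out(n_blocks * Q4_0_BLOCK_BYTES);
    std::uint8_t * quants = out.data();
    std::uint8_t * scales = out.data() + n_blocks * Q4_0_QUANT_BYTES;
    for (std::size_t b = 0; b < n_blocks; ++b) {
        const std::uint8_t * blk = row + b * Q4_0_BLOCK_BYTES;
        // the scale comes first in the source block, the quants follow it
        std::memcpy(scales + b * Q4_0_SCALE_BYTES, blk, Q4_0_SCALE_BYTES);
        std::memcpy(quants + b * Q4_0_QUANT_BYTES, blk + Q4_0_SCALE_BYTES, Q4_0_QUANT_BYTES);
    }
    return out;
}
```

The output is the same size as the input row; only the order changes, so the quant region becomes one contiguous run with no scales in between.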

It might be possible to share the same format if we were to redo ARM64 CPU_REPACK to use something like the above.
Basically, we need the scales to be separate and not mixed in because they mess up the alignment.
I was planning on spending some time to play around with the ARM64 CPU_REPACK to see if we can make something common, but right now even the CPU backend itself is using different repack layouts depending on the available instructions. So it's difficult even there.
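
A small arithmetic sketch of why interleaved scales hurt alignment, assuming the standard 18-byte Q4_0 block (2-byte fp16 scale + 16 bytes of quants) and 128-byte vector loads. All names here are hypothetical helpers for illustration only.

```cpp
#include <cstddef>

constexpr std::size_t HVX_VECTOR_BYTES = 128;

// Byte offset of block b when scales are interleaved (18-byte Q4_0 blocks).
constexpr std::size_t interleaved_offset(std::size_t b) { return b * 18; }

// Byte offset of block b's quants when scales are stored separately
// (16 bytes of quants per block).
constexpr std::size_t separated_offset(std::size_t b) { return b * 16; }

constexpr bool is_vector_aligned(std::size_t offset) {
    return offset % HVX_VECTOR_BYTES == 0;
}

// Round a row size up to the next multiple of 128 bytes (padded row stride).
constexpr std::size_t aligned_row_stride(std::size_t row_bytes) {
    return (row_bytes + HVX_VECTOR_BYTES - 1) / HVX_VECTOR_BYTES * HVX_VECTOR_BYTES;
}
```

With separated scales, every 8th block of quants starts on a 128-byte boundary (8 x 16 = 128). With interleaved 18-byte blocks, that only happens every 64th block (64 x 18 = 1152 = 9 x 128).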

It'd be absolutely awesome to have a common optimized format for CPU/GPU/NPU, ideally on-disk so that we mmap and don't repack at all like the original GGML, but it's very tricky in practice.

@github-actions github-actions bot added the devops improvements to build systems and github actions label Oct 14, 2025
@max-krasnyansky
Collaborator Author

@slaren @bandoti @CISC
Basic CI is in. I added android-ndk-build that builds vanilla ARM64 CPU and Snapdragon (CPU/GPU/NPU) flavors.
Those builds can be just pushed via ADB to the devices and run.
I'm going to hook it up to the Qualcomm Device Cloud so that we can run jobs on actual Snapdragon-based devices, but that will need some more work. I'm going to prototype in a separate repo first.

btw That build.yml is getting long. I kind of like how you guys did linux-cross builds in the separate yaml, it's collapsible in the UI, etc.
Should we do the same for android, ubuntu, windows, etc?

@CISC
Collaborator

CISC commented Oct 15, 2025

> btw That build.yml is getting long. I kind of like how you guys did linux-cross builds in the separate yaml, it's collapsible in the UI, etc. Should we do the same for android, ubuntu, windows, etc?

It is at least ripe for some refactoring.

@max-krasnyansky
Collaborator Author

max-krasnyansky commented Oct 16, 2025

> You can try to reuse the ggml_graph_optimize from the Metal backend:

@ggerganov @jeffbolznv
I added a simple optimizer to stack up the MUL_MATs with the same input and it's working great.
It ended up being pretty simple, especially with the reuse from the Metal backend. Thanks for the suggestions!
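
For readers unfamiliar with this kind of reordering, here is a minimal sketch of stacking MUL_MATs that share the same input while respecting dependencies. It is not the code from this PR and does not use ggml types: `GNode`, `depends_on_range` and `stack_mul_mats` are hypothetical, tensors are matched by name, and the input order is assumed to be a valid topological order.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified node: output name, op, and two input names.
struct GNode {
    std::string name;  // output tensor name
    std::string op;
    std::string src0;  // e.g. weights
    std::string src1;  // e.g. activations
};

// True if `n` reads the output of any node in nodes[first, last).
inline bool depends_on_range(const std::vector<GNode> & nodes,
                             std::size_t first, std::size_t last, const GNode & n) {
    for (std::size_t k = first; k < last; ++k) {
        if (nodes[k].name == n.src0 || nodes[k].name == n.src1) return true;
    }
    return false;
}

// Pull later MUL_MATs with the same src1 up behind the first one, but only
// when none of the nodes being hopped over produces one of their inputs.
inline std::vector<GNode> stack_mul_mats(std::vector<GNode> nodes) {
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (nodes[i].op != "MUL_MAT") continue;
        std::size_t insert_at = i + 1;
        for (std::size_t j = i + 1; j < nodes.size(); ++j) {
            if (nodes[j].op != "MUL_MAT" || nodes[j].src1 != nodes[i].src1) continue;
            if (depends_on_range(nodes, insert_at, j, nodes[j])) continue;
            // move nodes[j] to insert_at, shifting the nodes in between down by one
            auto first  = nodes.begin() + static_cast<std::ptrdiff_t>(insert_at);
            auto middle = nodes.begin() + static_cast<std::ptrdiff_t>(j);
            std::rotate(first, middle, middle + 1);
            ++insert_at;
        }
    }
    return nodes;
}
```

The graph is taken by value and a reordered copy is returned, so the caller's node list is left untouched.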

Quick question for you guys.
Is the optimizer allowed to modify the nodes themselves? i.e. would the mods be sticky and visible in graph_compute?
During the compute I need to know a couple of things: 1) whether src1 can be reused, and 2) what the last compute op in the graph is.
I'm doing both dynamically but the optimizer could just add some backend specific flags to the nodes since it already knows all that.
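
One way to get both pieces of information without touching the nodes is to compute them into a side table at the start of graph_compute. The sketch below is hypothetical and simplified: `CNode`, `GraphInfo`, `is_noop` and `analyze` are made-up names, ops are plain strings, and "reusable" is assumed to mean "the previous compute op was a MUL_MAT with the same src1".

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified node; the real graph nodes are never modified.
struct CNode {
    std::string op;    // e.g. "MUL_MAT", "ROPE", "RESHAPE"
    std::string src1;
};

struct GraphInfo {
    std::vector<bool> src1_reusable;  // per node: src1 matches the previous MUL_MAT's src1
    std::ptrdiff_t    last_compute;   // index of the last node doing real work, or -1
};

// Assumption: these ops only change metadata and launch no compute.
inline bool is_noop(const std::string & op) {
    return op == "NONE" || op == "RESHAPE" || op == "VIEW" ||
           op == "PERMUTE" || op == "TRANSPOSE";
}

// Scan the (const) node list once and record the per-node flags separately.
inline GraphInfo analyze(const std::vector<CNode> & nodes) {
    GraphInfo info{std::vector<bool>(nodes.size(), false), -1};
    const CNode * prev = nullptr;  // previous compute node, if any
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        const CNode & n = nodes[i];
        if (is_noop(n.op)) continue;
        info.src1_reusable[i] = prev != nullptr && prev->op == "MUL_MAT" &&
                                n.op == "MUL_MAT" && prev->src1 == n.src1;
        info.last_compute = static_cast<std::ptrdiff_t>(i);
        prev = &n;
    }
    return info;
}
```

The cost is one extra linear pass per graph_compute call, which is the "dynamic" approach described above rather than flags stored on the nodes.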

@jeffbolznv
Collaborator

IMO the nodes should be treated as const.

@max-krasnyansky
Collaborator Author

> IMO the nodes should be treated as const.

Ok. Looks like Georgi agrees as well :)
I'll keep that logic in graph_compute.

@max-krasnyansky max-krasnyansky marked this pull request as ready for review October 17, 2025 06:44
@max-krasnyansky
Collaborator Author

@slaren Let me know if you had any more thoughts on the REPACK buffers.
Currently the main features are fully functional with that change in llama-model to allow for extra_bufs for all devs.
Everything works as expected with CPU/OpenCL/Hexagon and I checked my CPU/Metal and CPU/CUDA setups and don't see any obvious issues.
One thing that I'd like to add soon-ish is the ability to override the tensors with HTP-REPACK buffers. I'm thinking for now I could just add a simple change (similar to llama-model) to allow for extra_bufs in that path.
Basically, these changes are just allowing any backend to have extra_bufs and not just the CPU.

We could revisit things and come up with a better solution in a follow-up. I keep thinking it'd be nice if we could just keep everything as a host buffer and somehow enforce set/get_tensor for specific tensors. It's sort of what the non-host buffers do, but non-host is kind of overkill in the Hexagon case, i.e. it ends up being a separate memory mapping, etc. just to force the set/get_tensor and REPACK.


I made good progress on enabling Qualcomm Device Cloud (aka QDC) where we can run CI jobs on physical devices.
It does need a bit more work and setup. I'll reach out to you guys via email to set up some hidden CI vars using a private API_TOKEN.

@github-actions github-actions bot added the python python script changes label Oct 17, 2025
@max-krasnyansky max-krasnyansky force-pushed the hexagon branch 2 times, most recently from beb50c8 to 55ef9c8 Compare October 19, 2025 05:41
@max-krasnyansky
Collaborator Author

Hmm. Not sure why arm64 ubuntu builds are failing with missing GGML_F32_STEP.
Seems unrelated but digging in just in case ...

@slaren
Member

slaren commented Oct 22, 2025

Don't worry about the build failing, it is unrelated.

@max-krasnyansky max-krasnyansky merged commit 63d2fc4 into ggml-org:master Oct 22, 2025
75 checks passed
FMayran pushed a commit to FMayran/llama.cpp that referenced this pull request Oct 23, 2025
…6547)

* model: add support for extra bufs for all devices

* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU

This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from llama.cpp/ggml developer and user community.

Co-Authored-By: Rajdeep Ganguly <[email protected]>
Co-Authored-By: Todor Boinovski <[email protected]>

* hexagon: fix format checker errors

* hexagon: update readme and cmake presets

* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions

* hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input

* hexagon: move ADB helper scripts into scripts/snapdragon/adb

* hexagon: replace all f/printfs with GGML_LOG_...

* readme: add hexagon to the list supported backends

* hexagon: stack malmuts with quantized inputs only

* hexagon: add TODO for fixing issues in hexagon_graph_optimize

* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC

* scripts: fix lint errors

* scripts: update qdc pytest script to make linter happy

* hexagon: add reduce sum in fp32

* hexagon: reduce number of vector stores in matmul output

* hexagon: remove the need for vdelta in reduce-multiply-x8

* hexagon: consistent use of reduce_sum_fp32 for row_sums

* hexagon: some more matmul optimizations and comments

Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models).
We've handled those cases already but at a higher overhead.

* hexagon: update cmake presets

* hexagon: add OPMASK support for run-bench.sh wrapper

* hexagon: update to use GGML_BACKEND_API

* hexagon: remove unused logic for setting tensor flags for the views

* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors

Same asserts as the CPU backend.

* hexagon: use cpy_tensor slow path for non-host buffers

* hexagon: error checks in the buffer allocator

* cmake: move include(extProj) under ggml-hexagon

* hexagon: don't forget to delete the backend on free

* hexagon: set/get_tensor size assert apply only to quantized tensors

* hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now

GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way.
Ideally we need a bit more finer log levels.

* docs: typos in hexagon developer docs (libggm-...)

* hexagon: overhaul error handling in the session/device allocation

this should handle all failure paths in the session allocation.

* hexagon: update cmake presets to enable fp16 vectors

* hexagon: remove unused time_usec function

* hexagon: don't forget to release buffer contexts

* hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure)

* hexagon: remove custom can_repeat function and use ggml_can_repeat

---------

Co-authored-by: Rajdeep Ganguly <[email protected]>
Co-authored-by: Todor Boinovski <[email protected]>
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025