hexagon: dma optimizations (mostly fixing regressions)#21137

Merged: max-krasnyansky merged 3 commits into ggml-org:master from qualcomm:hexagon-dma-opts on Mar 29, 2026
@max-krasnyansky
Member

Overview

I missed a significant perf regression when I did the last big DMA update.
I flipped the in-order bit in the DMA descriptors, and it turns out that causes a 3-5 TPS drop, especially for token gen.
We don't really need true in-order processing by the HW anyway, as our pipelines are set up such that we explicitly wait for specific descriptors to complete (i.e., enforcing the ordering that the kernels expect when they do dma_push/pop).
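
To illustrate the point about explicit waits, here is a minimal sketch in plain C. It is a simulation, not the Hexagon backend code: the `sim_dma_*` names, the descriptor layout, and the queue depth are all hypothetical. The simulated engine deliberately completes transfers newest-first, and the pop still hands buffers back in push order because it waits on one specific descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SIM_QUEUE_DEPTH 4u /* hypothetical depth, power of two */

/* Hypothetical descriptor for this simulation, not the real Hexagon layout. */
typedef struct {
    const void *src;
    void       *dst;
    size_t      size;
    bool        done;     /* set by the simulated engine on completion */
    unsigned    done_seq; /* completion order, recorded for the demo only */
} sim_dma_desc;

typedef struct {
    sim_dma_desc desc[SIM_QUEUE_DEPTH];
    unsigned     push_idx;    /* total descriptors pushed */
    unsigned     pop_idx;     /* total descriptors popped */
    unsigned     completions; /* total descriptors completed */
} sim_dma_queue;

/* Queue one transfer. Returns false when the queue is full. */
bool sim_dma_push(sim_dma_queue *q, void *dst, const void *src, size_t size) {
    assert(q != NULL);
    if (q->push_idx - q->pop_idx == SIM_QUEUE_DEPTH) return false;
    sim_dma_desc *d = &q->desc[q->push_idx % SIM_QUEUE_DEPTH];
    d->src = src; d->dst = dst; d->size = size;
    d->done = false; d->done_seq = 0;
    q->push_idx++;
    return true;
}

/* Simulated engine: completes outstanding descriptors newest-first,
 * standing in for hardware that is free to finish out of order. */
void sim_dma_engine_run(sim_dma_queue *q) {
    for (unsigned i = q->push_idx; i != q->pop_idx; i--) {
        sim_dma_desc *d = &q->desc[(i - 1) % SIM_QUEUE_DEPTH];
        if (d->done) continue;
        memcpy(d->dst, d->src, d->size);
        d->done = true;
        d->done_seq = ++q->completions;
    }
}

/* Return the oldest pushed buffer, waiting on that specific descriptor.
 * Returns NULL when nothing is outstanding. */
void *sim_dma_pop(sim_dma_queue *q) {
    assert(q != NULL);
    if (q->pop_idx == q->push_idx) return NULL;
    sim_dma_desc *d = &q->desc[q->pop_idx % SIM_QUEUE_DEPTH];
    while (!d->done) sim_dma_engine_run(q); /* real code would poll the descriptor */
    q->pop_idx++;
    return d->dst;
}

/* Push A then B; B completes first, yet pop returns A and then B. */
bool sim_dma_demo(void) {
    sim_dma_queue q = {0};
    const char a[4] = "AAA", b[4] = "BBB";
    char da[4] = {0}, db[4] = {0};
    sim_dma_push(&q, da, a, sizeof a);
    sim_dma_push(&q, db, b, sizeof b);
    void *first  = sim_dma_pop(&q);
    void *second = sim_dma_pop(&q);
    bool b_completed_first = q.desc[1].done_seq < q.desc[0].done_seq;
    return first == da && second == db && b_completed_first
        && strcmp(da, "AAA") == 0 && strcmp(db, "BBB") == 0;
}
```

In this sketch the consumer never depends on completion order, only on the descriptor it is about to consume being done, which is the property the description relies on.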

This PR also adds a neat little DMA cache that can be used in kernels that may need to re-fetch the data.
This is now used in the FA kernel for the Mask.
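
As a rough picture of what such a cache could look like, here is a small sketch in plain C. It is not the code from this PR: the `sim_dma_cache` type, the slot count, the row size, and the round-robin eviction are assumptions made for illustration, and `memcpy` stands in for the actual DMA transfer. It also assumes the source rows do not change while cached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_CACHE_SLOTS 4u  /* hypothetical number of cached rows */
#define SIM_ROW_BYTES   64u /* hypothetical maximum row size */

typedef struct {
    const void *tag[SIM_CACHE_SLOTS];                 /* source address per slot, NULL if empty */
    uint8_t     data[SIM_CACHE_SLOTS][SIM_ROW_BYTES]; /* local copies of the rows */
    unsigned    next;                                 /* round-robin eviction cursor */
    unsigned    fetches;                              /* simulated DMA transfers issued */
} sim_dma_cache;

void sim_dma_cache_init(sim_dma_cache *c) {
    assert(c != NULL);
    for (unsigned i = 0; i < SIM_CACHE_SLOTS; i++) c->tag[i] = NULL;
    c->next = 0;
    c->fetches = 0;
}

/* Return a local copy of the row at src, fetching it only on a miss. */
const uint8_t *sim_dma_cache_get(sim_dma_cache *c, const void *src, size_t size) {
    assert(c != NULL && src != NULL && size <= SIM_ROW_BYTES);
    for (unsigned i = 0; i < SIM_CACHE_SLOTS; i++) {
        if (c->tag[i] == src) return c->data[i]; /* hit: no transfer needed */
    }
    unsigned slot = c->next;
    c->next = (c->next + 1) % SIM_CACHE_SLOTS;
    memcpy(c->data[slot], src, size); /* miss: stands in for the DMA fetch */
    c->tag[slot] = src;
    c->fetches++;
    return c->data[slot];
}

/* Request two mask rows alternately, eight times in total.
 * Returns the number of fetches actually issued. */
unsigned sim_dma_cache_demo(void) {
    static const uint8_t mask[2][SIM_ROW_BYTES] = {{1}, {2}};
    sim_dma_cache c;
    sim_dma_cache_init(&c);
    for (int i = 0; i < 8; i++) {
        const uint8_t *row = sim_dma_cache_get(&c, mask[i % 2], SIM_ROW_BYTES);
        assert(row[0] == (uint8_t) (i % 2 + 1));
    }
    return c.fetches;
}
```

With eight requests over two distinct rows, the demo issues two fetches instead of eight; a linear tag search is enough here because the slot count is tiny.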

Before/after numbers on Gen3, Gen4, and Gen5
M=../gguf/Llama-3.2-3B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-completion.sh -f ../surfing.txt -st -n 64

- Gen5
Before:
 prompt eval time =     992.17 ms /   205 tokens (    4.84 ms per token,   206.62 tokens per second)
        eval time =    2654.97 ms /    63 runs   (   42.14 ms per token,    23.73 tokens per second)
After:
 prompt eval time =     979.93 ms /   205 tokens (    4.78 ms per token,   209.20 tokens per second)
        eval time =    2490.30 ms /    63 runs   (   39.53 ms per token,    25.30 tokens per second)

- Gen4 (S25+)
Before:
 prompt eval time =    1269.23 ms /   205 tokens (    6.19 ms per token,   161.52 tokens per second)
        eval time =    3049.34 ms /    63 runs   (   48.40 ms per token,    20.66 tokens per second)
After:
 prompt eval time =    1264.30 ms /   205 tokens (    6.17 ms per token,   162.14 tokens per second)
        eval time =    2723.60 ms /    63 runs   (   43.23 ms per token,    23.13 tokens per second)

- Gen3 (S24U)
Before:
 prompt eval time =    1379.95 ms /   205 tokens (    6.73 ms per token,   148.56 tokens per second)
        eval time =    3495.07 ms /    63 runs   (   55.48 ms per token,    18.03 tokens per second)
After:
 prompt eval time =    1390.60 ms /   205 tokens (    6.72 ms per token,   149.01 tokens per second)
        eval time =    2884.50 ms /    63 runs   (   45.79 ms per token,    21.84 tokens per second)
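
The token-gen gains can be worked out directly from the eval-time lines above. The sketch below assumes the first block for each device is the "before" run and the second is the "after" run; the `tg_result` type and helper names are made up for this example.

```c
#include <stdio.h>

/* Token-gen throughput for one device, in tokens per second. */
typedef struct {
    const char *name;
    double      before_tps;
    double      after_tps;
} tg_result;

/* Absolute gain in tokens per second. */
double tps_gain(tg_result r) { return r.after_tps - r.before_tps; }

/* Relative gain in percent of the "before" throughput. */
double tps_gain_pct(tg_result r) { return 100.0 * tps_gain(r) / r.before_tps; }

/* Print the gain for each device, using the eval-time numbers listed above. */
void print_tg_gains(void) {
    const tg_result results[] = {
        { "Gen5", 23.73, 25.30 },
        { "Gen4", 20.66, 23.13 },
        { "Gen3", 18.03, 21.84 },
    };
    for (size_t i = 0; i < sizeof results / sizeof results[0]; i++) {
        printf("%-5s %+.2f TPS (%+.1f%%)\n", results[i].name,
               tps_gain(results[i]), tps_gain_pct(results[i]));
    }
}
```

Prompt-eval throughput is nearly unchanged in all three runs, so the helper only looks at the token-gen (eval time) lines.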

Requirements

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that.
We don't rely on true in-order processing of the DMA descriptors anywhere.
It turns out this mode caused a significant regression of around 3-4 TPS during token gen.
@max-krasnyansky requested a review from a team as a code owner March 28, 2026 23:48
@github-actions bot added the ggml and Hexagon labels Mar 28, 2026
@max-krasnyansky merged commit f5d1c41 into ggml-org:master Mar 29, 2026
44 of 45 checks passed
slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 12, 2026
* hex-fa: add simple dma cache for Mask

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that.

* hex-dma: unset in-order desc bit which caused significant perf regression

We don't rely on true in-order processing of the DMA descriptors anywhere.
It turns out this mode caused a significant regression of around 3-4 TPS during token gen.

* hex-rope: update comment to clarify that we don't need in-order DMA completions
@max-krasnyansky max-krasnyansky deleted the hexagon-dma-opts branch April 15, 2026 01:20
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026