Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel #21916

Merged
ggerganov merged 2 commits into ggml-org:master from MonakaResearch:q8_0_gemm_kernel_sve_tuning
Apr 29, 2026
Conversation

@hrushitfujitsu

Contributor

Overview

This PR adds an SVE (Scalable Vector Extension) kernel for the q8_0_q8_0 GEMM, using i8mm and vector instructions. Arm NEON support for this kernel was added earlier.

Additional information

This PR contains the SVE implementation of the GEMM used for Q8_0-quantized tensors.
Kernel: ggml_gemm_q8_0_4x8_q8_0()
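
For readers unfamiliar with Q8_0, the following is a minimal scalar sketch of the per-block arithmetic that a Q8_0 x Q8_0 GEMM performs for each output element. It is not the code from this PR: `BlockQ8`, `quantize_row`, and `dot_q8` are hypothetical names, the scale is stored as a `float` rather than fp16 to keep the example self-contained, and the 4x8 interleaved layout and the SVE/i8mm instructions are left out entirely.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified Q8_0-style block: 32 int8 quants sharing one scale.
// Assumption: the scale is a float here; ggml stores it as fp16.
constexpr int QK = 32;
struct BlockQ8 {
    float  d;       // per-block scale
    int8_t qs[QK];  // quantized values
};

// Quantize n floats (n must be a multiple of QK) into blocks.
std::vector<BlockQ8> quantize_row(const float * x, int n) {
    std::vector<BlockQ8> out(n / QK);
    for (int b = 0; b < n / QK; ++b) {
        float amax = 0.0f;  // largest magnitude in this block
        for (int i = 0; i < QK; ++i) {
            amax = std::fmax(amax, std::fabs(x[b * QK + i]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (int i = 0; i < QK; ++i) {
            out[b].qs[i] = (int8_t) std::lround(x[b * QK + i] * id);
        }
    }
    return out;
}

// Dot product of two quantized rows: integer accumulation inside each block,
// then one floating-point multiply by the two block scales.
float dot_q8(const std::vector<BlockQ8> & a, const std::vector<BlockQ8> & b) {
    float sum = 0.0f;
    for (size_t k = 0; k < a.size(); ++k) {
        int32_t isum = 0;
        for (int i = 0; i < QK; ++i) {
            isum += (int32_t) a[k].qs[i] * (int32_t) b[k].qs[i];
        }
        sum += a[k].d * b[k].d * (float) isum;
    }
    return sum;
}
```

The inner int8 multiply-accumulate loop is the part that integer matrix-multiply instructions are designed to speed up; the scale multiplication stays in floating point.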

By running a Q8_0 quantized model of Llama-3.1-8B, I checked the generation output.
I also verified that the perplexity matches between the NEON and SVE implementations.

| NEON (OSS) | SVE (This PR) |
|---|---|
| 12.9252 +/- 0.78720 | 12.9252 +/- 0.78720 |

This change does not appear to have any impact on accuracy.
The command used to measure perplexity is:

./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw --chunks 10

Performance Check

This PR improves prompt evaluation throughput (and therefore TTFT) of LLM inference by roughly 20% at up to 32 threads, and by about 10% at 64 threads, compared with the original NEON version.

Performance was measured on a 64-core Graviton3E.
The results are shown below; values are in tokens/second.

| Threads | NEON (Original) | SVE (This PR) | Speedup |
|---|---|---|---|
| 4 | 39.50 | 48.72 | 1.23 |
| 8 | 79.59 | 96.21 | 1.20 |
| 16 | 157 | 189.44 | 1.20 |
| 32 | 298.46 | 361.51 | 1.21 |
| 64 | 525.12 | 580.29 | 1.10 |
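
The speedup column is the ratio of the two throughputs. As a rough sketch, it can be recomputed from the table above as follows; `BenchRow`, `speedup`, `pr_table`, and `print_speedups` are hypothetical helper names, and the recomputed values may differ from the reported column in the last digit depending on whether the ratio is rounded or truncated.

```cpp
#include <cstdio>
#include <vector>

// One row of the benchmark table: thread count and tokens/second per build.
struct BenchRow {
    int    threads;
    double neon_tps;
    double sve_tps;
};

// Speedup is the new throughput divided by the baseline throughput.
double speedup(double base_tps, double new_tps) {
    return new_tps / base_tps;
}

// Values copied from the table in the PR description.
std::vector<BenchRow> pr_table() {
    return {
        { 4,  39.50,  48.72},
        { 8,  79.59,  96.21},
        {16, 157.00, 189.44},
        {32, 298.46, 361.51},
        {64, 525.12, 580.29},
    };
}

// Print one recomputed speedup per row, to two decimal places.
void print_speedups() {
    for (const BenchRow & r : pr_table()) {
        std::printf("%2d threads: %.2fx\n", r.threads, speedup(r.neon_tps, r.sve_tps));
    }
}
```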

The command used to measure performance is:
llama-bench --model ${PATH_TO_MODEL} -n 128 -p 128 -t 4,8,16,32,64

Requirements

@github-actions github-actions bot added the ggml label Apr 14, 2026
@hrushitfujitsu

hrushitfujitsu commented Apr 21, 2026

Contributor Author

Hi @ggerganov and @Alcpz, could you please review this PR?

@Alcpz Alcpz left a comment

Collaborator


Nice job!
I'm assuming that you tested performance as well with Llama-3.1-8B-Q8_0? Asking as your command has a variable, but it's a guess from the perplexity.

I've tested on Graviton3, Neoverse-V1, SVE-256 + i8mm.

Checked the repack vs non-repack GEMM output (the error is within the test-backend-ops NMSE threshold)

Checked that Perplexity matches with REPACK=ON before and after, identical as claimed

  • LFM2-1.2B-Q8_0: PPL 16.5014 ± 0.56459
  • Qwen3.5-2B-Q8_0: PPL 10.7416 ± 0.33691

Performance (both REPACK=ON, 16 threads, llama-bench -p 512 -n 128 -r 5):

| Model | Base (e21cdc1) pp512 t/s | PR (e7d80f7) pp512 t/s | Speedup |
|---|---|---|---|
| LFM2-1.2B Q8_0 | 813.77 ± 3.31 | 902.09 ± 0.71 | 1.11 |
| Qwen3.5-2B Q8_0 | 502.09 ± 1.28 | 544.68 ± 0.69 | 1.08 |

I don't get as much performance, but it's most likely the hardware difference or the difference in GEMM dimensions from smaller models.

LGTM.

Comment thread ggml/src/ggml-cpu/arch/arm/repack.cpp Outdated
@hrushitfujitsu

Contributor Author

Yes, you're right, the testing was done with Llama-3.1-8B-Q8_0. Thank you for the review :)

@hrushitfujitsu

hrushitfujitsu commented Apr 28, 2026

Contributor Author

Hi @ggerganov, could you please help with this PR?

@ggerganov ggerganov added the merge ready label Apr 28, 2026
@ggerganov ggerganov merged commit bdc9c74 into ggml-org:master Apr 29, 2026
48 of 49 checks passed
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
…1916)

* Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel

* Change arrays to static const in repack.cpp

---------

Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>