Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel by hrushitfujitsu · Pull Request #21916 · ggml-org/llama.cpp

hrushitfujitsu · 2026-04-14T17:23:07Z

Overview

This PR adds an SVE (Scalable Vector Extension) kernel for the q8_0_q8_0 GEMM, using i8mm and SVE vector instructions. ARM NEON support for this kernel was added earlier.
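For readers unfamiliar with i8mm, here is a minimal scalar sketch of what a single signed 8-bit matrix multiply-accumulate step (SMMLA) computes: two 2x8 int8 operands produce a 2x2 int32 tile that is added to the accumulator. The function name `smmla_ref` and the use of `std::array` are illustrative assumptions; the PR's actual instruction sequence is not shown in this description.

```cpp
#include <array>
#include <cstdint>

// Scalar model of one i8mm SMMLA step (hypothetical helper, not code from the PR).
// `a` and `b` are each read as a row-major 2x8 int8 matrix; `acc` is a row-major
// 2x2 int32 tile. The result is acc + a * transpose(b).
std::array<int32_t, 4> smmla_ref(std::array<int32_t, 4> acc,
                                 const std::array<int8_t, 16>& a,
                                 const std::array<int8_t, 16>& b) {
    for (int i = 0; i < 2; ++i) {          // row of a
        for (int j = 0; j < 2; ++j) {      // row of b (column of the result)
            int32_t sum = 0;
            for (int k = 0; k < 8; ++k) {  // 8-element int8 dot product
                sum += int32_t(a[8 * i + k]) * int32_t(b[8 * j + k]);
            }
            acc[2 * i + j] += sum;
        }
    }
    return acc;
}
```

Each step therefore consumes 8 int8 values per row, which is presumably why an 8-byte interleave (the "x8" in the kernel name) is a natural fit for this instruction.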

Additional information

This PR contains the SVE implementation of the GEMM used for Q8_0-quantized matrix multiplication.
Kernel: ggml_gemm_q8_0_4x8_q8_0()
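As background, the arithmetic such a kernel has to reproduce can be sketched with a plain scalar reference. A Q8_0 block in ggml holds 32 int8 values plus one scale; the sketch below stores the scale as `float` instead of fp16 for simplicity and ignores the 4x8 row interleaving entirely. `BlockQ8` and `dot_q8_0` are hypothetical names, not the PR's code.

```cpp
#include <cstddef>
#include <cstdint>

constexpr int kBlock = 32;  // values per Q8_0 block (QK8_0 in ggml)

// Simplified Q8_0 block (assumption: float scale; ggml stores fp16).
struct BlockQ8 {
    float  d;           // per-block scale
    int8_t qs[kBlock];  // quantized values
};

// Dot product of two Q8_0 rows, each made of `nb` blocks.
// Per block: integer dot product of the int8 values, then scaled by both scales.
float dot_q8_0(const BlockQ8* a, const BlockQ8* b, std::size_t nb) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < nb; ++i) {
        int32_t isum = 0;
        for (int j = 0; j < kBlock; ++j) {
            isum += int32_t(a[i].qs[j]) * int32_t(b[i].qs[j]);
        }
        acc += a[i].d * b[i].d * float(isum);
    }
    return acc;
}
```

A GEMM is this dot product evaluated for every (row, column) pair; the optimized kernels differ in how they lay out and vectorize the inner integer sums, not in the result they are meant to produce.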

I checked the generation output by running a Q8_0-quantized Llama-3.1-8B model.
I also verified that the perplexity matches between the NEON and SVE implementations.

Neon(OSS)	SVE(This PR)
12.9252 +/- 0.78720	12.9252 +/- 0.78720

This change does not appear to have any impact on accuracy.
The command used to measure perplexity is:

./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw --chunks 10

Performance Check

This PR improves prompt eval time (TTFT) for LLM inference by ~20% compared to the original NEON version.

Performance was measured on a 64-core Graviton3E.
Results are as follows; values are in tokens/second.

Threads	NEON (Original)	SVE (This PR)	Speedup
4	39.50	48.72	1.23
8	79.59	96.21	1.20
16	157	189.44	1.20
32	298.46	361.51	1.21
64	525.12	580.29	1.10

The command used to measure performance is:
llama-bench --model ${PATH_TO_MODEL} -n 128 -p 128 -t 4,8,16,32,64

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: None

hrushitfujitsu · 2026-04-21T05:40:32Z

Hi @ggerganov and @Alcpz, could you please help review this PR?

Alcpz

Nice job!
I'm assuming you also tested performance with Llama-3.1-8B-Q8_0? I'm asking because your benchmark command uses a variable for the model path, so this is a guess based on the perplexity section.

I've tested on Graviton3, Neoverse-V1, SVE-256 + i8mm.

Compared repack vs. non-repack GEMM output (the error is within the test-backend-ops NMSE threshold)

Checked that perplexity matches with REPACK=ON before and after this PR; identical, as claimed

LFM2-1.2B-Q8_0: PPL 16.5014 ± 0.56459
Qwen3.5-2B-Q8_0: PPL 10.7416 ± 0.33691

Performance (both REPACK=ON, 16 threads, llama-bench -p 512 -n 128 -r 5):

Model	Base (`e21cdc1`) pp512 t/s	PR (`e7d80f7`) pp512 t/s	Speedup
LFM2-1.2B Q8_0	813.77 ± 3.31	902.09 ± 0.71	1.11
Qwen3.5-2B Q8_0	502.09 ± 1.28	544.68 ± 0.69	1.08

I don't see as large a speedup, but that is most likely due to the hardware difference or the smaller GEMM dimensions of these smaller models.

LGTM.

hrushitfujitsu · 2026-04-22T05:10:39Z

Yes, you're right, the testing was done with Llama-3.1-8B-Q8_0. Thank you for the review :)

hrushitfujitsu · 2026-04-28T06:42:10Z

Hi @ggerganov, could you please help with this PR?

…1916) * Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel * Change arrays to static const in repack.cpp --------- Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>

…1916) * Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel * Change arrays to static const in repack.cpp --------- Co-authored-by: Vithulep <prashant.vithule@fujitsu.com> (cherry picked from commit bdc9c74)

Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel

e7d80f7

hrushitfujitsu requested a review from ggerganov as a code owner April 14, 2026 17:23

github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Apr 14, 2026

Alcpz approved these changes Apr 21, 2026

Comment thread ggml/src/ggml-cpu/arch/arm/repack.cpp Outdated

Change arrays to static const in repack.cpp

c337d29

ggerganov added the `merge ready` label (a maintainer can use this label to indicate that they consider the changes final and ready to merge) on Apr 28, 2026

ggerganov merged commit bdc9c74 into ggml-org:master Apr 29, 2026
48 of 49 checks passed

ggerganov merged 2 commits into ggml-org:master from MonakaResearch:q8_0_gemm_kernel_sve_tuning