Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel #21916
Hi @ggerganov and @Alcpz, could you please help review this PR?
Nice job!
I assume you also tested performance with Llama-3.1-8B-Q8_0? I ask because your benchmark command uses a variable for the model, so this is a guess based on the perplexity section.
I've tested on Graviton3, Neoverse-V1, SVE-256 + i8mm.
Compared the repack and non-repack GEMM outputs (the error is within the test-backend-ops NMSE threshold).
Checked that perplexity with REPACK=ON is identical before and after the PR, as claimed:
- LFM2-1.2B-Q8_0: PPL 16.5014 ± 0.56459
- Qwen-2B-Q8_0: PPL 10.7416 ± 0.33691
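For readers unfamiliar with the NMSE check mentioned above, here is a minimal sketch of a normalized mean squared error comparison between two output buffers. It is an illustration only, not the code in test-backend-ops; the function names and the `threshold` parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Normalized mean squared error of `test` against `ref`:
//   sum((test - ref)^2) / sum(ref^2)
// Assumes `ref` is not all zeros.
double nmse(const std::vector<float>& ref, const std::vector<float>& test) {
    assert(ref.size() == test.size());
    double err  = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double diff = double(test[i]) - double(ref[i]);
        err  += diff * diff;
        norm += double(ref[i]) * double(ref[i]);
    }
    return err / norm;
}

// Hypothetical helper: true when the two outputs agree within `threshold`.
bool outputs_match(const std::vector<float>& ref,
                   const std::vector<float>& test,
                   double threshold) {
    return nmse(ref, test) <= threshold;
}
```

Here `ref` would hold the non-repack GEMM output and `test` the repack output; the threshold to use is whatever the test suite defines, which this sketch does not assume.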
Performance (both REPACK=ON, 16 threads, llama-bench -p 512 -n 128 -r 5):
| Model | Base (e21cdc1) pp512 t/s | PR (e7d80f7) pp512 t/s | Speedup |
|---|---|---|---|
| LFM2-1.2B Q8_0 | 813.77 ± 3.31 | 902.09 ± 0.71 | 1.11 |
| Qwen3.5-2B Q8_0 | 502.09 ± 1.28 | 544.68 ± 0.69 | 1.08 |
I don't see as large a speedup, but that is most likely due to the hardware difference or to the different GEMM dimensions of these smaller models.
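The Speedup column in the table above is the ratio of PR throughput to base throughput. A minimal sketch of that arithmetic, with a hypothetical function name:

```cpp
#include <cassert>

// Speedup of the PR build over the base build,
// given prompt-processing throughput in tokens/second.
double speedup(double base_tps, double pr_tps) {
    assert(base_tps > 0.0);
    return pr_tps / base_tps;
}
```

For example, `speedup(813.77, 902.09)` gives roughly 1.11 for the LFM2-1.2B row, and `speedup(502.09, 544.68)` gives roughly 1.08 for the Qwen3.5-2B row.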
LGTM.
Yes, you're right, the testing was done for |
Hi @ggerganov, could you please help with this PR?
…1916)
* Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel
* Change arrays to static const in repack.cpp
---------
Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>
Overview
This PR adds SVE (Scalable Vector Extension) kernels for the q8_0_q8_0 GEMM, using i8mm and vector instructions. Arm NEON support for this kernel was added earlier.
Additional information
This PR contains the SVE implementation of the GEMM used with Q8_0 quantization.
Kernel: ggml_gemm_q8_0_4x8_q8_0()
I checked the generation output by running a Q8_0-quantized Llama-3.1-8B model.
I also verified that the perplexity matches between the NEON and SVE implementations.
This change does not appear to have any impact on accuracy.
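To make concrete what a Q8_0 × Q8_0 GEMM computes, here is a scalar reference sketch. It is not the code in this PR: it ignores the repacked 4x8 layout and uses no SVE or i8mm instructions. It assumes the usual Q8_0 block of 32 int8 values sharing one scale, and for portability it stores the scale as `float` where ggml uses a 16-bit float. The names `BlockQ8`, `dot_q8_0`, and `gemm_q8_0_ref` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int QK8_0 = 32;  // values per Q8_0 block

// Simplified Q8_0 block (assumption: float scale instead of fp16).
struct BlockQ8 {
    float  d;          // per-block scale
    int8_t qs[QK8_0];  // quantized values
};

// Dot product of two rows, each made of `nblocks` Q8_0 blocks.
// Integer products are accumulated per block, then scaled by both block scales.
float dot_q8_0(const BlockQ8* a, const BlockQ8* b, std::size_t nblocks) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < nblocks; ++i) {
        int32_t isum = 0;
        for (int j = 0; j < QK8_0; ++j) {
            isum += int32_t(a[i].qs[j]) * int32_t(b[i].qs[j]);
        }
        sum += a[i].d * b[i].d * float(isum);
    }
    return sum;
}

// Reference GEMM: C[m * N + n] = dot(row m of A, row n of B).
// A holds M rows and B holds N rows, each row being `nblocks` blocks.
void gemm_q8_0_ref(const std::vector<BlockQ8>& A, const std::vector<BlockQ8>& B,
                   std::vector<float>& C,
                   std::size_t M, std::size_t N, std::size_t nblocks) {
    assert(A.size() == M * nblocks && B.size() == N * nblocks);
    C.assign(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            C[m * N + n] = dot_q8_0(&A[m * nblocks], &B[n * nblocks], nblocks);
        }
    }
}
```

An optimized kernel produces the same values as this loop nest but processes several rows and columns per iteration, which is why comparing its output against a plain reference is a useful correctness check.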
The command used to measure perplexity is
./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw --chunks 10
Performance Check
This PR improves the prompt eval time (TTFT) of LLM inference by ~20% compared to the original NEON version.
Performance was measured on a 64-core Graviton3E.
The improvement is as follows. Values are in tokens/second.
The command used to measure the performance is
llama-bench --model ${PATH_TO_MODEL} -n 128 -p 128 -t 4,8,16,32,64
Requirements