Conversation

@abhijain1204fujitsu

This PR introduces support for SVE (Scalable Vector Extension) kernels for the q4_K_q8_K GEMM using i8mm and vector instructions. ARM NEON support for this kernel was added in PR #16739.

Verifying Feature
----------------------------------------------------------------------------
This PR contains the SVE implementation of the GEMM used for Q4_K-quantized matrix multiplication.

Kernel: ggml_gemm_q4_K_8x8_q8_K()

I checked the generation output by running a Q4_K_M-quantized Llama-3.1-8B model.
I also verified that the perplexity matches between the NEON and SVE implementations.

| NEON (Original) | SVE (This PR) |
| --- | --- |
| 13.9017 +/- 1.44495 | 13.8577 +/- 1.44081 |

This change does not appear to have any impact on accuracy.

The command used to measure perplexity is:

./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw --chunks 4

Performance Check
----------------------------------------------------------------------------

This PR improves the prompt eval time (TTFT) of LLM inference by 17-20% compared to NEON (PR #16739).

The performance was measured on a 64-core Graviton3E.
Performance improves as follows; values are tokens per second.

| Threads | NEON (Original) | SVE (This PR) | Speedup |
| --- | --- | --- | --- |
| 4 | 24.67 | 29.77 | 1.20 |
| 8 | 49.05 | 59.35 | 1.21 |
| 16 | 97.33 | 117.62 | 1.20 |
| 32 | 186.03 | 221.68 | 1.19 |
| 64 | 324.55 | 381.08 | 1.17 |

The command used to measure performance is:

llama-bench  --model ${PATH_TO_MODEL} -n 128 -p 128 -t 4,8,16,32,64

This work is a contribution of @Vithulep and @abhijain1204fujitsu.

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jan 27, 2026
@ggerganov
Member

cc @Alcpz

@pvname
Contributor

pvname commented Jan 28, 2026

Regarding CI Failure

When I ran the same command on my system, it built correctly with no issues. Can we check or rerun the CI pipeline?

We have not made any changes to CMake or the x86 code.

I am attaching the logs.

cmake -B build -DLLAMA_BUILD_BORINGSSL=ON -DGGML_SCHED_NO_REALLOC=ON
  cmake --build build --config RelWithDebInfo -j ${env:NUMBER_OF_PROCESSORS} --target llama-server 
-- The C compiler identification is GNU 13.1.0
-- The CXX compiler identification is GNU 13.1.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMAKE_BUILD_TYPE=Release
-- Found Git: /usr/bin/git (found version "2.34.1") 
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/cc
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- GGML_SYSTEM_ARCH: ARM
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- ARM detected
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
-- ARM detected flags: -mcpu=zeus+crc+aes+sha3+sm4
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosme
-- Performing Test GGML_MACHINE_SUPPORTS_nosme - Failed
-- Checking for ARM features using flags:
--   -mcpu=zeus+crc+aes+sha3+sm4+dotprod+i8mm+sve
-- Performing Test HAVE_DOTPROD
-- Performing Test HAVE_DOTPROD - Success
-- Performing Test HAVE_SVE
-- Performing Test HAVE_SVE - Success
-- Performing Test HAVE_MATMUL_INT8
-- Performing Test HAVE_MATMUL_INT8 - Success
-- Performing Test HAVE_FMA
-- Performing Test HAVE_FMA - Success
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC - Success
-- Performing Test HAVE_SME
-- Performing Test HAVE_SME - Failed
-- Adding CPU backend variant ggml-cpu: -mcpu=zeus+crc+aes+sha3+sm4+dotprod+i8mm+sve 
-- ggml version: 0.9.5
-- ggml commit:  c3d8907de
-- Fetching BoringSSL version 0.20251002.0
-- Generating embedded license file for target: common
-- Configuring done (26.7s)
-- Generating done (0.4s)
-- Build files have been written to: /home/prashantv/fj-prop-test/llama.cpp/build

Collaborator

@Alcpz Alcpz left a comment


Overall I don't see any issues with the existing implementation, so all good from my perspective. Please also try to run clang-format on your changes; there are some inconsistencies in the style.

```cpp
constexpr int q8_k_blocklen = 4;
const uint8x16_t m4b = vdupq_n_u8(0x0f);
#if defined(__aarch64__) && defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_MATMUL_INT8)
if (svcntb()*8 == 256) {
```

Format

```cpp
}

// q8_ptr[b].qs has interleaved Q8 rows (01, 23)
// const int8_t * q8_base = q8_ptr[b].qs + sb * 256;
```

There is redundant commented code. Some comments could be improved a bit as well.


```cpp
for (int y = 0; y < nr / q8_k_blocklen; y++) {
const block_q8_Kx4 * GGML_RESTRICT q8_ptr = (const block_q8_Kx4 *) vy + (y * nb);
const block_q8_Kx4 * GGML_RESTRICT q8_ptr_1 = (const block_q8_Kx4 *) vy + (y * nb);
```

I don't understand the need for the same variable twice; I don't see it being used in a way that makes this necessary. Either clarify or clean up.

```cpp
acc_f32_67 = svdup_n_f32(0);

for (int b = 0; b < nb; b++) {
// bsums pairs belongs to the same q8_k subblock // 64 elemnts loaded and made sum of 0-7 and 8-15 sum || 16-23 and 24 - 31 sum
```

Suggested change

```diff
-// bsums pairs belongs to the same q8_k subblock // 64 elemnts loaded and made sum of 0-7 and 8-15 sum || 16-23 and 24 - 31 sum
+// bsums pairs belongs to the same q8_k subblock
+// 64 elements loaded and made sum of 0-7 and 8-15 sum || 16-23 and 24 - 31 sum
```
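For context on what this comment describes: adjacent 16-bit bsums can be de-interleaved and added so that each result covers one 32-element q4_K sub-block. A minimal hedged sketch with SVE intrinsics (illustrative only, not the PR's actual code; `pairwise_bsums` is a made-up name):

```c
#include <arm_sve.h>

// Hedged sketch: sum adjacent pairs of 16-bit bsums so each result
// corresponds to one 32-element q4_K sub-block.
static inline svint16_t pairwise_bsums(const int16_t * bsums) {
    const svbool_t  pg   = svptrue_b16();
    const svint16_t v    = svld1_s16(pg, bsums);  // b0, b1, b2, b3, ...
    const svint16_t even = svuzp1_s16(v, v);      // b0, b2, ... (mirrored in high half)
    const svint16_t odd  = svuzp2_s16(v, v);      // b1, b3, ... (mirrored in high half)
    return svadd_s16_x(pg, even, odd);            // b0+b1, b2+b3, ...
}
```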

@Alcpz
Collaborator

Alcpz commented Jan 28, 2026

> Regarding CI Failure
>
> When I ran the same command on my system, it built correctly with no issues. Can we check or rerun the CI pipeline?
>
> We have not made any changes to CMake or the x86 code.

The server failures are due to changes in the CI. If you rebase on top of master, you should get rid of those. I also saw some issues with the x86 high-performance job failing on other pipelines, but as you say, that is not caused by this PR.

@abhijain1204fujitsu abhijain1204fujitsu force-pushed the gemm_q4_K_8x8_q8_K_Kernel_SVE_Porting branch from c75f491 to 1d4d342 on January 29, 2026 at 07:46
@abhijain1204fujitsu
Author

@Alcpz The rebase and formatting changes have been pushed.
Kindly review the PR further.

Thank you!

@pvname
Contributor

pvname commented Feb 3, 2026

Hi @ggerganov and @Alcpz, please review the code. I don't know why @loci-dev is not showing the performance gain. My guess is that it might not be utilizing the SVE code.

I am happy to help if you need anything else.

Thank you.

@Alcpz
Collaborator

Alcpz commented Feb 3, 2026

> Hi @ggerganov and @Alcpz, please review the code. I don't know why @loci-dev is not showing the performance gain. My guess is that it might not be utilizing the SVE code.
>
> I am happy to help if you need anything else.
>
> Thank you.

I've already given my review. Unfortunately, I don't have access to an SVE system, so I can't really dig further into it. I'm trusting everything was tested accordingly; since it's GEMM, perplexity should be enough to detect failures.

Sorry I can't be of more help.

@taronaeo

This comment was marked as outdated.

…nce slowing the performance.

So added code: if SVE 256 is not present, then use the NEON code.
@abhijain1204fujitsu
Author

Hi @taronaeo Thanks for running the benchmark on c8gn.2xlarge. This is a Graviton 4 machine.

Graviton 4 has an SVE vector length of 128 bits, and the current code is written for a 256-bit SVE vector length.

So when running with this PR, it utilizes neither the NEON nor the SVE code and falls back to ggml_gemm_q4_K_8x8_q8_K_generic, hence the huge performance gap you saw.

I have now modified the code so that (see the sketch below):
- if SVE is available and the vector length is 256 bits, use the SVE kernel;
- else, if NEON is available, use the NEON kernel;
- else, use the generic kernel.

So now, with these changes, you will see similar performance between NEON and this PR.
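A minimal sketch of that selection order (illustrative only; `gemm_q4_K_dispatch` is a placeholder name, not the actual symbol in ggml):

```c
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>   // svcntb()
#endif

// Illustrative dispatch sketch; the function name is a placeholder.
void gemm_q4_K_dispatch(void) {
#if defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_MATMUL_INT8)
    if (svcntb() * 8 == 256) {  // runtime SVE vector length in bits
        // 256-bit SVE + i8mm path (this PR)
        return;
    }
#endif
#if defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
    // NEON i8mm path (PR #16739)
    return;
#endif
    // scalar fallback: ggml_gemm_q4_K_8x8_q8_K_generic
}
```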

I am attaching my runtime results for c8gn.48xlarge (Graviton 4, 128-bit vector length).

For NEON:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 8 | pp128 | 33.61 ± 0.51 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 8 | tg128 | 15.79 ± 0.30 |

For this PR:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 8 | pp128 | 33.61 ± 0.51 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 8 | tg128 | 15.79 ± 0.30 |

I hope this clears up your doubt.

Thank you.

@taronaeo
Collaborator

taronaeo commented Feb 9, 2026

> Graviton 4 has an SVE vector length of 128 bits

Great catch, I forgot about that. I can retest it on Graviton 3 where 256-bit SVE is available and update the benchmarks again :)

@taronaeo
Collaborator

I've tested this using an AWS hpc7g.16xlarge instance and managed to reproduce your performance improvement. Great job!

| model | size | params | backend | threads | test | t/s MASTER | t/s PR | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 4 | pp512 | 27.46 ± 0.00 | 31.87 ± 0.00 | 1.16 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 4 | tg128 | 9.18 ± 0.00 | 9.38 ± 0.00 | 1.02 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 8 | pp512 | 54.63 ± 0.00 | 63.32 ± 0.01 | 1.16 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 8 | tg128 | 17.00 ± 0.00 | 17.45 ± 0.01 | 1.03 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 16 | pp512 | 109.16 ± 0.01 | 125.77 ± 0.03 | 1.15 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 16 | tg128 | 29.29 ± 0.01 | 30.06 ± 0.02 | 1.03 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 32 | pp512 | 214.96 ± 0.04 | 247.34 ± 0.05 | 1.15 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 32 | tg128 | 36.75 ± 0.02 | 36.89 ± 0.02 | 1.00 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 64 | pp512 | 406.50 ± 2.41 | 463.26 ± 4.64 | 1.14 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 64 | tg128 | 39.08 ± 0.69 | 39.22 ± 0.56 | 1.00 |

Collaborator

@taronaeo taronaeo left a comment


Minor code cleanup

Comment on lines +3130 to +3134
```cpp
const uint32_t mins_0_3 = sm[1] & kmask1;
const uint32_t mins_4_7 = ((sm[2] >> 4) & kmask2) | (((sm[1] >> 6) & kmask3) << 4);

const uint32_t mins_0_3_1 = sm1[1] & kmask1;
const uint32_t mins_4_7_1 = ((sm1[2] >> 4) & kmask2) | (((sm1[1] >> 6) & kmask3) << 4);
```

Suggested change

```diff
 const uint32_t mins_0_3 = sm[1] & kmask1;
 const uint32_t mins_4_7 = ((sm[2] >> 4) & kmask2) | (((sm[1] >> 6) & kmask3) << 4);
-
 const uint32_t mins_0_3_1 = sm1[1] & kmask1;
 const uint32_t mins_4_7_1 = ((sm1[2] >> 4) & kmask2) | (((sm1[1] >> 6) & kmask3) << 4);
```
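For readers following the masked expressions above: they vectorize the same packed 6-bit scale/min layout that ggml unpacks one index at a time in its scalar helper (paraphrased from ggml-quants.c; the exact signature may differ between versions):

```c
#include <stdint.h>

// Scalar reference for Q4_K's packed 6-bit scales/mins (12 bytes).
// For j < 4 the values sit in the low 6 bits of bytes j and j+4; for
// j >= 4 they are reassembled from a 4-bit low part and a 2-bit high part.
static inline void get_scale_min_k4(int j, const uint8_t * q, uint8_t * d, uint8_t * m) {
    if (j < 4) {
        *d = q[j] & 63;
        *m = q[j + 4] & 63;
    } else {
        *d = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        *m = (q[j + 4] >>  4) | ((q[j    ] >> 6) << 4);
    }
}
```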

Comment on lines +3140 to +3141
```cpp
svuint8_t mins_u8 = svreinterpret_u8_u32(mins_u32_temp);
svuint8_t mins_u8_1 = svreinterpret_u8_u32(mins_u32_temp_1);
```

Suggested change

```cpp
svuint8_t mins_u8 = svreinterpret_u8_u32(mins_u32_temp);
svuint8_t mins_u8_1 = svreinterpret_u8_u32(mins_u32_temp_1);
```

Comment on lines +3147 to +3149
```cpp
q4sb_mins_0 = svreinterpret_s32_u32(mins_u16);

q4sb_mins_1 = svreinterpret_s32_u32(mins_u16_1);
```

Suggested change

```diff
 q4sb_mins_0 = svreinterpret_s32_u32(mins_u16);
-
 q4sb_mins_1 = svreinterpret_s32_u32(mins_u16_1);
```

Comment on lines +3184 to +3193

```cpp
svint8_t q8_qs_0 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 0), svld1_s8(pl16, q8_base_1 + 112));
svint8_t q8_qs_2 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 32), svld1_s8(pl16, q8_base_1 + 144));
svint8_t q8_qs_4 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 64), svld1_s8(pl16, q8_base_1 + 176));
svint8_t q8_qs_6 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 96), svld1_s8(pl16, q8_base_1 + 208));

svint8_t q8_qs_1 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 16), svld1_s8(pl16, q8_base_1 + 128));
svint8_t q8_qs_3 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 48), svld1_s8(pl16, q8_base_1 + 160));
svint8_t q8_qs_5 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 80), svld1_s8(pl16, q8_base_1 + 192));
svint8_t q8_qs_7 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 112), svld1_s8(pl16, q8_base_1 + 224));
```

Suggested change

```diff
 svint8_t q8_qs_0 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 0), svld1_s8(pl16, q8_base_1 + 112));
 svint8_t q8_qs_2 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 32), svld1_s8(pl16, q8_base_1 + 144));
 svint8_t q8_qs_4 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 64), svld1_s8(pl16, q8_base_1 + 176));
 svint8_t q8_qs_6 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 96), svld1_s8(pl16, q8_base_1 + 208));
-
 svint8_t q8_qs_1 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 16), svld1_s8(pl16, q8_base_1 + 128));
 svint8_t q8_qs_3 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 48), svld1_s8(pl16, q8_base_1 + 160));
 svint8_t q8_qs_5 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 80), svld1_s8(pl16, q8_base_1 + 192));
 svint8_t q8_qs_7 = svadd_s8_x(svptrue_b8(), svld1_s8(ph16, q8_base_1 + 112), svld1_s8(pl16, q8_base_1 + 224));
```

Comment on lines +3206 to +3209
```cpp
svint8_t q4_nibbles_00 = svreinterpret_s8_u8(svlsr_n_u8_m(pl16, svand_u8_m(ph16, q4_qs_cp_00, m4b_1), 4));
svint8_t q4_nibbles_01 = svreinterpret_s8_u8(svlsr_n_u8_m(pl16, svand_u8_m(ph16, q4_qs_cp_01, m4b_1), 4));
svint8_t q4_nibbles_02 = svreinterpret_s8_u8(svlsr_n_u8_m(pl16, svand_u8_m(ph16, q4_qs_cp_02, m4b_1), 4));
svint8_t q4_nibbles_03 = svreinterpret_s8_u8(svlsr_n_u8_m(pl16, svand_u8_m(ph16, q4_qs_cp_03, m4b_1), 4));
```

Suggested change

```cpp
svint8_t q4_nibbles_00 = svreinterpret_s8_u8(svlsr_n_u8_m(pl16, svand_u8_m(ph16, q4_qs_cp_00, m4b_1), 4));
svint8_t q4_nibbles_01 = svreinterpret_s8_u8(svlsr_n_u8_m(pl16, svand_u8_m(ph16, q4_qs_cp_01, m4b_1), 4));
svint8_t q4_nibbles_02 = svreinterpret_s8_u8(svlsr_n_u8_m(pl16, svand_u8_m(ph16, q4_qs_cp_02, m4b_1), 4));
svint8_t q4_nibbles_03 = svreinterpret_s8_u8(svlsr_n_u8_m(pl16, svand_u8_m(ph16, q4_qs_cp_03, m4b_1), 4));
```
