ggml-cpu: optimize avx2 q6_k #22345

Merged: ggerganov merged 1 commit into ggml-org:master from netrunnereve:q6_k on Apr 26, 2026
Conversation

@netrunnereve (Collaborator)

Basically I took the optimizations I did for AVX a while back and brought them over to AVX2.

PR:

| model         | size       | params | backend | threads | test  | t/s          |
| ------------- | ---------- | ------ | ------- | ------- | ----- | ------------ |
| llama 1B Q6_K | 860.86 MiB | 1.10 B | CPU     | 4       | pp512 | 63.15 ± 0.34 |
| llama 1B Q6_K | 860.86 MiB | 1.10 B | CPU     | 4       | tg128 | 15.63 ± 0.08 |
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  345 runs -  3408.30 us/run - 117.44 MFLOP/run -  34.46 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  315 runs -  3435.15 us/run - 234.88 MFLOP/run -  68.38 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  207 runs -  4928.92 us/run - 352.32 MFLOP/run -  71.48 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  252 runs -  4204.82 us/run - 469.76 MFLOP/run - 111.72 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  210 runs -  5041.41 us/run - 587.20 MFLOP/run - 116.48 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  144 runs -  7309.56 us/run - 939.52 MFLOP/run - 128.53 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  3 runs - 417778.33 us/run -  60.13 GFLOP/run - 143.93 GFLOPS

Master:

| model         | size       | params | backend | threads | test  | t/s          |
| ------------- | ---------- | ------ | ------- | ------- | ----- | ------------ |
| llama 1B Q6_K | 860.86 MiB | 1.10 B | CPU     | 4       | pp512 | 49.75 ± 0.40 |
| llama 1B Q6_K | 860.86 MiB | 1.10 B | CPU     | 4       | tg128 | 15.53 ± 0.19 |
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  276 runs -  4364.08 us/run - 117.44 MFLOP/run -  26.91 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  280 runs -  3598.24 us/run - 234.88 MFLOP/run -  65.28 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  207 runs -  5086.99 us/run - 352.32 MFLOP/run -  69.26 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  180 runs -  5805.74 us/run - 469.76 MFLOP/run -  80.91 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  168 runs -  6138.51 us/run - 587.20 MFLOP/run -  95.66 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  108 runs -  9657.79 us/run - 939.52 MFLOP/run -  97.28 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  2 runs - 594931.50 us/run -  60.13 GFLOP/run - 101.07 GFLOPS

@netrunnereve netrunnereve requested a review from ggerganov as a code owner April 25, 2026 02:25
@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 25, 2026
@ggerganov ggerganov added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label Apr 25, 2026
@ggerganov ggerganov merged commit 2dd8416 into ggml-org:master Apr 26, 2026
43 of 46 checks passed
@netrunnereve netrunnereve deleted the q6_k branch April 28, 2026 20:41
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026