@ikawrakow commented on Jun 20, 2025

This PR adapts the ARM_NEON trellis implementation to the new integer trellis.
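
For context on what the kernel has to do per weight: the core of an integer trellis is a linear-congruential state update followed by a cheap per-lane byte-sum reduction. Below is a minimal NEON sketch of that general shape only; the function name (`trellis_next4`) and the constants (`KA`, `KB`, `KMASK`, the recentring offset) are illustrative placeholders, not the values or the exact code used in this PR.

```c
#include <arm_neon.h>
#include <stdint.h>

// Illustrative placeholder constants; the actual multiplier, increment,
// mask and offset used by the integer trellis live in the repository sources.
#define KA    0xCBAC1FEDu   // LCG multiplier (placeholder)
#define KB    0x000003E5u   // LCG increment  (placeholder)
#define KMASK 0x3f3f3f3fu   // per-byte mask  (placeholder)

// Advance 4 independent trellis states and produce 4 values.
// Each value is the sum of the four masked bytes of the updated 32-bit state,
// recentred around zero - the general "integer trellis" pattern, vectorized.
static inline int32x4_t trellis_next4(uint32x4_t *state) {
    const uint32x4_t ka   = vdupq_n_u32(KA);
    const uint32x4_t kb   = vdupq_n_u32(KB);
    const uint32x4_t mask = vdupq_n_u32(KMASK);

    // state = state*KA + KB (vectorized LCG update)
    uint32x4_t s = vmlaq_u32(kb, *state, ka);
    *state = s;

    // Mask each byte, then sum the 4 bytes of every 32-bit lane.
    uint8x16_t bytes = vreinterpretq_u8_u32(vandq_u32(s, mask));
    uint16x8_t sum16 = vpaddlq_u8(bytes);   // pairwise widen-add: bytes -> u16
    uint32x4_t sum32 = vpaddlq_u16(sum16);  // pairwise widen-add: u16 -> u32

    // Recentre around zero (the offset here is a placeholder for the real bias).
    return vsubq_s32(vreinterpretq_s32_u32(sum32), vdupq_n_s32(126));
}
```

Even in this simplified form, every dequantized weight costs a multiply-add plus a couple of horizontal reductions on top of the usual dot product, which is relevant for the TG numbers below.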

Tests were done on an M2-Max CPU using LLaMA-3.1-8B-Instruct.

Very respectable PP performance:

| model | size | test | t/s |
| --- | --- | --- | --- |
| llama 8B IQ2_KT | 2.77 GiB | pp512 | 129.19 ± 0.22 |
| llama 8B IQ3_KT | 3.58 GiB | pp512 | 127.66 ± 0.38 |
| llama 8B IQ4_KT | 4.30 GiB | pp512 | 125.23 ± 0.44 |

Still very low TG performance:

| model | size | test | t/s |
| --- | --- | --- | --- |
| llama 8B IQ2_KT | 2.77 GiB | tg128 | 12.59 ± 0.15 |
| llama 8B IQ3_KT | 3.58 GiB | tg128 | 9.92 ± 0.02 |
| llama 8B IQ4_KT | 4.30 GiB | tg128 | 9.73 ± 0.05 |

Don't ask Apple Silicon to do too much work with a piece of data fetched from memory.
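
As a rough sanity check: at 12.59 t/s the 2.77 GiB IQ2_KT model streams about 2.77 GiB × 12.59 t/s ≈ 37 GB/s of weights, far below the ~400 GB/s figure usually quoted for the M2-Max, so TG here is limited by the per-weight decoding work rather than by memory bandwidth.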

Nevertheless, compared to PR #471 we observe a ~13% speedup for IQ2_KT, a ~30% speedup for IQ3_KT, and a nearly 70% speedup for IQ4_KT.

@ikawrakow merged commit 1843ed2 into main on Jun 20, 2025