ggml-cpu : add STQ1_0 ternary quantization with ARM NEON vec_dot kernel #22836
sjl623 wants to merge 3 commits into
This seems interesting, @khosravipasha something for you? Naming-wise it should probably be |
|
@sjl623 If you are doing benchmarks, you could also add |
Hi @CISC, thanks for taking a look! We've renamed the format to STQ1_0 according to your suggestion. Happy to hear further thoughts :) |
Good call, thanks. I've added Q1_0 to the benchmark table. STQ1_0 actually still comes out ahead on tg128, which was a bit unexpected given Q1_0 is binary and has lower bpw. Separately, per the Sherry paper, the 1.25-bit ternary setting also has better accuracy than the 1-bit binary baseline. |
|
Hi @ggerganov, could you take a look at this PR when you have time? Thanks! |
Compile |
Hi, thanks for the report! Quick check: did you download the .gguf from HF directly, or convert one yourself per the docs? The HF gguf was produced with an older quant type ID and isn't loadable as-is. Please re-convert following the doc instructions (we'll refresh the HF weights later). Also note: currently only ARM (Apple M-series) is supported; x86 isn't yet. |
I downloaded |
@cherish-ltt |
|
Using Android app demo from https://huggingface.co/tencent/Hy-MT1.5-1.8B-1.25bit-GGUF: |
@dima-xd, the source GGUF for the APK was replaced by mistake yesterday; it should be OK now. Just reinstall the app and re-download the GGUF. |
Thanks, it works now! |
|
./build/bin/llama-completion still fails with an error
Looks like it's being killed due to OOM. The default context (262144) makes the KV cache ~16 GiB, which may exceed host memory. Could you try again with |
./build/bin/llama-completion --model model_zoo/Hy-MT1.5-1.8B-STQ1_0.gguf -p "Translate the following segment into Vietnamese, without additional explanation. Hello " --jinja -ngl 0 -c 4096 -n 64 -st
system_info: n_threads = 4 (n_threads_batch = 4) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3567722247
Translate the following segment into Vietnamese, without additional explanation. Hello Xin chào.َايْتُمْ بَعْضُنُمْ بَعْلَيٰ ٌبَعْلَيٰ ٌبَعْلَيٰ… الْمُنْ
common_perf_print: sampling time = 13.33 ms
Both the performance and the output are poor. Does it only run on ARM? |
Yes, currently the acceleration only supports ARM chips. You can try it on an M-series MacBook, or download our APK to experience it directly. |
|
approved. |
|
I don't think we can justify the introduction of |
Thanks for the feedback! We've retrained the model with stride-16 3:4 sparsity so that each NEON lane naturally aligns with the standard Q8_K layout after unpack. No repack, no deinterleave, no new type. You can find more details in the "STQ1_0: Stride-16 Sparsity Layout" section of the PR description. Please let us know if there are any further concerns :) |
|
Hi @CISC @ggerganov , CI is green now on my fork https://github.com/sjl623/llama.cpp/actions/runs/26734514829. There's also some interest from the community — e.g.
Could you let me know how best to move this forward? Happy to make any changes needed. Thanks! |
Build and Test Report (GCC14 + DOTPROD Enabled)
@sjl623 Hi, I'm getting incorrect translation outputs when using the 1.25-bit GGUF quantization (both NEON and scalar). Could you please take a look?
Scope
Model Checksums (SHA256)
Environment
Build (GCC14, forced ARM dotprod)
Command:
cmake -S . -B build-gcc14-dotprod \
  -DCMAKE_C_COMPILER=gcc-14 \
  -DCMAKE_CXX_COMPILER=g++-14 \
  -DGGML_NATIVE=OFF \
  -DGGML_CPU_ARM_ARCH=armv8.2-a+dotprod && \
cmake --build build-gcc14-dotprod --target llama-completion llama-bench -j"$(nproc)"
Result:
Runtime Tests
1) llama-completion
Command:
./build-gcc14-dotprod/bin/llama-completion --model ./Hy-MT2-1.8B-1.25Bit.gguf -p "Translate the following segment into Chinese, without additional explanation:Hello" --jinja -ngl 0 -c 4096 -n 64 -st
Result:
Observed garbled output (exact excerpt):
Conclusion
|
Thanks for your interest! The 1.25-bit packing format was updated (stride-16) in this PR, so the old GGUF no longer The Hy-MT2 1.25-bit (stride-16) model is training and will be released soon. |
|
The performance is abnormal. It's only about 1 token/s. |
Looks like you're running on x86 (Intel i3, AVX2). At the moment the accelerated kernel is ARM-only, so on x86 it falls back to a naive C implementation, which is much slower. An optimized x86 kernel is already on the way. |
|
Can we integrate Hy-MT with STQ acceleration on Android now?
|
Took a look at this on Apple Silicon since the failing CI jobs are ARM/x86 high-perf and I have an M1 Pro handy. Sharing a datapoint and where I think the bug is. On M1 (ARMv8.2, That points the finger at the table-lookup emulation on the other toolchains rather than the algorithm. My main suspect for the red One other thing: Caveat: I can only test Apple clang on M1, so this localizes the issue to the x86/GCC table-lookup emulation rather than clearing those paths. Happy to run anything else on ARM if it helps. |
Thanks for your interest. The answer is YES! Two ways to try it on Android:
|
Thanks! I got correct translation output with this GGUF. FYI, the SHA-256 checksum of this GGUF is 93e025c93cc082e73a3f142b757623a8b9cf541c020a8013ca4ee669556860ab. |
Thanks a lot for digging into this and confirming the correctness of the kernel implementation. Really appreciate the result that the NEON kernel is bit-for-bit identical to the scalar reference, plus the clean UBSan(-fsanitize=undefined) run. 🙏
I'll rebase to latest master and try again. |
1.3125 bpw quantization. Each block of 256 elements stores 64 groups of 4
ternary lanes (-1/0/+1) with the constraint that every group has exactly
one zero and three non-zero lanes of identical magnitude. The ternary
pattern is encoded as a 4-bit codebook index plus a 1-bit global sign,
yielding 32 patterns over 4 lanes (5 bits / 4 lanes = 1.25 bpw payload),
plus a per-block fp16 scale (0.0625 bpw) for 1.3125 bpw total.
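The counting behind those numbers can be checked with a short script. This is only an illustrative sketch, not code from the PR: the variable names are made up here, and the "first non-zero lane is positive" rule used to pick one representative per sign pair is an assumption about how a 16-entry codebook plus a sign bit could cover all 32 patterns.

```python
from itertools import product

# All 4-lane ternary groups with exactly one zero lane (3:4 sparsity).
patterns = [p for p in product((-1, 0, 1), repeat=4) if p.count(0) == 1]

# A pattern and its negation can share one codebook entry plus a sign bit.
# Assumed convention (not taken from the PR): keep the variant whose first
# non-zero lane is +1.
canonical = [p for p in patterns if next(v for v in p if v) > 0]

GROUPS_PER_BLOCK = 64   # 256 weights / 4 lanes
BLOCK_WEIGHTS = 256
INDEX_BITS, SIGN_BITS, SCALE_BITS = 4, 1, 16

payload_bits = GROUPS_PER_BLOCK * (INDEX_BITS + SIGN_BITS)  # index + sign bits
block_bits = payload_bits + SCALE_BITS                      # add the fp16 scale
block_bytes = block_bits // 8
bpw = block_bits / BLOCK_WEIGHTS

print(len(patterns), len(canonical), block_bytes, bpw)  # 32 16 42 1.3125
```

The 4 possible zero positions times 2^3 sign choices give the 32 patterns, and halving by the shared sign leaves 16 entries, which is exactly what a 4-bit index can address.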
Components:
- block_stq_0 layout and codebook in ggml-common.h
- reference quantize/dequantize/validate in ggml-quants.{h,c}
- generic CPU vec_dot in ggml-cpu/quants.{h,c}
- ARM NEON vec_dot using vqtbl2q for codebook lookup, vdotq_s32 for
accumulation, plus vld4q-based in-place repack of Q8_K activations
- enum slots: GGML_TYPE_STQ_0 = 42, LLAMA_FTYPE_MOSTLY_STQ_0 = 41,
GGMLQuantizationType.STQ_0 = 42, LlamaFileType.MOSTLY_STQ_0 = 41
- llama-quantize CLI option "STQ_0"
Hi @CISC, I have rebased to latest master. Could you trigger the CI again? Thanks! |
|
Thanks for the clarification, that's good to know. If One thing that follows from that, though: because both high-perf jobs were canceled or unrelated, the x86/GCC build never actually ran a correctness check over STQ1_0, so the Happy to run the rebased branch on M1 once it's up to confirm it stays green on ARM. Thanks for the work on this. |
I integrated it and tested it on a Snapdragon 870 Android phone. The speed was only 9-13. Is that because the device is too weak, or is there a mistake somewhere? |
Introduction
This PR adds STQ1_0 (Sparse Ternary Quantization), a new hardware-efficient ternary quantization kernel to support Sherry quantization, which was recently accepted to ACL 2026 (Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification). Each weight is constrained to {-d, 0, +d} with the structural rule that exactly one of every four lanes is zero. This yields 1.3125 bits per weight: 5 bits per 4-weight group (a 4-bit codebook index + 1-bit sign) plus a single fp16 scale per 256-weight block, i.e. 42 B × 8 / 256 = 1.3125 bpw. The format also admits fast SIMD decode through a 32-entry codebook lookup.
In llama.cpp terms, the 3:4 pattern lets STQ1_0 sit between the existing ternary formats: smaller than TQ2_0 (1.3125 vs 2.0625 bpw) and faster than TQ1_0, whose 1.6875-bit 3-way packing is SIMD-unfriendly; STQ1_0's power-of-two 4-way blocks decode directly through vqtbl2q + vdotq_s32 with no bit-shuffling.
Below is a more concrete walkthrough of how our kernel actually decodes a block: the on-disk layout (qs[32] 4-bit codebook indices + sign[8] 1-bit per group + an fp16 scale), the codebook of 4-ternary lanes, and a step-by-step dequantization example showing codebook lookup → sign flip → scale.
Building on this method, Tencent Hunyuan has applied STQ1_0 to compress the Hy-MT1.5-1.8B model and released it to the open-source community, where it has received broad attention: the model passed 16k downloads on Hugging Face within one week (4/29-5/5): AngelSlim/Hy-MT1.5-1.8B-1.25bit. In parallel, Tencent Hunyuan has built an on-device offline translation app on top of llama.cpp; this PR contributes the 1.25-bit kernel implementation that powers it.
There has also been community interest in adding Sherry support to llama.cpp from both directions: users have asked for this format on the llama.cpp side (discussion #19123) and on the model card side (Hy-MT1.5-1.8B-1.25bit / discussions / 3, specifically requesting the kernel). This PR is the upstream answer to those asks.
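For readers who want the codebook lookup → sign flip → scale steps from the introduction in executable form, here is a minimal sketch of decoding one 42-byte block. It is not the PR's reference code. The codebook order, the nibble order inside qs, the bit order inside sign, and the function name dequantize_block are all assumptions made for illustration; the real table lives in ggml-common.h and may differ. The sketch returns decoded groups and does not place them at their final weight positions.

```python
import struct
from itertools import product

# Hypothetical 16-entry codebook: every 3:4 ternary pattern whose first
# non-zero lane is +1. The actual table order in ggml-common.h may differ.
CODEBOOK = [p for p in product((-1, 0, 1), repeat=4)
            if p.count(0) == 1 and next(v for v in p if v) > 0]

def dequantize_block(qs: bytes, sign: bytes, d_fp16: bytes):
    """Decode one block into 64 groups of 4 floats.

    Assumed packing (illustrative only): group g uses the low nibble of
    qs[g // 2] when g is even and the high nibble when g is odd, and
    bit g % 8 of sign[g // 8] as its sign flag.
    """
    assert len(qs) == 32 and len(sign) == 8 and len(d_fp16) == 2
    (d,) = struct.unpack("<e", d_fp16)            # per-block fp16 scale
    groups = []
    for g in range(64):
        idx = (qs[g // 2] >> (4 * (g % 2))) & 0x0F           # codebook lookup
        flip = -1 if (sign[g // 8] >> (g % 8)) & 1 else 1    # sign flip
        groups.append(tuple(d * (flip * v) for v in CODEBOOK[idx]))  # scale
    return groups

# Example block: every index is 0, only group 1 has its sign bit set, d = 0.5.
qs = bytes(32)
sign = bytes([0b00000010]) + bytes(7)
out = dequantize_block(qs, sign, struct.pack("<e", 0.5))
print(CODEBOOK[0], out[0], out[1])
```

With this hypothetical table, entry 0 is (0, 1, -1, -1), so group 0 decodes to (0.0, 0.5, -0.5, -0.5) and group 1, with its sign bit set, to the negated pattern.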
STQ1_0: Stride-16 Sparsity Layout
Notably, STQ1_0's 3:4 sparsity adopts a stride-16 grouping pattern, which is key to achieving efficient SIMD execution without extra data shuffling. Specifically, instead of grouping 4 consecutive weights (w0, w1, w2, w3), STQ1_0 groups weights that are stride-16 within each 64-weight chunk (e.g. w0, w16, w32, w48), where exactly one of the four is zero and the other three take values in {−d, +d}.
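The index arithmetic of this grouping can be sketched in a few lines. The helper names group_members and lane_indices are hypothetical, and the sketch only models which weight indices belong together. It says nothing about how the kernel assigns them to registers.

```python
CHUNK = 64    # weights per chunk
STRIDE = 16   # distance between the four members of one group

def group_members(chunk: int, j: int):
    """Weight indices of group j (0..15) inside the given 64-weight chunk."""
    base = chunk * CHUNK + j
    return [base + k * STRIDE for k in range(4)]

def lane_indices(chunk: int, k: int):
    """Indices at lane position k (0..3), collected over the 16 groups of a chunk."""
    return [group_members(chunk, j)[k] for j in range(STRIDE)]

print(group_members(0, 0))   # [0, 16, 32, 48]
print(lane_indices(0, 1))    # [16, 17, ..., 31]: one contiguous run of 16

# Every weight of a 256-element block belongs to exactly one group.
covered = sorted(i for c in range(4) for j in range(16) for i in group_members(c, j))
print(covered == list(range(256)))   # True
```

Collecting the same lane position across the 16 groups of a chunk gives 16 consecutive indices, which is the property the layout relies on for sequential activation loads.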
The motivation is SIMD alignment. After the codebook unpack (SHR 0/2/4/6 + mask), each NEON lane register (sqx0–sqx3) holds a contiguous run of weight indices. Since standard Q8_K activation quantization stores y values sequentially in memory, the lane contents and y are naturally aligned: a plain vld1q_s8 at offsets 0/16/32/48 is all that is needed, with no deinterleave or repack required.
Acknowledgements
Special thanks to @Little0o0 for proposing the Sherry method and for the helpful discussions on adapting its 3:4 packing layout to llama.cpp.
How to use it
1. Clone and build
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
2. Download the HF model
pip install huggingface_hub
huggingface-cli download AngelSlim/Hy-MT1.5-1.8B-1.25bit \
  --local-dir Hy-MT1.5-1.8B-1.25bit
Model card: AngelSlim/Hy-MT1.5-1.8B-1.25bit.
3. Convert HF → GGUF (bf16)
python convert_hf_to_gguf.py Hy-MT1.5-1.8B-1.25bit \
  --outfile Hy-MT1.5-1.8B-bf16.gguf \
  --outtype bf16
4. Quantize bf16 → STQ1_0
./build/bin/llama-quantize \
  Hy-MT1.5-1.8B-bf16.gguf \
  Hy-MT1.5-1.8B-STQ1_0.gguf \
  STQ1_0
5. Run a completion (CPU only, with chat template)
./build/bin/llama-completion \
  -m Hy-MT1.5-1.8B-STQ1_0.gguf \
  -ngl 0 --jinja \
  -n 64 \
  -p "translate to chinese: hello"
Performance
Benchmarked on Apple M4 Pro (12 cores: 8 P + 4 E, 24 GB unified memory, macOS 26.3.1) with -ngl 0 so all layers run on CPU, exercising the ARM NEON vec_dot kernel introduced by this PR.
To make the comparison against llama.cpp's existing ternary formats apples-to-apples, we used the community-reproduced Sherry-1B reference checkpoint MoraxGeo/Sherry-1B-1.25bit-per-channel, converted it to bf16 GGUF once, and then re-quantized the same weights into all three ternary formats (STQ1_0 / TQ1_0 / TQ2_0) so that any difference below comes purely from the quantization format and its kernel.
./build/bin/llama-bench \
  -m ./model_zoo/Sherry-1B-1.25bit-STQ1_0.gguf,\
./model_zoo/Sherry-1B-1.25bit-TQ1_0.gguf,\
./model_zoo/Sherry-1B-1.25bit-TQ2_0.gguf \
  -ngl 0
Based on the results above, STQ1_0 has the smallest footprint of the three (~11% smaller than TQ1_0, ~20% smaller than TQ2_0) and is faster than TQ1_0. Compared to the 1-bit binary Q1_0 baseline, STQ1_0 is only ~6% larger yet has ~35% higher tg128 throughput. In addition, per the Sherry paper (Fig. 6), 1.25-bit ternary matches 1.67-bit ternary accuracy at 25% fewer bits, while the 1-bit binary configuration shows a ~3 pp accuracy gap.
Requirements