ggml-cpu : add STQ1_0 ternary quantization with ARM NEON vec_dot kernel by sjl623 · Pull Request #22836 · ggml-org/llama.cpp

sjl623 · 2026-05-08T11:07:35Z

Introduction

This PR adds STQ1_0 (Sparse Ternary Quantization), a new hardware-efficient ternary quantization kernel that supports Sherry quantization, which was recently accepted to ACL 2026 (Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification). Each weight is constrained to {-d, 0, +d}, with the structural rule that exactly one of every four lanes is zero. This yields 1.3125 bits per weight: 5 bits per 4-weight group (a 4-bit codebook index + a 1-bit sign) plus a single fp16 scale per 256-weight block, i.e. 42 B × 8 / 256 weights = 1.3125 bpw. The format still admits fast SIMD decode through a 32-entry codebook lookup.
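The bits-per-weight figure can be re-derived from the block layout with a few lines of arithmetic. This is only a sanity-check sketch; the constant names are illustrative and are not taken from the PR's source code.

```python
# Re-derive the STQ1_0 block size and bits per weight from the PR description.
# All names here are illustrative, not identifiers from the PR.
QK = 256          # weights per block
GROUP_SIZE = 4    # weights per 3:4 sparse group
INDEX_BITS = 4    # codebook index per group
SIGN_BITS = 1     # sign bit per group
SCALE_BYTES = 2   # one fp16 scale per block

n_groups = QK // GROUP_SIZE                 # 64 groups per block
qs_bytes = n_groups * INDEX_BITS // 8       # packed 4-bit indices
sign_bytes = n_groups * SIGN_BITS // 8      # packed 1-bit signs
block_bytes = qs_bytes + sign_bytes + SCALE_BYTES

payload_bpw = (INDEX_BITS + SIGN_BITS) / GROUP_SIZE  # without the scale
bpw = block_bytes * 8 / QK                           # including the scale

print(f"qs={qs_bytes} B, sign={sign_bytes} B, block={block_bytes} B")
print(f"payload={payload_bpw} bpw, total={bpw} bpw")
```

The 1.25-bit figure in the paper title is the per-group payload (5 bits / 4 weights); the fp16 scale adds the remaining 0.0625 bpw.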

In llama.cpp terms, the 3:4 pattern lets STQ1_0 sit between the existing ternary formats: smaller than TQ2_0 (1.3125 vs 2.0625 bpw) and faster than TQ1_0, whose 1.6875-bit 3-way packing is SIMD-unfriendly; STQ1_0's power-of-two 4-way blocks decode directly through vqtbl2q + vdotq_s32 with no bit-shuffling.

Below is a more concrete walkthrough of how our kernel actually decodes a block: the on-disk layout (qs[32] 4-bit codebook indices + sign[8] 1-bit per group + an fp16 scale), the codebook of 4-ternary lanes, and a step-by-step dequantization example showing codebook lookup → sign flip → scale.
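As a rough illustration of the lookup → sign flip → scale sequence, here is a minimal Python sketch. The codebook ordering below is an assumption made for illustration (the PR text does not list the actual entries), and `build_codebook` and `dequant_group` are hypothetical helper names. What the sketch does demonstrate is that a 16-entry table plus one sign bit covers all 32 valid 3:4 ternary patterns.

```python
from itertools import product

def build_codebook():
    """Hypothetical 16-entry codebook of 4 ternary lanes.

    ASSUMED ordering: zero position (4 choices) x signs of the second and
    third nonzero lanes (4 choices). The first nonzero lane is fixed to +1;
    the separate per-group sign bit flips the whole group.
    """
    book = []
    for zero_pos in range(4):
        for s1, s2 in product((1, -1), repeat=2):
            signs = iter((1, s1, s2))
            book.append(tuple(0 if lane == zero_pos else next(signs)
                              for lane in range(4)))
    return book

CODEBOOK = build_codebook()

def dequant_group(index, sign_bit, d):
    """Codebook lookup -> sign flip -> scale for one 4-weight group."""
    lanes = CODEBOOK[index]
    flip = -1 if sign_bit else 1
    return [flip * d * t for t in lanes]

def all_patterns():
    """Every pattern reachable from a (4-bit index, 1-bit sign) pair."""
    return {tuple(-t if s else t for t in CODEBOOK[i])
            for i in range(16) for s in (0, 1)}

if __name__ == "__main__":
    print(dequant_group(index=5, sign_bit=1, d=0.25))
    print(len(all_patterns()), "distinct patterns")
```

There are 4 choices for the zero lane and 2^3 sign assignments for the other three lanes, so 32 patterns in total, which is exactly what 5 bits per group can address.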

Building on this method, Tencent Hunyuan has applied STQ1_0 to compress the Hy-MT1.5-1.8B model and released it to the open-source community, where it has received broad attention: the model surpassed 16k downloads on Hugging Face within one week (4/29-5/5): AngelSlim/Hy-MT1.5-1.8B-1.25bit. In parallel, Tencent Hunyuan has built an on-device offline translation app on top of llama.cpp; this PR contributes the 1.25-bit kernel implementation that powers it.

There has also been community interest in adding Sherry support to llama.cpp from both directions: users have asked for this format on the llama.cpp side (discussion #19123) and on the model card side (Hy-MT1.5-1.8B-1.25bit / discussions / 3, specifically requesting the kernel). This PR is the upstream answer to those asks.

STQ1_0: Stride-16 Sparsity Layout

Notably, STQ1_0's 3:4 sparsity adopts a stride-16 grouping pattern, which is key to achieving efficient SIMD execution without extra data shuffling. Specifically, instead of grouping 4 consecutive weights (w0, w1, w2, w3), STQ1_0 groups weights that are stride-16 within each 64-weight chunk (e.g. w0, w16, w32, w48), where exactly one of the four is zero and the other three take values in {−d, +d}.

The motivation is SIMD alignment. After the codebook unpack (SHR 0/2/4/6 + mask), each NEON lane register (sqx0–sqx3) holds a contiguous run of weight indices. Since standard Q8_K activation quantization stores y values sequentially in memory, the lane contents and y are naturally aligned — a plain vld1q_s8 at offsets 0/16/32/48 is all that is needed, with no deinterleave or repack required.
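The index mapping can be checked with a small simulation. This sketch assumes each group decodes to one byte holding four 2-bit lane codes (code = value + 1), which is an illustrative encoding and not necessarily the one used by the kernel; `pack_chunk` and `unpack_register` are hypothetical names. It shows that shifting by 0/2/4/6 and masking yields four "registers" that each hold 16 consecutive weights.

```python
# Simulate the stride-16 grouping inside one 64-weight chunk.
# ASSUMPTION: one byte per group, 2 bits per lane, code = value + 1.
LANES = 4
STRIDE = 16

def group_members(g):
    """Weight indices belonging to group g (0..15) of a 64-weight chunk."""
    return [g + STRIDE * k for k in range(LANES)]

def pack_chunk(w):
    """Pack 64 ternary weights into 16 bytes, one byte per stride-16 group."""
    assert len(w) == LANES * STRIDE
    return [sum((w[g + STRIDE * k] + 1) << (2 * k) for k in range(LANES))
            for g in range(STRIDE)]

def unpack_register(packed, k):
    """Emulate SHR by 2*k plus mask across all 16 bytes: lane k of every group."""
    return [((b >> (2 * k)) & 0b11) - 1 for b in packed]

def example_chunk():
    """A 64-weight chunk in which every stride-16 group has exactly one zero."""
    return [0 if k == g % LANES else (1 if (g + k) % 2 == 0 else -1)
            for k in range(LANES) for g in range(STRIDE)]

if __name__ == "__main__":
    w = example_chunk()
    packed = pack_chunk(w)
    for k in range(LANES):
        # Register k lines up with the 16 activations at offset 16*k.
        assert unpack_register(packed, k) == w[STRIDE * k: STRIDE * (k + 1)]
    print("group 0 covers weights", group_members(0))
```

With consecutive grouping (w0, w1, w2, w3), the same shift-and-mask step would instead produce every fourth weight in each register, and the activations would have to be deinterleaved to match.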

Acknowledgements

Special thanks to @Little0o0 for proposing the Sherry method and for the helpful discussions on adapting its 3:4 packing layout to llama.cpp.

How to use it

1. Clone and build

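git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Until this PR is merged, fetch its branch (the local branch name is arbitrary)
git fetch origin pull/22836/head:stq1_0
git checkout stq1_0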
cmake -B build
cmake --build build --config Release -j

2. Download the HF model

pip install huggingface_hub
huggingface-cli download AngelSlim/Hy-MT1.5-1.8B-1.25bit \
    --local-dir Hy-MT1.5-1.8B-1.25bit

Model card: AngelSlim/Hy-MT1.5-1.8B-1.25bit.

3. Convert HF → GGUF (bf16)

python convert_hf_to_gguf.py Hy-MT1.5-1.8B-1.25bit \
    --outfile Hy-MT1.5-1.8B-bf16.gguf \
    --outtype bf16

4. Quantize bf16 → STQ1_0

./build/bin/llama-quantize \
    Hy-MT1.5-1.8B-bf16.gguf \
    Hy-MT1.5-1.8B-STQ1_0.gguf \
    STQ1_0

5. Run a completion (CPU only, with chat template)

./build/bin/llama-completion \
    -m Hy-MT1.5-1.8B-STQ1_0.gguf \
    -ngl 0 --jinja \
    -n 64 \
    -p "translate to chinese: hello"

Performance

Benchmarked on Apple M4 Pro (12 cores: 8 P + 4 E, 24 GB unified memory, macOS 26.3.1) with -ngl 0 so all layers run on CPU, exercising the ARM NEON vec_dot kernel introduced by this PR.

To make the comparison against llama.cpp's existing ternary formats apples-to-apples, we used the community-reproduced Sherry-1B reference checkpoint MoraxGeo/Sherry-1B-1.25bit-per-channel, converted it to bf16 GGUF once, and then re-quantized the same weights into all three ternary formats (STQ1_0 / TQ1_0 / TQ2_0) so that any difference below comes purely from the quantization format and its kernel.

./build/bin/llama-bench \
    -m ./model_zoo/Sherry-1B-1.25bit-STQ1_0.gguf,\
./model_zoo/Sherry-1B-1.25bit-TQ1_0.gguf,\
./model_zoo/Sherry-1B-1.25bit-TQ2_0.gguf \
    -ngl 0

model	size	params	threads	test	t/s
llama 1B STQ1_0 — 1.31 bpw	358.00 MiB	1.24 B	8	pp512	732.69 ± 20.00
llama 1B STQ1_0 — 1.31 bpw	358.00 MiB	1.24 B	8	tg128	147.47 ± 1.36
llama 1B Q1_0	336.25 MiB	1.24 B	8	pp512	768.47 ± 14.75
llama 1B Q1_0	336.25 MiB	1.24 B	8	tg128	109.62 ± 16.93
llama 1B TQ1_0 — 1.69 bpw	401.50 MiB	1.24 B	8	pp512	728.69 ± 19.88
llama 1B TQ1_0 — 1.69 bpw	401.50 MiB	1.24 B	8	tg128	138.87 ± 0.96
llama 1B TQ2_0 — 2.06 bpw	445.00 MiB	1.24 B	8	pp512	689.25 ± 16.61
llama 1B TQ2_0 — 2.06 bpw	445.00 MiB	1.24 B	8	tg128	175.06 ± 1.30

Based on the results above, STQ1_0 has the smallest footprint of the three ternary formats (~11% smaller than TQ1_0, ~20% smaller than TQ2_0) and is faster than TQ1_0 at token generation (tg128: 147.47 vs 138.87 t/s), while pp512 is within noise. TQ2_0 remains the fastest ternary format at tg128 (175.06 t/s). Compared to the 1-bit binary Q1_0 baseline, STQ1_0 is ~6% larger but delivers ~35% higher tg128 throughput. In addition, per the Sherry paper (Fig. 6), 1.25-bit ternary matches 1.67-bit ternary accuracy at 25% fewer bits, while the 1-bit binary configuration shows a ~3 pp accuracy gap.
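The percentages quoted here follow directly from the table. The short script below recomputes them from the reported sizes and tg128 means; the dictionary names are just for illustration, and the ± run-to-run spreads are ignored.

```python
# Recompute the relative size and tg128 figures from the benchmark table.
size_mib = {"STQ1_0": 358.00, "Q1_0": 336.25, "TQ1_0": 401.50, "TQ2_0": 445.00}
tg128 = {"STQ1_0": 147.47, "Q1_0": 109.62, "TQ1_0": 138.87, "TQ2_0": 175.06}

def smaller_by(a, b):
    """Fraction by which format a's file is smaller than format b's."""
    return 1.0 - size_mib[a] / size_mib[b]

def faster_by(a, b):
    """Fractional tg128 speedup of format a over format b (negative if slower)."""
    return tg128[a] / tg128[b] - 1.0

if __name__ == "__main__":
    print(f"size vs TQ1_0: {smaller_by('STQ1_0', 'TQ1_0'):.1%} smaller")
    print(f"size vs TQ2_0: {smaller_by('STQ1_0', 'TQ2_0'):.1%} smaller")
    print(f"size vs Q1_0:  {-smaller_by('STQ1_0', 'Q1_0'):.1%} larger")
    print(f"tg128 vs TQ1_0: {faster_by('STQ1_0', 'TQ1_0'):+.1%}")
    print(f"tg128 vs Q1_0:  {faster_by('STQ1_0', 'Q1_0'):+.1%}")
    print(f"tg128 vs TQ2_0: {faster_by('STQ1_0', 'TQ2_0'):+.1%}")
```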

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes — for navigating the llama.cpp codebase and for sentence polishing of this PR description.

CISC · 2026-05-08T13:20:10Z

This seems interesting, @khosravipasha something for you?

Naming-wise it should probably be STQ1_0.

Green-Sky · 2026-05-08T14:57:29Z

@sjl623 If you are doing benchmarks, you could also add Q1_0 to the list, since it is another close quant with 1.125 bits-per-weight.

sjl623 · 2026-05-08T16:40:18Z

> This seems interesting, @khosravipasha something for you?
>
> Naming-wise it should probably be STQ1_0.

Hi @CISC, thanks for taking a look! We've renamed the format to STQ1_0 according to your suggestion. Happy to hear further thoughts :)

sjl623 · 2026-05-08T16:51:38Z

> @sjl623 If you are doing benchmarks, you could also add Q1_0 to the list, since it is another close quant with 1.125 bits-per-weight.

Good call, thanks. I've added Q1_0 to the benchmark table. STQ1_0 actually still comes out ahead on tg128, which was a bit unexpected given Q1_0 is binary and has lower bpw. Separately, per the Sherry paper, the 1.25-bit ternary setting also has better accuracy than the 1-bit binary baseline.

sjl623 · 2026-05-08T17:13:37Z

Hi @ggerganov, could you take a look at this PR when you have time? Thanks!

cherish-ltt · 2026-05-09T06:42:23Z

sudo ./build/bin/llama-completion --model /home/xxx/xxx/Hy-MT1.5-1.8B-1.25bit.gguf -p "Translate the following segment into Chinese, without additional explanation：Hello" --jinja -ngl 0 -n 64 -st


warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support
warning: consult docs/build.md for compilation instructions
main: llama backend init
main: load the model and apply lora adapter, if any
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
gguf_init_from_file_ptr: tensor 'blk.0.attn_k_norm.weight' has offset 203154464, expected 203572256
gguf_init_from_file_ptr: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from /home/xxx/xxx/Hy-MT1.5-1.8B-1.25bit.gguf
llama_model_load_from_file_impl: failed to load model
common_fit_params: encountered an error while trying to fit params to free device memory: failed to load model
common_fit_params: fitting params to free memory took 0.08 seconds
gguf_init_from_file_ptr: tensor 'blk.0.attn_k_norm.weight' has offset 203154464, expected 203572256
gguf_init_from_file_ptr: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from /home/xxx/xxx/Hy-MT1.5-1.8B-1.25bit.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/home/xxx/xxx/Hy-MT1.5-1.8B-1.25bit.gguf'
main: error: unable to create context

I compiled llama.cpp according to the docs, but running it produces the error above. Has the model not been updated yet?

sjl623 · 2026-05-09T07:06:06Z

Hi, thanks for the report! Quick check: did you download the .gguf from HF directly, or convert one yourself per the docs?

The HF gguf was produced with an older quant type ID and isn't loadable as-is. Please re-convert following the doc instructions (we'll refresh the HF weights later).

Also note: currently only ARM (Apple M-series) is supported; x86 isn't yet.

cherish-ltt · 2026-05-09T07:11:57Z

I downloaded the gguf directly from HF and compiled llama.cpp myself; my platform is x86.
Thank you for your work.

Little0o0 · 2026-05-09T10:13:11Z

@cherish-ltt The gguf file on HF had not been updated yesterday; it is updated now. Also, I think STQ1_0 does not support x86 currently?

dima-xd · 2026-05-10T12:12:17Z

Using Android app demo from https://huggingface.co/tencent/Hy-MT1.5-1.8B-1.25bit-GGUF:

15:07:56.537  E  gguf_init_from_file_impl: tensor 'blk.0.attn_k.weight' has invalid ggml type 42. should be in [0, 42)
15:07:56.537  E  gguf_init_from_file_impl: failed to read tensor info
15:07:56.551  E  llama_model_load: error loading model: llama_model_loader: failed to load model from /data/user/0/com.tencent.hunyuan.angelslim/files/models/HY1.8B-MT-1.25bit.gguf
15:07:56.551  E  llama_model_load_from_file_impl: failed to load model
15:07:56.551  E  Error loading model
                 /data/user/0/com.tencent.hunyuan.angelslim/files/models/HY1.8B-MT-1.25bit.gguf
                 com.arm.aichat.UnsupportedArchitectureException

Little0o0 · 2026-05-10T14:08:20Z

@dima-xd, the APK's source gguf was replaced by mistake yesterday; it should be OK now. Just reinstall the app and re-download the gguf.

dima-xd · 2026-05-10T14:14:28Z

Thanks, it works now!

tritueviet · 2026-05-11T15:44:20Z

./build/bin/llama-completion \
    --model ../model_zoo/Hy-MT1.5-1.8B-1.25bit-GGUF/Hy-MT1.5-1.8B-1.25bit.gguf \
    -p "Translate the following segment into Chinese, without additional explanation：Hello" \
    --jinja \
    -ngl 0 \
    -n 64 -st
warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support
warning: consult docs/build.md for compilation instructions
main: llama backend init
main: load the model and apply lora adapter, if any
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: | - Host | 17607 = 435 + 16384 + 788 |
common_params_fit_impl: projected to use 17607 MiB of host memory vs. 15908 MiB of total host memory
common_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 2723 MiB
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: | - Host | 939 = 435 + 256 + 247 |
common_params_fit_impl: context size reduced from 262144 to 219904 -> need 2728 MiB less memory in total
common_params_fit_impl: entire model can be fit by reducing context
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 0.55 seconds
llama_model_loader: loaded meta data with 40 key-value pairs and 354 tensors from ../model_zoo/Hy-MT1.5-1.8B-1.25bit-GGUF/Hy-MT1.5-1.8B-1.25bit.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = hunyuan-dense
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.800000
llama_model_loader: - kv 4: general.sampling.temp f32 = 0.700000
llama_model_loader: - kv 5: general.name str = Hy MT1.5 1.8B 1.25bit
llama_model_loader: - kv 6: general.finetune str = 1.25bit
llama_model_loader: - kv 7: general.basename str = Hy-MT1.5
llama_model_loader: - kv 8: general.size_label str = 1.8B
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = HY MT1.5 1.8B
llama_model_loader: - kv 11: general.base_model.0.organization str = Tencent
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/tencent/HY-MT1...
llama_model_loader: - kv 13: general.tags arr[str,5] = ["translation", "hy-mt", "quant", "1....
llama_model_loader: - kv 14: general.languages arr[str,1] = ["multilingual"]
llama_model_loader: - kv 15: hunyuan-dense.block_count u32 = 32
llama_model_loader: - kv 16: hunyuan-dense.context_length u32 = 262144
llama_model_loader: - kv 17: hunyuan-dense.embedding_length u32 = 2048
llama_model_loader: - kv 18: hunyuan-dense.feed_forward_length u32 = 6144
llama_model_loader: - kv 19: hunyuan-dense.attention.head_count u32 = 16
llama_model_loader: - kv 20: hunyuan-dense.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: hunyuan-dense.rope.freq_base f32 = 11158840.000000
llama_model_loader: - kv 22: hunyuan-dense.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 23: hunyuan-dense.attention.key_length u32 = 128
llama_model_loader: - kv 24: hunyuan-dense.attention.value_length u32 = 128
llama_model_loader: - kv 25: hunyuan-dense.rope.scaling.type str = none
llama_model_loader: - kv 26: hunyuan-dense.rope.scaling.factor f32 = 1.000000
llama_model_loader: - kv 27: hunyuan-dense.rope.scaling.original_context_length u32 = 262144
llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 29: tokenizer.ggml.pre str = hunyuan-dense
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,120818] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,120818] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,119758] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 120000
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 120020
llama_model_loader: - kv 35: tokenizer.ggml.padding_token_id u32 = 120002
llama_model_loader: - kv 36: tokenizer.ggml.seperator_token_id u32 = 120007
llama_model_loader: - kv 37: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 41
llama_model_loader: - type f32: 129 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type stq1_0: 224 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = STQ1_0 - 1.31 bpw ternary
print_info: file size = 435.61 MiB (2.04 BPW)
load: 0 unused tokens
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 120020 ('<｜hy_place▁holder▁no▁2｜>')
load: special tokens cache size = 818
load: token to piece cache size = 0.8089 MB
print_info: arch = hunyuan-dense
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 32
print_info: n_head = 16
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: f_attn_value_scale = 0.0000
print_info: n_ff = 6144
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = none
print_info: freq_base_train = 11158840.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 1.8B
print_info: model params = 1.79 B
print_info: general.name = Hy MT1.5 1.8B 1.25bit
print_info: vocab type = BPE
print_info: n_vocab = 120818
print_info: n_merges = 119758
print_info: BOS token = 120000 '<｜hy_begin▁of▁sentence｜>'
print_info: EOS token = 120020 '<｜hy_place▁holder▁no▁2｜>'
print_info: SEP token = 120007 '<｜hy_Assistant｜>'
print_info: PAD token = 120002 '<｜hy_▁pad▁｜>'
print_info: LF token = 185 'Ċ'
print_info: EOG token = 120020 '<｜hy_place▁holder▁no▁2｜>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: CPU_Mapped model buffer size = 435.61 MiB
.........................................................
common_init_result: added <｜hy_place▁holder▁no▁2｜> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 219904
llama_context: n_ctx_seq = 219904
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 11158840.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (219904) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.46 MiB
llama_kv_cache: CPU KV buffer size = 13744.00 MiB
Killed

still error

sjl623 · 2026-05-11T17:15:33Z

Looks like it's being killed due to OOM. The default context (262144) makes the KV cache ~16 GiB, which may exceed host memory.

Could you try again with -c 4096? That brings the KV cache down to ~256 MiB, which is plenty for a short translation:

./build/bin/llama-completion --model ../model_zoo/Hy-MT1.5-1.8B-1.25bit-GGUF/Hy-MT1.5-1.8B-1.25bit.gguf \
-p "Translate the following segment into Chinese, without additional explanation：Hello" \
--jinja -ngl 0 -c 4096 -n 64 -st
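
For reference, the KV cache figures quoted here can be reproduced with a little arithmetic. This is a rough sketch rather than llama.cpp's own accounting: it assumes an f16 K/V cache (2 bytes per element) and takes n_layer = 32 and n_embd_k_gqa = n_embd_v_gqa = 512 from the print_info output in this thread. The function name kv_cache_mib is made up for illustration.

```python
def kv_cache_mib(n_ctx, n_layer=32, n_embd_k_gqa=512, n_embd_v_gqa=512,
                 bytes_per_elem=2):
    """Estimate the K+V cache size in MiB (assumes an f16 cache by default)."""
    # Each cached token stores one K row and one V row per layer.
    bytes_per_token = n_layer * (n_embd_k_gqa + n_embd_v_gqa) * bytes_per_elem
    return n_ctx * bytes_per_token / (1024 * 1024)


if __name__ == "__main__":
    # Trained context, the auto-fitted context from the log, and -c 4096.
    for n_ctx in (262144, 219904, 4096):
        print(f"n_ctx = {n_ctx:>6}: {kv_cache_mib(n_ctx):.2f} MiB")
```

Under these assumptions each token costs 64 KiB, so 262144 tokens come to 16384 MiB, the auto-fitted 219904 tokens to 13744 MiB (the buffer size printed just before the process was killed), and 4096 tokens to 256 MiB.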

tritueviet · 2026-05-12T04:15:16Z

./build/bin/llama-completion --model model_zoo/Hy-MT1.5-1.8B-STQ1_0.gguf -p "Translate the following segment into Vietnamese, without additional explanation. Hello " --jinja -ngl 0 -c 4096 -n 64 -st
warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support
warning: consult docs/build.md for compilation instructions
main: llama backend init
main: load the model and apply lora adapter, if any
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: | - Host | 939 = 435 + 256 + 247 |
common_params_fit_impl: projected to use 939 MiB of host memory vs. 15625 MiB of total host memory
common_params_fit_impl: will leave 14686 >= 1024 MiB of system memory, no changes needed
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 0.21 seconds
llama_model_loader: loaded meta data with 40 key-value pairs and 354 tensors from model_zoo/Hy-MT1.5-1.8B-STQ1_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = hunyuan-dense
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.800000
llama_model_loader: - kv 4: general.sampling.temp f32 = 0.700000
llama_model_loader: - kv 5: general.name str = Hy MT1.5 1.8B 1.25bit
llama_model_loader: - kv 6: general.finetune str = 1.25bit
llama_model_loader: - kv 7: general.basename str = Hy-MT1.5
llama_model_loader: - kv 8: general.size_label str = 1.8B
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = HY MT1.5 1.8B
llama_model_loader: - kv 11: general.base_model.0.organization str = Tencent
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/tencent/HY-MT1...
llama_model_loader: - kv 13: general.tags arr[str,5] = ["translation", "hy-mt", "quant", "1....
llama_model_loader: - kv 14: general.languages arr[str,1] = ["multilingual"]
llama_model_loader: - kv 15: hunyuan-dense.block_count u32 = 32
llama_model_loader: - kv 16: hunyuan-dense.context_length u32 = 262144
llama_model_loader: - kv 17: hunyuan-dense.embedding_length u32 = 2048
llama_model_loader: - kv 18: hunyuan-dense.feed_forward_length u32 = 6144
llama_model_loader: - kv 19: hunyuan-dense.attention.head_count u32 = 16
llama_model_loader: - kv 20: hunyuan-dense.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: hunyuan-dense.rope.freq_base f32 = 11158840.000000
llama_model_loader: - kv 22: hunyuan-dense.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 23: hunyuan-dense.attention.key_length u32 = 128
llama_model_loader: - kv 24: hunyuan-dense.attention.value_length u32 = 128
llama_model_loader: - kv 25: hunyuan-dense.rope.scaling.type str = none
llama_model_loader: - kv 26: hunyuan-dense.rope.scaling.factor f32 = 1.000000
llama_model_loader: - kv 27: hunyuan-dense.rope.scaling.original_context_length u32 = 262144
llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 29: tokenizer.ggml.pre str = hunyuan-dense
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,120818] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,120818] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,119758] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 120000
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 3
llama_model_loader: - kv 35: tokenizer.ggml.padding_token_id u32 = 120002
llama_model_loader: - kv 36: tokenizer.ggml.seperator_token_id u32 = 120007
llama_model_loader: - kv 37: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 41
llama_model_loader: - type f32: 129 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type stq1_0: 224 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = STQ1_0 - 1.31 bpw ternary
print_info: file size = 435.61 MiB (2.04 BPW)
load: 0 unused tokens
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 3 ('$')
load: special tokens cache size = 818
load: token to piece cache size = 0.8089 MB
print_info: arch = hunyuan-dense
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 32
print_info: n_head = 16
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: f_attn_value_scale = 0.0000
print_info: n_ff = 6144
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = none
print_info: freq_base_train = 11158840.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 1.8B
print_info: model params = 1.79 B
print_info: general.name = Hy MT1.5 1.8B 1.25bit
print_info: vocab type = BPE
print_info: n_vocab = 120818
print_info: n_merges = 119758
print_info: BOS token = 120000 '<｜hy_begin▁of▁sentence｜>'
print_info: EOS token = 3 '$'
print_info: SEP token = 120007 '<｜hy_Assistant｜>'
print_info: PAD token = 120002 '<｜hy_▁pad▁｜>'
print_info: LF token = 185 'Ċ'
print_info: EOG token = 3 '$'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: CPU_Mapped model buffer size = 435.61 MiB
.........................................................
common_init_result: added $ logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 11158840.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.46 MiB
llama_kv_cache: CPU KV buffer size = 256.00 MiB
llama_kv_cache: size = 256.00 MiB ( 4096 cells, 32 layers, 1/1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve: CPU compute buffer size = 247.97 MiB
sched_reserve: graph nodes = 1127
sched_reserve: graph splits = 1
sched_reserve: reserve took 2.11 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<｜hy_begin▁of▁sentence｜>You are a helpful assistant<｜hy_place▁holder▁no▁3｜><｜hy_User｜>Hello<｜hy_Assistant｜>Hi there<｜hy_place▁holder▁no▁2｜><｜hy_User｜>How are you?<｜hy_Assistant｜>

system_info: n_threads = 4 (n_threads_batch = 4) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

sampler seed: 3567722247
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 20, top_p = 0.800, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 64, n_keep = 0

Translate the following segment into Vietnamese, without additional explanation. Hello Xin chào.َايْتُمْ بَعْضُنُمْ بَعْلَيٰ ٌبَعْلَيٰ ٌبَعْلَيٰ… الْمُنْ

common_perf_print: sampling time = 13.33 ms
common_perf_print: samplers time = 4.08 ms / 80 tokens
common_perf_print: load time = 830.89 ms
common_perf_print: prompt eval time = 4459.01 ms / 16 tokens ( 278.69 ms per token, 3.59 tokens per second)
common_perf_print: eval time = 18632.27 ms / 63 runs ( 295.75 ms per token, 3.38 tokens per second)
common_perf_print: total time = 23108.99 ms / 79 tokens
common_perf_print: unaccounted time = 4.38 ms / 0.0 % (total - sampling - prompt eval - eval) / (total)
common_perf_print: graphs reused = 62
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: | - Host | 939 = 435 + 256 + 247 |

Performance and results are not good. Does it only run on ARM?

sjl623 · 2026-05-12T06:08:11Z

Performance and results are not good. Does it only run on ARM?

Yes, currently the acceleration only supports ARM chips. You can try it on an M-series MacBook, or download our APK to experience it directly.
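
The system_info line in the log above lists only x86 features (SSE3, AVX2, and so on). A small parser makes that check explicit. This is a sketch, not part of llama.cpp: parse_system_info is a hypothetical helper, and the flag name NEON is an assumption (DOTPROD is the name reported later in this thread).

```python
def parse_system_info(line):
    """Extract 'NAME = value' feature flags from a llama.cpp system_info line."""
    flags = {}
    for field in line.split("|"):
        name, sep, value = field.partition("=")
        name, value = name.strip(), value.strip()
        # Keep only simple 'NAME = <int>' fields; the leading thread-count
        # field has a non-numeric value and is skipped.
        if sep and value.isdigit():
            # Drop a backend label such as the 'CPU :' in 'CPU : SSE3'.
            flags[name.split(":")[-1].strip()] = int(value)
    return flags


def has_arm_simd(flags, names=("NEON", "DOTPROD")):
    """True if any of the (assumed) ARM SIMD flag names is reported as enabled."""
    return any(flags.get(name, 0) for name in names)
```

Applied to the system_info line from the x86 run above, has_arm_simd(parse_system_info(line)) is False, which is consistent with the ARM-only kernel not being used there.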

tritueviet · 2026-05-12T15:50:19Z

approved.

CISC · 2026-05-22T11:15:24Z

I don't think we can justify the introduction of GGML_TYPE_Q8_K_PLANAR.

sjl623 · 2026-05-26T09:56:53Z

I don't think we can justify the introduction of GGML_TYPE_Q8_K_PLANAR.

Thanks for the feedback! We've retrained the model with stride-16 3:4 sparsity so that each NEON lane naturally aligns with the standard Q8_K layout after unpack. No repack, no deinterleave, no new type. You can find more details in the "STQ1_0: Stride-16 Sparsity Layout" section of the PR description. Please let us know if there are any further concerns :)
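
As an illustration only, here is one way a "stride-16 3:4" constraint could be checked on a row of ternary weights. The grouping rule is an assumption made for this sketch (the four weights of a group sit 16 positions apart inside each run of 64 weights). The authoritative definition is the one in the PR description, and satisfies_3_of_4 is a hypothetical name.

```python
def satisfies_3_of_4(weights, stride=16):
    """Return True if every 4-weight group contains exactly one zero.

    ASSUMPTION: a group consists of the weights at offsets j, j + stride,
    j + 2*stride and j + 3*stride inside each run of 4*stride weights.
    With stride=1 this degenerates to four contiguous weights per group.
    """
    span = 4 * stride
    if len(weights) % span != 0:
        return False
    for base in range(0, len(weights), span):
        for j in range(stride):
            group = [weights[base + j + k * stride] for k in range(4)]
            if sum(1 for w in group if w == 0) != 1:
                return False
    return True
```

Under this reading, a 64-weight run whose first 16 entries are zero and whose remaining 48 are ±1 passes with stride=16, while the contiguous pattern [0, 1, 1, -1] repeated 16 times passes only with stride=1.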

sjl623 · 2026-06-01T06:14:41Z

Hi @CISC @ggerganov, CI is now green on my fork: https://github.com/sjl623/llama.cpp/actions/runs/26734514829. There's also some interest from the community, e.g.:

- [Feature Request] Support for Tencent Hy-MT2-30B (Requires upstream STQ kernel PR #22836)？ (JamePeng/llama-cpp-python#134)
- How do I create a Modelfile for ollama? (Tencent/AngelSlim#316)

Could you let me know how best to move this forward? Happy to make any changes needed. Thanks!

LFF28 · 2026-06-03T06:10:19Z

Build and Test Report (GCC14 + DOTPROD Enabled)

@sjl623 Hi, I'm getting incorrect translation outputs when using the 1.25-bit GGUF quantization (on both the NEON and scalar paths). Could you please take a look?

Scope

Repo: llama.cpp
Commit: 797583a
Date: 2026-06-03
Built targets: llama-completion, llama-bench

Model Checksums (SHA256)

All models are downloaded.
./Hy-MT2-1.8B-1.25Bit.gguf: fda3e7462018e35188356b2cbb0726ea18ec9c4f104c357f6232c3f780df4135
./Hy-MT1.5-1.8B-1.25bit.gguf: 987121bc98dd7107078019f63e72447d67b224efe97da75811e74529e22e3525
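
To rule out a corrupted download when reproducing this, the two digests above can be re-checked locally. The sketch below uses only Python's standard hashlib. The helper names and the assumption that both files sit in the current directory are illustrative.

```python
import hashlib
from pathlib import Path

# Digests copied from the report above.
EXPECTED = {
    "Hy-MT2-1.8B-1.25Bit.gguf":
        "fda3e7462018e35188356b2cbb0726ea18ec9c4f104c357f6232c3f780df4135",
    "Hy-MT1.5-1.8B-1.25bit.gguf":
        "987121bc98dd7107078019f63e72447d67b224efe97da75811e74529e22e3525",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large GGUF files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True if it exists and its digest matches."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        results[name] = path.is_file() and sha256_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        print(f"{name}: {'OK' if ok else 'MISSING or MISMATCH'}")
```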

Environment

OS: Linux 6.17.0-1014-nvidia (Ubuntu), aarch64
CPU: 20 cores (Cortex-X925 + Cortex-A725)
Memory: 119 GiB RAM, 15 GiB swap
Toolchain: cmake 3.28.3, gcc/g++ 14.2.0

Build (GCC14, forced ARM dotprod)

Command:

cmake -S . -B build-gcc14-dotprod \
  -DCMAKE_C_COMPILER=gcc-14 \
  -DCMAKE_CXX_COMPILER=g++-14 \
  -DGGML_NATIVE=OFF \
  -DGGML_CPU_ARM_ARCH=armv8.2-a+dotprod && \
cmake --build build-gcc14-dotprod --target llama-completion llama-bench -j"$(nproc)"

Result:

Success: both targets built
- Built target llama-completion
- Built target llama-bench
DOTPROD compile-path evidence:
- Checking flags: -march=armv8.2-a+dotprod
- Performing Test HAVE_DOTPROD - Success
- Adding CPU backend variant ggml-cpu: -march=armv8.2-a+dotprod
Other warnings:
- OpenSSL not found, HTTPS support disabled

Runtime Tests

1) llama-completion

Command:

./build-gcc14-dotprod/bin/llama-completion --model ./Hy-MT2-1.8B-1.25Bit.gguf -p "Translate the following segment into Chinese, without additional explanation：Hello" --jinja -ngl 0 -c 4096 -n 64 -st

Result:

Program runs and completes inference.
Runtime dotprod evidence: system_info contains DOTPROD = 1.
Quality check fails for this prompt: expected Chinese translation, but output is garbled/mixed text.
Key warnings:
- no usable GPU found, --gpu-layers ignored
- special_eos_id is not in special_eog_ids (tokenizer config may be incorrect)

Observed garbled output (exact excerpt):

emberemberemberೆೆೆೆRL慈 dye dye�RLRL��重的 MeadRLRLRLRLRLRLRLRLRLRLRLRLRLRLRLRLRLRLRLRLRLRLRL�لیلیلیلیلیلیلیلیلی码码།美美美美美美美美美美

Conclusion

Build status: PASS (both requested targets).
DOTPROD status: PASS (compile-time check passed and runtime reports DOTPROD = 1).
Runtime status: PASS (both binaries start/run).
Functional correctness for translation prompt: FAIL (garbled output).

sjl623 · 2026-06-03T12:10:44Z

Thanks for your interest! The 1.25-bit packing format was updated (stride-16) in this PR, so the old GGUF no longer works. The new Hy-MT1.5-1.25bit GGUF is ready; please re-download and retry: https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit-GGUF/blob/main/Hy-MT1.5-1.8B-1.25bit.gguf

The Hy-MT2 1.25-bit (stride-16) model is still training and will be released soon.

zhengqwe · 2026-06-04T14:23:27Z

The performance is abnormal. It's only about 1 token/s.

./llama-server.exe -m Hy-MT1.5-1.8B-1.25bit.gguf --temp 0.7 --top-p 0.6 --top-k 20 --repeat-penalty 1.05 -c 4096 --no-mmap -t 4 --cache-ram 0 --no-cache-idle-slots --jinja
0.00.004.907 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.004.909 I device_info:
0.00.004.914 I   - CPU     : Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (12122 MiB, 4206 MiB free)
0.00.004.956 I system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.004.959 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.005.016 I srv          init: running without SSL
0.00.005.046 I srv          init: using 8 threads for HTTP server
0.00.005.268 I srv         start: binding port with default address family
0.00.026.913 I srv  llama_server: loading model
0.00.026.947 I srv    load_model: loading model 'Hy-MT1.5-1.8B-1.25bit.gguf'
0.00.027.007 I common_init_result: fitting params to device memory ...
0.00.027.008 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.362.519 I common_params_fit_impl: projected to use 935 MiB of host memory vs. 12122 MiB of total host memory
0.00.603.582 W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
0.00.852.259 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.02.559.237 I srv    load_model: initializing slots, n_slots = 4
0.04.264.987 W common_speculative_init: no implementations specified for speculative decoding
0.04.264.996 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 4096
0.04.264.997 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 4096
0.04.264.998 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 4096
0.04.264.998 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 4096
0.04.265.140 I srv    load_model: prompt cache is disabled - use `--cache-ram N` to enable it
0.04.265.140 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.04.265.141 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 256
0.04.271.652 I init: chat template, example_format: '<｜hy_begin▁of▁sentence｜>You are a helpful assistant<｜hy_place▁holder▁no▁3｜><｜hy_User｜>Hello<｜hy_Assistant｜>Hi there<｜hy_place▁holder▁no▁2｜><｜hy_User｜>How are you?<｜hy_Assistant｜>'
0.04.274.094 I srv          init: init: chat template, thinking = 0
0.04.274.111 I srv  llama_server: model loaded
0.04.274.112 I srv  llama_server: server is listening on http://127.0.0.1:8080
0.04.274.127 I srv  update_slots: all slots are idle
0.15.267.681 I srv  params_from_: Chat format: peg-native
0.15.267.938 I slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
0.15.268.535 I slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
1.06.306.853 I slot print_timing: id  3 | task 0 | prompt eval time =   30965.11 ms /    34 tokens (  910.74 ms per token,     1.10 tokens per second)
1.06.306.857 I slot print_timing: id  3 | task 0 |        eval time =   20073.15 ms /    22 tokens (  912.42 ms per token,     1.10 tokens per second)
1.06.306.859 I slot print_timing: id  3 | task 0 |       total time =   51038.26 ms /    56 tokens
1.06.306.862 I slot print_timing: id  3 | task 0 |    graphs reused =         21
1.06.306.892 I slot      release: id  3 | task 0 | stop processing: n_tokens = 55, truncated = 0
1.06.306.901 I srv  update_slots: all slots are idle

sjl623 · 2026-06-04T14:31:15Z

> The performance is abnormal. It's only about 1 token/s.

Looks like you're running on x86 (Intel i3, AVX2). At the moment the accelerated kernel is ARM-only, so on x86 it falls back to a naive C implementation, which is much slower. An optimized x86 kernel is already on the way.

dajia6462-lgtm · 2026-06-05T01:51:59Z

Can we integrate Hy-MT with STQ acceleration on Android now?

devYRPauli · 2026-06-05T02:06:12Z

Took a look at this on Apple Silicon since the failing CI jobs are ARM/x86 high-perf and I have an M1 Pro handy. Sharing a datapoint and where I think the bug is.

On M1 (ARMv8.2, __ARM_FEATURE_DOTPROD confirmed, the linked symbol really does compile down to sdot + tbl), the STQ1_0 NEON vec_dot is bit-exact against the scalar reference: 4000 trials of 1024 elements, max abs error 0, NMSE 0, and clean under -fsanitize=undefined. I also checked the vqtbl2q OR-emulation directly and it matches a real vqtbl2q_u8 for all 256 indices. So the algorithm and data layout look correct, and I could not reproduce the wrong-results report on Apple clang.

That points the finger at the table-lookup emulation on the other toolchains rather than the algorithm. My main suspect for the red cpu-x64-high-perf job: the vqtbl2q emulation in ggml/src/ggml-cpu/arch/arm/quants.c (around lines 1645-1646) leans on out-of-range tbl indices returning 0. That holds for ARM tbl, but x86 _mm_shuffle_epi8 only zeroes a byte when the index's high bit is set, not for general out-of-range values 16..127. If the x86 path lowers the two-register lookup through pshufb, indices in that range won't zero and will read the wrong codebook byte. Worth a look at the sign-expansion table too (table_b2b_0), since the kernel depends on the 0x10 variant so the sign bit lands at bit 4 of the codebook index.
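To make the TBL vs. PSHUFB difference concrete, here is a small scalar C model, offered as an illustration only. It is not the PR's code: `tbl16_arm`, `tbl16_pshufb`, `lookup32_or` and `count_mismatches` are hypothetical names, and building the 32-entry lookup as the OR of two 16-entry lookups is an assumption about the general shape of such an emulation. The two `tbl16_*` functions model the documented per-byte behavior of ARM `TBL` (single table register) and x86 `_mm_shuffle_epi8`.

```c
#include <stdint.h>

/* Scalar model of one byte lane of ARM TBL over a single 16-byte table:
 * any index >= 16 produces 0. */
static uint8_t tbl16_arm(const uint8_t *table, uint8_t idx) {
    return idx < 16 ? table[idx] : 0;
}

/* Scalar model of one byte lane of x86 PSHUFB (_mm_shuffle_epi8):
 * the lane is zeroed only when bit 7 of the index is set; otherwise the
 * low 4 bits select a byte, so indices 16..127 wrap back into the table. */
static uint8_t tbl16_pshufb(const uint8_t *table, uint8_t idx) {
    return (idx & 0x80) ? 0 : table[idx & 0x0F];
}

typedef uint8_t (*tbl16_fn)(const uint8_t *table, uint8_t idx);

/* Hypothetical OR-emulation of a 32-byte (two-register) lookup built from
 * two 16-byte lookups. It is only correct if out-of-range indices give 0. */
static uint8_t lookup32_or(tbl16_fn tbl, const uint8_t *table, uint8_t idx) {
    uint8_t lo = tbl(table, idx);
    /* idx - 16 wraps to 240..255 when idx < 16, which both models zero. */
    uint8_t hi = tbl(table + 16, (uint8_t)(idx - 16));
    return (uint8_t)(lo | hi);
}

/* Counts how many of the 32 indices disagree with a plain array load. */
static int count_mismatches(tbl16_fn tbl) {
    uint8_t table[32];
    for (int i = 0; i < 32; ++i) {
        /* Low half has bit 7 set, high half does not, so a stray low-half
         * byte OR-ed into a high-half result is always visible. */
        table[i] = (uint8_t)(i < 16 ? (0x80 | i) : (i - 16));
    }
    int bad = 0;
    for (int i = 0; i < 32; ++i) {
        if (lookup32_or(tbl, table, (uint8_t)i) != table[i]) {
            ++bad;
        }
    }
    return bad;
}
```

With this test table, `count_mismatches(tbl16_arm)` is 0, while `count_mismatches(tbl16_pshufb)` is 16: every index in 16..31 picks up a wrapped byte from the low half. Whether the PR's x86 build actually goes through such a path is exactly the open question in the comment above; the sketch only shows why the assumption would matter if it did.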

One other thing: GGML_TYPE_STQ1_0 isn't in the test-backend-ops type lists (tests/test-backend-ops.cpp, the all_types/base_types/other_types arrays), same as TQ1_0/TQ2_0, so the standard correctness gate never actually exercises this kernel. Adding it there would let the high-perf runners catch exactly this kind of divergence directly instead of relying on field reports.

Caveat: I can only test Apple clang on M1, so this localizes the issue to the x86/GCC table-lookup emulation rather than clearing those paths. Happy to run anything else on ARM if it helps.

sjl623 · 2026-06-05T03:04:53Z

> Can we integrate Hy-MT with STQ acceleration on Android now?

Thanks for your interest. The answer is YES! Two ways to try it on Android:

Pull our branch and build the examples/llama.android project, then load our public HyMT GGUF weights.
Or just download our demo APK and try it directly: https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit/blob/main/Hy-MT-demo.apk

LFF28 · 2026-06-05T03:34:50Z

> Thanks for your interest! The 1.25-bit packing format was updated (stride-16) in this PR, so the old GGUF no longer works. The new Hy-MT1.5-1.25bit GGUF is ready; please re-download and retry: https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit-GGUF/blob/main/Hy-MT1.5-1.8B-1.25bit.gguf
>
> The Hy-MT2 1.25-bit (stride-16) model is training and will be released soon.

Thanks! I got correct translation output with this GGUF. FYI, the SHA256 checksum of this GGUF is 93e025c93cc082e73a3f142b757623a8b9cf541c020a8013ca4ee669556860ab.

sjl623 · 2026-06-05T03:47:02Z

Thanks a lot for digging into this and confirming the correctness of the kernel implementation. I really appreciate the result that the NEON kernel is bit-for-bit identical to the scalar reference, plus the clean UBSan (-fsanitize=undefined) run. 🙏

The two red jobs seem unrelated to this PR:

cpu-x64-high-perf: the runner seems to have been canceled mid-run (exit 130) rather than hitting an actual test failure.
cpu-arm64-high-perf: this seems to be the MUL_MAT_HADAMARD failure, which looks like it was just fixed upstream in "ggml-cpu: use runtime SVE width in FWHT" (#24059).

I'll rebase to latest master and try again.

1.3125 bpw quantization. Each block of 256 elements stores 64 groups of 4 ternary lanes (-1/0/+1) with the constraint that every group has exactly one zero and three non-zero lanes of identical magnitude. The ternary pattern is encoded as a 4-bit codebook index plus a 1-bit global sign, yielding 32 patterns over 4 lanes (5 bits / 4 lanes = 1.25 bpw payload), plus a per-block fp16 scale (0.0625 bpw) for 1.3125 bpw total.

Components:
- block_stq_0 layout and codebook in ggml-common.h
- reference quantize/dequantize/validate in ggml-quants.{h,c}
- generic CPU vec_dot in ggml-cpu/quants.{h,c}
- ARM NEON vec_dot using vqtbl2q for codebook lookup, vdotq_s32 for accumulation, plus vld4q-based in-place repack of Q8_K activations
- enum slots: GGML_TYPE_STQ_0 = 42, LLAMA_FTYPE_MOSTLY_STQ_0 = 41, GGMLQuantizationType.STQ_0 = 42, LlamaFileType.MOSTLY_STQ_0 = 41
- llama-quantize CLI option "STQ_0"
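The size arithmetic and the "32 patterns" count above can be checked with a short C sketch. Everything named here is hypothetical: `block_stq_sketch` only mirrors the sizes stated in the PR description (32 bytes of 4-bit indices, 8 bytes of sign bits, one fp16 scale), with field order assumed, and the codebook ordering in `stq_decode_group` is an illustrative enumeration, not the table the PR ships.

```c
#include <assert.h>
#include <stdint.h>

#define STQ_BLOCK_ELEMS 256                  /* weights per block */
#define STQ_GROUPS      (STQ_BLOCK_ELEMS / 4) /* 64 groups of 4 lanes */

/* Hypothetical mirror of the block; names and field order are assumed. */
typedef struct {
    uint16_t d;                    /* fp16 scale, kept as raw bits here */
    uint8_t  qs[STQ_GROUPS / 2];   /* 64 x 4-bit codebook indices = 32 bytes */
    uint8_t  sign[STQ_GROUPS / 8]; /* 64 x 1-bit global signs     =  8 bytes */
} block_stq_sketch;

/* Bits per weight implied by the struct: 8 * 42 / 256. */
static double stq_bits_per_weight(void) {
    return 8.0 * (double) sizeof(block_stq_sketch) / STQ_BLOCK_ELEMS;
}

/* Illustrative codebook ordering (an assumption): bits 3..2 of the index
 * choose the zero lane; bits 0 and 1 give the signs of the second and third
 * non-zero lanes relative to the first, which is +1 before the global sign
 * bit flips the whole group. */
static void stq_decode_group(uint8_t idx4, uint8_t sign1, int8_t out[4]) {
    const int zero_lane = (idx4 >> 2) & 3;
    const int rel[3] = { +1, (idx4 & 1) ? -1 : +1, (idx4 & 2) ? -1 : +1 };
    const int g = sign1 ? -1 : +1;
    int k = 0;
    for (int lane = 0; lane < 4; ++lane) {
        out[lane] = (lane == zero_lane) ? 0 : (int8_t)(g * rel[k++]);
    }
}

/* Enumerates all 5-bit codes and counts distinct decoded 4-lane patterns,
 * asserting the one-zero-per-group rule along the way. */
static int stq_count_distinct_patterns(void) {
    uint8_t seen[81] = {0}; /* 3^4 possible ternary 4-tuples */
    int distinct = 0;
    for (int code = 0; code < 32; ++code) {
        int8_t v[4];
        stq_decode_group((uint8_t)(code & 15), (uint8_t)(code >> 4), v);
        int zeros = 0, key = 0;
        for (int i = 0; i < 4; ++i) {
            zeros += (v[i] == 0);
            key = key * 3 + (v[i] + 1); /* base-3 key of the pattern */
        }
        assert(zeros == 1);
        if (!seen[key]) {
            seen[key] = 1;
            ++distinct;
        }
    }
    return distinct;
}
```

On a typical ABI the struct is 42 bytes, so `stq_bits_per_weight()` gives 1.3125, and `stq_count_distinct_patterns()` returns 32: four choices of zero lane times eight sign assignments for the remaining three lanes. Multiplying a decoded lane by the block scale would complete the lookup, sign flip, scale sequence described in the PR.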

sjl623 · 2026-06-05T04:25:03Z

Hi @CISC, I have rebased to latest master. Could you trigger the CI again? Thanks!

devYRPauli · 2026-06-05T13:47:37Z

Thanks for the clarification, that's good to know. If cpu-x64-high-perf was a canceled runner (exit 130) and cpu-arm64-high-perf was the MUL_MAT_HADAMARD failure from #24059, then neither red job points at the kernel, which lines up with the bit-exact result I got on ARM.

One thing that follows from that, though: because both high-perf jobs were canceled or unrelated, the x86/GCC build never actually ran a correctness check over STQ1_0, so the vqtbl2q emulation path there is still untested rather than confirmed good. And that loops back to the same gap, the kernel isn't in the test-backend-ops type lists (all_types / base_types / other_types), so even a fully green high-perf run wouldn't have exercised it. Adding GGML_TYPE_STQ1_0 there (alongside TQ1_0 / TQ2_0) would make the high-perf runners gate the kernel directly, so if there ever is an x86 divergence it surfaces as a real test failure instead of a field report.
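As a rough picture of what such a gate checks, here is a minimal C sketch of a reference-vs-optimized comparison over random inputs, reporting max absolute error and NMSE. It is not test-backend-ops and does not touch ggml: `dot_ref`, `dot_opt`, `check_kernel` and `check_result` are hypothetical names, `dot_opt` is just a 4-way unrolled stand-in for a SIMD kernel, and the random ternary weights here do not enforce the 3:4 sparsity rule.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CHECK_MAX_N 1024

/* Reference: plain int8 dot product with a final float scale. */
static float dot_ref(const int8_t *w, const int8_t *x, int n, float d) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += (int32_t) w[i] * (int32_t) x[i];
    }
    return d * (float) acc;
}

/* Stand-in for an optimized kernel: four independent integer accumulators,
 * loosely mimicking SIMD lanes. Assumes n is a multiple of 4. */
static float dot_opt(const int8_t *w, const int8_t *x, int n, float d) {
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (int i = 0; i < n; i += 4) {
        a0 += (int32_t) w[i + 0] * (int32_t) x[i + 0];
        a1 += (int32_t) w[i + 1] * (int32_t) x[i + 1];
        a2 += (int32_t) w[i + 2] * (int32_t) x[i + 2];
        a3 += (int32_t) w[i + 3] * (int32_t) x[i + 3];
    }
    return d * (float) (a0 + a1 + a2 + a3);
}

typedef struct {
    double max_abs_err; /* largest |opt - ref| over all trials */
    double nmse;        /* sum((opt - ref)^2) / sum(ref^2) */
} check_result;

/* Runs `trials` random comparisons of n-element vectors. */
static check_result check_kernel(int trials, int n, unsigned seed) {
    assert(n > 0 && n <= CHECK_MAX_N && n % 4 == 0);
    int8_t w[CHECK_MAX_N], x[CHECK_MAX_N];
    double err2 = 0.0, ref2 = 0.0, max_abs = 0.0;
    srand(seed);
    for (int t = 0; t < trials; ++t) {
        for (int i = 0; i < n; ++i) {
            w[i] = (int8_t) (rand() % 3 - 1);     /* ternary weight */
            x[i] = (int8_t) (rand() % 255 - 127); /* int8 activation */
        }
        const double r = dot_ref(w, x, n, 0.5f);
        const double o = dot_opt(w, x, n, 0.5f);
        const double e = o > r ? o - r : r - o;
        if (e > max_abs) {
            max_abs = e;
        }
        err2 += e * e;
        ref2 += r * r;
    }
    check_result res = { max_abs, ref2 > 0.0 ? err2 / ref2 : 0.0 };
    return res;
}
```

Because both functions accumulate in integers and apply the same final scale, `check_kernel(4000, 1024, 1u)` reports zero for both metrics, which is the kind of bit-exact result described for the NEON path above. A real gate would swap `dot_opt` for the kernel under test and fail the run when either metric exceeds its threshold.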

Happy to run the rebased branch on M1 once it's up to confirm it stays green on ARM. Thanks for the work on this.

dajia6462-lgtm · 2026-06-06T09:27:52Z

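> Can we integrate Hy-MT with STQ acceleration on Android now?
>
> Thanks for your interest. The answer is YES! Two ways to try it on Android:
>
> Pull our branch and build the examples/llama.android project, then load our public HyMT GGUF weights.
>
> Or just download our demo APK and try it directly: https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit/blob/main/Hy-MT-demo.apk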

I integrated it and tested it on a Snapdragon 870 Android phone. The speed was only 9-13. Is that because the device is too weak, or is there a mistake somewhere?

sjl623 requested review from CISC and ggerganov as code owners May 8, 2026 11:07

github-actions Bot added examples python python script changes ggml changes relating to the ggml tensor library for machine learning labels May 8, 2026

sjl623 changed the title ~~ggml-cpu : add STQ_0 ternary quantization with ARM NEON vec_dot kernel~~ ggml-cpu : add STQ1_0 ternary quantization with ARM NEON vec_dot kernel May 8, 2026

sjl623 closed this May 12, 2026

sjl623 reopened this May 12, 2026

justinchuby mentioned this pull request May 18, 2026

MatMulNBits: support 1.25-bit ternary Sherry / STQ1_0 quantization microsoft/onnxruntime#28549

Open

CISC mentioned this pull request May 19, 2026

CUDA: add STQ1_0 dequantization kernel #23332

Closed

sjl623 force-pushed the STQ_0 branch from 45c4c1c to 2015a7f Compare May 22, 2026 11:05

harrywl mentioned this pull request May 22, 2026

[Feature Request] Better WebGPU / Browser-based Inference Support via llama.cpp WebGPU Backend Tencent-Hunyuan/Hy-MT2#2

Open

github-actions Bot added the testing Everything test related label May 22, 2026

Green-Sky mentioned this pull request May 24, 2026

ggml-cuda : add TQ2_0 kernels, for ternary inference on GPU #11183

Open

sjl623 force-pushed the STQ_0 branch from 2015a7f to 74e4187 Compare May 26, 2026 09:34

Little0o0 mentioned this pull request May 28, 2026

How do I create a Modelfile for ollama? Tencent/AngelSlim#316

Open

qqba mentioned this pull request May 29, 2026

[Feature Request] Support for Tencent Hy-MT2-30B (Requires upstream STQ kernel PR #22836)？ JamePeng/llama-cpp-python#134

Open

sjl623 force-pushed the STQ_0 branch from 74e4187 to 797583a Compare June 1, 2026 03:52

jinlongsong added 3 commits June 5, 2026 11:53

rename STQ_0 to STQ1_0 according to reviewer suggestion

9b898c1

ggml-cpu : fix STQ1_0 CI failures

f8b355a

sjl623 force-pushed the STQ_0 branch from 797583a to f8b355a Compare June 5, 2026 03:56
