CUDA: add STQ1_0 dequantization kernel #23332
1.3125 bpw quantization. Each block of 256 elements stores 64 groups of 4
ternary lanes (-1/0/+1) with the constraint that every group has exactly
one zero and three non-zero lanes of identical magnitude. The ternary
pattern is encoded as a 4-bit codebook index plus a 1-bit global sign,
yielding 32 patterns over 4 lanes (5 bits / 4 lanes = 1.25 bpw payload),
plus a per-block fp16 scale (0.0625 bpw) for 1.3125 bpw total.
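
The pattern count and bit budget described above can be checked with a small C sketch. The bit layout used here (two bits for the zero position, two bits for the relative signs of the remaining lanes, one global sign bit) is an assumption chosen for illustration; the actual codebook ordering in ggml-common.h may differ, and the names `stq_decode_group`, `stq_count_distinct_patterns` and `stq_bits_per_weight` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical decode of one 5-bit group code into 4 ternary lanes.
 * Assumed layout (NOT taken from ggml-common.h):
 *   bits 0-1 : index of the zero lane
 *   bits 2-3 : relative signs of the 2nd and 3rd non-zero lanes
 *   bit  4   : global sign applied to the whole group
 * Bits 0-3 play the role of the 4-bit codebook index. */
static void stq_decode_group(uint8_t code, int8_t lanes[4]) {
    const int zero_pos = code & 3;
    const int rel      = (code >> 2) & 3;
    const int gsign    = (code & 16) ? -1 : 1;
    int k = 0; /* running index among the non-zero lanes */
    for (int i = 0; i < 4; ++i) {
        if (i == zero_pos) {
            lanes[i] = 0;
            continue;
        }
        int s = 1; /* first non-zero lane is +1 before the global sign */
        if (k > 0 && ((rel >> (k - 1)) & 1)) {
            s = -1;
        }
        lanes[i] = (int8_t)(gsign * s);
        ++k;
    }
}

/* Counts how many distinct lane patterns the 32 codes produce. */
static int stq_count_distinct_patterns(void) {
    uint8_t seen[81] = {0}; /* 3^4 possible ternary 4-tuples */
    int distinct = 0;
    for (int code = 0; code < 32; ++code) {
        int8_t lanes[4];
        stq_decode_group((uint8_t)code, lanes);
        int key = 0;
        for (int i = 0; i < 4; ++i) {
            key = key * 3 + (lanes[i] + 1); /* map -1/0/+1 to 0/1/2 */
        }
        if (!seen[key]) {
            seen[key] = 1;
            ++distinct;
        }
    }
    return distinct;
}

/* Bits per weight for a 256-element block: 64 groups x 5 bits + fp16 scale. */
static double stq_bits_per_weight(void) {
    const double payload_bits = 64.0 * 5.0; /* 320 */
    const double scale_bits   = 16.0;       /* one fp16 per block */
    return (payload_bits + scale_bits) / 256.0;
}
```

Under this assumed layout the 32 codes cover every pattern with exactly one zero lane (4 zero positions times 8 sign combinations), and the block budget works out to (320 + 16) / 256 = 1.3125 bpw.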
Components:
- block_stq_0 layout and codebook in ggml-common.h
- reference quantize/dequantize/validate in ggml-quants.{h,c}
- generic CPU vec_dot in ggml-cpu/quants.{h,c}
- ARM NEON vec_dot using vqtbl2q for codebook lookup, vdotq_s32 for
accumulation, plus vld4q-based in-place repack of Q8_K activations
- enum slots: GGML_TYPE_STQ_0 = 42, LLAMA_FTYPE_MOSTLY_STQ_0 = 41,
GGMLQuantizationType.STQ_0 = 42, LlamaFileType.MOSTLY_STQ_0 = 41
- llama-quantize CLI option "STQ_0"
|
Hi @antrc2, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
|
What is STQ1_0 and how is it different from TQ1_0? |
|
STQ1_0 is the custom ternary quantization format used by Tencent's Sherry/AngelSlim models, such as Hy-MT1.5-1.8B-1.25bit. Compared to TQ1_0, STQ1_0 uses structured ternary sparsification designed for SIMD/mobile efficiency. In the Sherry paper, weights are grouped in a 3:4 sparsity pattern: in each group of 4, 3 weights are stored as {-1, +1} and 1 weight is zeroed, achieving ~1.25 bpw effective compression. TQ1_0 is a generic ternary quantization format in ggml, while STQ1_0 is designed specifically around the Sherry packing/layout used by Tencent models. This repo can currently quantize BF16 -> STQ1_0, but CUDA kernels were missing, which caused heavy CPU fallback during inference. This PR adds CUDA kernel support for the STQ1_0 inference paths. |
|
As is clearly laid out in the llama.cpp contributing guidelines:
|
|
#22836 you can leave this as draft until that one is merged if you wish. |
Adds CUDA kernels for STQ1_0 quantized tensors.
This enables GPU acceleration for STQ1_0 inference paths,
reducing CPU fallback during matmul/dequant operations.
Key changes:
Tested with: