CUDA: add STQ1_0 dequantization kernel by antrc2 · Pull Request #23332 · ggml-org/llama.cpp

antrc2 · 2026-05-19T09:35:28Z

Adds CUDA kernels for STQ1_0 quantized tensors.

This enables GPU acceleration for STQ1_0 inference paths,
reducing CPU fallback during matmul/dequant operations.

Key changes:

- add STQ1_0 CUDA dequant kernel
- integrate STQ1_0 into CUDA quant dispatch
- enable GPU execution for STQ1_0 tensor ops

Tested with:

- Hy-MT1.5-1.8B-STQ1_0
- CUDA backend

1.3125 bpw quantization. Each block of 256 elements stores 64 groups of 4 ternary lanes (-1/0/+1), with the constraint that every group has exactly one zero and three non-zero lanes of identical magnitude. The ternary pattern is encoded as a 4-bit codebook index plus a 1-bit global sign, yielding 32 patterns over 4 lanes (5 bits / 4 lanes = 1.25 bpw payload), plus a per-block fp16 scale (0.0625 bpw) for 1.3125 bpw total.

Components:

- block_stq_0 layout and codebook in ggml-common.h
- reference quantize/dequantize/validate in ggml-quants.{h,c}
- generic CPU vec_dot in ggml-cpu/quants.{h,c}
- ARM NEON vec_dot using vqtbl2q for codebook lookup, vdotq_s32 for accumulation, plus vld4q-based in-place repack of Q8_K activations
- enum slots: GGML_TYPE_STQ_0 = 42, LLAMA_FTYPE_MOSTLY_STQ_0 = 41, GGMLQuantizationType.STQ_0 = 42, LlamaFileType.MOSTLY_STQ_0 = 41
- llama-quantize CLI option "STQ_0"
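The bit budget and the 32-pattern count described above can be checked with a small host-side sketch. All names here are hypothetical, and the bit assignment inside the 4-bit index is an assumption made for illustration; the actual codebook in ggml-common.h may order its entries differently.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical constants taken from the description (names are illustrative).
constexpr int kBlockElems    = 256;                          // elements per block
constexpr int kLanesPerGroup = 4;                            // ternary lanes per group
constexpr int kGroups        = kBlockElems / kLanesPerGroup; // 64 groups
constexpr int kBitsPerGroup  = 5;                            // 4-bit index + 1-bit sign
constexpr int kScaleBits     = 16;                           // one fp16 scale per block

constexpr int kPayloadBits = kGroups * kBitsPerGroup;        // 320 bits
constexpr int kBlockBits   = kPayloadBits + kScaleBits;      // 336 bits

constexpr double bits_per_weight() {
    return static_cast<double>(kBlockBits) / kBlockElems;
}

// ASSUMED bit assignment (not taken from ggml-common.h):
//   idx bits 3..2 : position of the single zero lane
//   idx bits 1..0 : signs of the 2nd and 3rd non-zero lanes relative to the 1st
//   sign          : global flip applied to all three non-zero lanes
std::array<int8_t, 4> decode_group(unsigned idx, unsigned sign) {
    assert(idx < 16 && sign < 2);
    const unsigned zero_pos = (idx >> 2) & 3u;
    const int rel[3] = {1, (idx & 1u) ? -1 : 1, (idx & 2u) ? -1 : 1};
    const int global = sign ? -1 : 1;

    std::array<int8_t, 4> lanes{};
    int k = 0; // index into the three non-zero lanes
    for (unsigned lane = 0; lane < 4; ++lane) {
        if (lane == zero_pos) {
            lanes[lane] = 0;
        } else {
            lanes[lane] = static_cast<int8_t>(global * rel[k++]);
        }
    }
    return lanes;
}

// Enumerates all (idx, sign) pairs and counts the distinct lane patterns.
int count_distinct_patterns() {
    std::set<std::array<int8_t, 4>> seen;
    for (unsigned idx = 0; idx < 16; ++idx) {
        for (unsigned sign = 0; sign < 2; ++sign) {
            seen.insert(decode_group(idx, sign));
        }
    }
    return static_cast<int>(seen.size());
}
```

Under this assumed encoding, 4 zero positions times 8 sign combinations give the 32 patterns, and 320 payload bits plus 16 scale bits over 256 elements give 1.3125 bpw. A dequantizer would multiply each decoded lane by the per-block scale.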

ggml-gh-bot · 2026-05-19T09:40:16Z

Hi @antrc2, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple backend changes in one PR: When adding support for a new model or feature, focus on CPU support only in the initial PR. Add support for other backends like CUDA in follow-up PRs. If you have a good reason to modify multiple backends in one PR, please explain it.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

pwilkin · 2026-05-19T10:01:45Z

What is STQ1_0 and how is it different from TQ1_0?

antrc2 · 2026-05-19T10:06:04Z

STQ1_0 is the custom ternary quantization format used by Tencent's Sherry/AngelSlim models, such as Hy-MT1.5-1.8B-1.25bit.

Compared to TQ1_0, STQ1_0 uses structured ternary sparsification designed for SIMD/mobile efficiency. In the Sherry paper, weights are grouped in a 3:4 sparsity pattern where 3 weights are stored as {-1, +1} and 1 weight is zeroed, achieving ~1.25 bpw effective compression.
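As a rough illustration of the 3:4 pattern, the sketch below ternarizes one group of four weights. The function name is hypothetical, and the selection rule (zero the lane with the smallest magnitude) is an assumption made for this example; the Sherry paper and the reference quantizer in this repo may choose the zeroed lane differently.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical helper: maps 4 float weights to one zero lane and three +/-1 lanes.
// ASSUMPTION: the lane with the smallest magnitude is the one that gets zeroed.
std::array<int8_t, 4> ternarize_group_3of4(const std::array<float, 4>& w) {
    // Find the lane with the smallest absolute value (first one wins on ties).
    int zero_lane = 0;
    for (int i = 1; i < 4; ++i) {
        if (std::fabs(w[i]) < std::fabs(w[zero_lane])) {
            zero_lane = i;
        }
    }

    std::array<int8_t, 4> q{}; // value-initialized to all zeros
    for (int i = 0; i < 4; ++i) {
        if (i == zero_lane) {
            continue; // this lane stays 0
        }
        // Keep only the sign; an exact 0.0 maps to +1 so that the group
        // always has three non-zero lanes.
        q[i] = (w[i] < 0.0f) ? -1 : 1;
    }
    return q;
}
```

For example, the group {0.5, -0.1, -0.7, 0.3} becomes {+1, 0, -1, +1} under this rule. Because each group then has exactly one zero and three signs, it fits in 5 bits, which is where the ~1.25 bpw figure comes from.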

TQ1_0 is a generic ternary quantization format in ggml, while STQ1_0 is specifically designed around the Sherry packing/layout used by Tencent models.

This repo currently supports quantizing BF16 -> STQ1_0, but CUDA kernels were missing, causing heavy CPU fallback during inference. This PR adds CUDA kernel support for STQ1_0 inference paths.

JohannesGaessler · 2026-05-19T10:16:57Z

As is clearly laid out in the llama.cpp contributing guidelines:

When adding support for a new model or feature, focus on CPU support only in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs

CISC · 2026-05-19T10:42:28Z

See #22836. You can leave this as a draft until that one is merged, if you wish.

jinlongsong and others added 4 commits May 7, 2026 15:13

Merge branch 'master' into STQ_0

5fe0b67

rename STQ_0 to STQ1_0 according to reviewer suggestion

7ef6976

add CUDA Kernel for quanti STQ1_0

ba9f3d1

antrc2 requested review from a team, CISC and ggerganov as code owners May 19, 2026 09:35

github-actions Bot added the Nvidia GPU (Issues specific to Nvidia GPUs), examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels May 19, 2026

antrc2 added 2 commits May 22, 2026 03:17

fix cuda kernel

dcb3af0

fix cuda kernel

144e8bb

antrc2 marked this pull request as draft May 22, 2026 04:30

antrc2 closed this May 22, 2026

antrc2 wants to merge 6 commits into ggml-org:master from antrc2:feature/stq1_0_cuda_kernel
