CUDA: add STQ1_0 dequantization kernel #23332

Closed
antrc2 wants to merge 6 commits into ggml-org:master from antrc2:feature/stq1_0_cuda_kernel

@antrc2 antrc2 commented May 19, 2026

Adds CUDA kernels for STQ1_0 quantized tensors.

This enables GPU acceleration for STQ1_0 inference paths,
reducing CPU fallback during matmul/dequant operations.

Key changes:

  • add STQ1_0 CUDA dequant kernel
  • integrate STQ1_0 into CUDA quant dispatch
  • enable GPU execution for STQ1_0 tensor ops

Tested with:

  • Hy-MT1.5-1.8B-STQ1_0
  • CUDA backend

jinlongsong and others added 4 commits May 7, 2026 15:13
1.3125 bpw quantization. Each block of 256 elements stores 64 groups of 4
ternary lanes (-1/0/+1) with the constraint that every group has exactly
one zero and three non-zero lanes of identical magnitude. The ternary
pattern is encoded as a 4-bit codebook index plus a 1-bit global sign,
yielding 32 patterns over 4 lanes (5 bits / 4 lanes = 1.25 bpw payload),
plus a per-block fp16 scale (0.0625 bpw) for 1.3125 bpw total.
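
The pattern count and bit budget above can be checked with a small sketch. The bit assignment inside the 4-bit index used here is purely hypothetical (it is not taken from the codebook in ggml-common.h), and the names `stq_decode_group`, `stq_count_patterns` and `stq_bits_per_weight` are illustrative only. The point is just that "one zero lane, three signed lanes" gives 4 x 8 = 32 patterns, which fit in 4 + 1 bits, and that 64 groups x 5 bits plus a 16-bit scale over 256 elements comes to 1.3125 bpw.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical codebook ordering (an assumption, NOT the real table):
 *   bits 3..2 of idx = position of the zero lane,
 *   bits 1..0 of idx = signs of the 2nd and 3rd non-zero lanes,
 *   the 1st non-zero lane is +1 before the global sign is applied. */
static void stq_decode_group(uint8_t idx, uint8_t sign, int8_t out[4]) {
    assert(idx < 16 && sign < 2);
    const int    zero_pos = idx >> 2;
    const int8_t g        = sign ? -1 : 1; /* 1-bit global sign */
    int k = 0;                             /* non-zero lanes emitted so far */
    for (int lane = 0; lane < 4; ++lane) {
        if (lane == zero_pos) { out[lane] = 0; continue; }
        int8_t s = 1;
        if (k == 1) s = (idx & 2) ? -1 : 1;
        if (k == 2) s = (idx & 1) ? -1 : 1;
        out[lane] = (int8_t)(g * s);
        ++k;
    }
}

/* Count the distinct ternary 4-tuples reachable from the 5-bit code. */
static int stq_count_patterns(void) {
    uint8_t seen[81] = {0}; /* 3^4 possible ternary 4-tuples */
    int n = 0;
    for (int code = 0; code < 32; ++code) {
        int8_t v[4];
        stq_decode_group((uint8_t)(code >> 1), (uint8_t)(code & 1), v);
        int key = 0;
        for (int i = 0; i < 4; ++i) key = key * 3 + (v[i] + 1);
        if (!seen[key]) { seen[key] = 1; ++n; }
    }
    return n;
}

/* Bits per weight for one 256-element block:
 * 64 groups x 5 bits of payload + one 16-bit (fp16) scale. */
static double stq_bits_per_weight(void) {
    return (64 * 5 + 16) / 256.0;
}
```

Under this assumed ordering every 5-bit code maps to a different pattern, so none of the 32 codes is wasted.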

Components:
- block_stq_0 layout and codebook in ggml-common.h
- reference quantize/dequantize/validate in ggml-quants.{h,c}
- generic CPU vec_dot in ggml-cpu/quants.{h,c}
- ARM NEON vec_dot using vqtbl2q for codebook lookup, vdotq_s32 for
  accumulation, plus vld4q-based in-place repack of Q8_K activations
- enum slots: GGML_TYPE_STQ_0 = 42, LLAMA_FTYPE_MOSTLY_STQ_0 = 41,
  GGMLQuantizationType.STQ_0 = 42, LlamaFileType.MOSTLY_STQ_0 = 41
- llama-quantize CLI option "STQ_0"
@antrc2 antrc2 requested review from a team, CISC and ggerganov as code owners May 19, 2026 09:35
@ggml-gh-bot

ggml-gh-bot Bot commented May 19, 2026

Hi @antrc2, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple backend changes in one PR: When adding support for a new model or feature, focus on CPU support only in the initial PR. Add support for other backends like CUDA in follow-up PRs. If you have a good reason to modify multiple backends in one PR, please explain it.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@pwilkin
Member

pwilkin commented May 19, 2026

What is STQ1_0 and how is it different from TQ1_0?

@antrc2
Author

antrc2 commented May 19, 2026

STQ1_0 is the custom ternary quantization format used by Tencent's Sherry/AngelSlim models, such as Hy-MT1.5-1.8B-1.25bit.

Compared to TQ1_0, STQ1_0 uses structured ternary sparsification designed for SIMD/mobile efficiency. In the Sherry paper, weights are grouped in a 3:4 sparsity pattern: in each group of 4, 3 weights are stored as -1 or +1 and 1 weight is zeroed, achieving ~1.25 bpw effective compression.
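
As a rough illustration of the 3:4 pattern, here is a minimal sketch that projects one group of 4 weights onto it. It assumes the lane with the smallest magnitude is the one that gets zeroed; that selection rule is a guess for illustration, not something confirmed from the Sherry paper or the quantizer in this repo, and `sparsify_3of4` is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical helper: map 4 float weights to the 3:4 ternary pattern.
 * Assumption: the smallest-magnitude lane is zeroed (ties go to the
 * lowest index); the real selection rule may differ. */
static void sparsify_3of4(const float w[4], int8_t q[4]) {
    int   zero_lane = 0;
    float min_abs   = w[0] < 0.0f ? -w[0] : w[0];
    for (int i = 1; i < 4; ++i) {
        const float a = w[i] < 0.0f ? -w[i] : w[i];
        if (a < min_abs) { min_abs = a; zero_lane = i; }
    }
    for (int i = 0; i < 4; ++i) {
        /* Zero lane becomes 0; every other lane keeps only its sign. */
        q[i] = (int8_t)((i == zero_lane) ? 0 : (w[i] < 0.0f ? -1 : 1));
    }
}
```

For example, the group {0.5, -0.1, -0.9, 0.3} would become {+1, 0, -1, +1} under this rule; the shared magnitude would then be carried by the per-block scale.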

TQ1_0 is a generic ternary quantization format in ggml, while STQ1_0 is specifically designed around the Sherry packing/layout used by Tencent models.

This repo currently supports quantizing BF16 -> STQ1_0, but CUDA kernels were missing, causing heavy CPU fallback during inference. This PR adds CUDA kernel support for STQ1_0 inference paths.

@JohannesGaessler
Contributor

As is clearly laid out in the llama.cpp contributing guidelines:

When adding support for a new model or feature, focus on CPU support only in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs

@CISC
Member

CISC commented May 19, 2026

#22836 you can leave this as draft until that one is merged if you wish.

@github-actions github-actions Bot added the Nvidia GPU, examples, python, and ggml labels May 19, 2026
@antrc2 antrc2 marked this pull request as draft May 22, 2026 04:30
@antrc2 antrc2 closed this May 22, 2026