ggml: allow split-mode tensor to use different quantization types. by RedToasty · Pull Request #23225 · ggml-org/llama.cpp

RedToasty · 2026-05-17T17:51:00Z

KV cache quantization previously flattened the Hadamard rotation input to 2D, which loses the split-axis metadata needed by tensor parallelism. This change keeps the rotation path in 4D form so the meta backend can continue to track the split axis.
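To illustrate why the flatten is a problem, here is a toy model of the shape bookkeeping. It does not use ggml at all: `Shape`, `flatten_2d` and `find_split_axis` are hypothetical names made up for this sketch, and the only thing borrowed from ggml is the convention of four extents with index 0 innermost.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model only: these are NOT ggml types or functions.
// Four extents, index 0 is the innermost dimension (ggml's ne[] convention).
using Shape = std::array<int64_t, 4>;

// Hypothetical stand-in for a 2D flatten: keep extent 0, fold extents 1..3 together.
Shape flatten_2d(const Shape & s) {
    return { s[0], s[1] * s[2] * s[3], 1, 1 };
}

// Given the full shape and one device's slice, report which single axis was
// divided by n_dev. Returns -1 if no single axis can be identified.
int find_split_axis(const Shape & full, const Shape & part, int64_t n_dev) {
    assert(n_dev > 0);
    int found = -1;
    for (int i = 0; i < 4; ++i) {
        if (part[i] == full[i]) {
            continue;                      // this axis is not split
        }
        if (found != -1 || part[i] * n_dev != full[i]) {
            return -1;                     // more than one axis differs, or an uneven split
        }
        found = i;
    }
    return found;
}
```

With a full shape of `{128, 8, 32, 1}` and two devices, a slice of `{128, 8, 16, 1}` is identified as split on axis 2 and a slice of `{128, 4, 32, 1}` as split on axis 1. After `flatten_2d`, both slices become `{128, 128, 1, 1}`, so the original axis can no longer be told apart, which is the information the 4D path preserves.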

The asserts should never fire, since the alignment is checked earlier; they are there to guard against breaking changes later on.

Overview

The main purpose of this commit is to fix KV cache quantization when using "--split-mode tensor".

Additional information

Users with multiple GPUs currently cannot combine tensor parallelism with quantized KV cache. This change enables that combination.

In local testing I see a roughly 40% boost in token generation with tensor parallelism, regardless of KV cache quantization level. The branch has been shared on Reddit for slightly wider testing, so far with only positive responses.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - Codex was used to help track down any areas that might be affected by the split mode, as well as to sanity-check changes. It turned out that not many areas actually needed fixing; everything has been reviewed and tested.

KV cache quantization previously flattened the rotation matmul input to 2D, which lost the split-axis metadata needed for tensor parallelism. Keep the rotation path in 4D form and teach the meta backend to propagate the split state for mirrored rotation matrices applied to tensors split on axis 2/3, then remove the init-time guard.
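The propagation rule described in the commit message can be sketched as a small lookup. This is a toy under stated assumptions, not the meta backend's code: `SplitAxis` and `mul_mat_split` are hypothetical names, and every case not named in the commit message is conservatively reported as unknown. The rule relies on the fact that `ggml_mul_mat(a, b)` contracts over dimension 0 and carries dimensions 2 and 3 of `b` through to the result.

```cpp
#include <cassert>

// Hypothetical names for illustration; not the meta backend's real types.
enum class SplitAxis { Unknown = -1, Axis0 = 0, Axis1 = 1, Axis2 = 2, Axis3 = 3, Mirrored = 4 };

// Split state of mul_mat(a, b) in this toy model.
//  - a mirrored (replicated on every device), b split on axis 2 or 3:
//    each device multiplies its own batch slice, so the result keeps b's split axis.
//  - both mirrored: every device computes the same result, so it stays mirrored.
//  - anything else: not modelled here, reported as Unknown.
SplitAxis mul_mat_split(SplitAxis a, SplitAxis b) {
    if (a == SplitAxis::Mirrored && (b == SplitAxis::Axis2 || b == SplitAxis::Axis3)) {
        return b;
    }
    if (a == SplitAxis::Mirrored && b == SplitAxis::Mirrored) {
        return SplitAxis::Mirrored;
    }
    return SplitAxis::Unknown;
}

// Usage: a replicated rotation matrix applied to a tensor split on axis 2.
inline void example() {
    assert(mul_mat_split(SplitAxis::Mirrored, SplitAxis::Axis2) == SplitAxis::Axis2);
}
```

In this model an `Unknown` result is what a downstream check would have to reject, so any combination the rule does not cover surfaces as a hard failure rather than a silently wrong split.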

nifgraup · 2026-05-25T21:21:54Z

Crashes with combinations other than q8_0/q8_0 & q4_0/q4_0:

ggml-backend-meta.cpp:532: GGML_ASSERT(ret.axis != GGML_BACKEND_SPLIT_AXIS_UNKNOWN) failed

arie-s · 2026-05-26T04:18:11Z

Running a similar local patch (same reshape_4d + mirrored+axis-2/3 handler) on 2x RTX 5060 Ti 16GB with Qwen3.6-27B MTP + tensor split.

On the pre-#22616 codebase this gets ~60 tok/s decode with draft-3. After #22616 it drops to ~40 tok/s, both f16 and q8 KV. The stc_compute rotation clears and rebuilds simple_tensors every graph_compute, and MTP triggers multiple rebuilds per token. Layer-split MoE is unaffected (actually +7% from the MTP fixes in master).

Flagging in case it's useful for review. The reshape_4d fix works, the perf issue is in how #22616 handles external view tensor lifetimes with MTP + tensor split.

RedToasty · 2026-05-31T07:02:33Z

There do indeed seem to be issues with non-matching pairs. It almost feels worthwhile to just shift the assert to catch that case, as the speedup from tensor split is huge, especially with the new memory fixes in there. I'm going to look for a proper fix today.

RedToasty · 2026-05-31T08:46:42Z

This seems no longer relevant given changes like #23792, which properly rework the entire area.

RedToasty requested review from CISC, JohannesGaessler and ggerganov as code owners May 17, 2026 17:51

RedToasty changed the title ~~Allow split-mode tensor to use different quantization types.~~ ggml: allow split-mode tensor to use different quantization types. May 17, 2026

github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label May 17, 2026

nifgraup mentioned this pull request May 27, 2026

TP: quantized KV cache support #23792

Merged

RedToasty closed this May 31, 2026

RedToasty deleted the tests/fix_tensor_split_quants branch May 31, 2026 08:46

RedToasty wants to merge 1 commit into ggml-org:master from RedToasty:tests/fix_tensor_split_quants
