ggml: allow split-mode tensor to use different quantization types. #23225

Closed
RedToasty wants to merge 1 commit into ggml-org:master from RedToasty:tests/fix_tensor_split_quants

Conversation

@RedToasty

KV cache quantization previously flattened the Hadamard rotation input to 2D, which loses the split-axis metadata needed by tensor parallelism. This change keeps the rotation path in 4D form so the meta backend can continue to track the split axis.

The asserts should never fire, since the alignment is checked earlier; they are there to catch breaking changes later on.
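To illustrate why flattening loses the split axis, here is a toy model of the shape bookkeeping. It is only a sketch: `ToyTensor`, `flatten_2d`, `rotate_4d`, `n_rot` and `SPLIT_AXIS_UNKNOWN` are hypothetical names invented for this example, not the ggml or meta backend API, and the alignment assert only stands in for whatever the real asserts check.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy stand-in for tensor shape metadata. This is NOT the real ggml_tensor.
constexpr int SPLIT_AXIS_UNKNOWN = -1;

struct ToyTensor {
    std::array<int64_t, 4> ne; // elements per dimension, ne[0] is innermost
    int split_axis;            // axis sharded across devices, or SPLIT_AXIS_UNKNOWN
};

// Flatten to 2D: keep ne[0] and fold dimensions 1..3 into a single dimension.
// Once axes 1..3 are merged, a split on any of them can no longer be named,
// so only a split on axis 0 survives in this toy model.
ToyTensor flatten_2d(const ToyTensor & t) {
    ToyTensor r{{t.ne[0], t.ne[1] * t.ne[2] * t.ne[3], 1, 1}, SPLIT_AXIS_UNKNOWN};
    if (t.split_axis == 0) {
        r.split_axis = 0;
    }
    return r;
}

// Apply a replicated ("mirrored") rotation along axis 0 while keeping the 4D shape.
// The rotation only mixes elements within ne[0], so a split on axis 2 or 3
// passes through unchanged; other cases are reported as unknown here.
ToyTensor rotate_4d(const ToyTensor & t, int64_t n_rot) {
    // Illustrative alignment check (an assumption about what "alignment" means).
    assert(n_rot > 0 && t.ne[0] % n_rot == 0);
    ToyTensor r = t;
    if (t.split_axis != 2 && t.split_axis != 3) {
        r.split_axis = SPLIT_AXIS_UNKNOWN;
    }
    return r;
}
```

For a tensor of shape 128 x 8 x 4 x 1 split on axis 2, `flatten_2d` reports an unknown split axis, while `rotate_4d` returns the same shape with the split still on axis 2.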

Overview

The main purpose of this commit is to make KV cache quantisation work when running with "--split-mode tensor".

Additional information

Users with multiple GPUs currently cannot combine tensor parallelism with quantized KV cache. This change enables that combination.

In local testing I see a roughly 40% boost in token generation speed with tensor parallelism, regardless of KV cache quantization level. The branch has been shared on Reddit for slightly wider testing, so far with only positive responses.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - Codex was used to help track down areas that might be affected by the split mode and to sanity-check the changes. Not many areas actually needed fixing; everything has been reviewed and tested.

KV cache quantization previously flattened the rotation matmul input to 2D, which lost the split-axis metadata needed for tensor parallelism. Keep the rotation path in 4D form and teach the meta backend to propagate the split state for mirrored rotation matrices applied to tensors split on axis 2/3, then remove the init-time guard.
@RedToasty RedToasty changed the title from "Allow split-mode tensor to use different quantization types." to "ggml: allow split-mode tensor to use different quantization types." May 17, 2026
@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label May 17, 2026
@nifgraup

nifgraup commented May 25, 2026

Crashes with combinations other than q8_0/q8_0 & q4_0/q4_0:

ggml-backend-meta.cpp:532: GGML_ASSERT(ret.axis != GGML_BACKEND_SPLIT_AXIS_UNKNOWN) failed

@arie-s

arie-s commented May 26, 2026

Running a similar local patch (same reshape_4d + mirrored+axis-2/3 handler) on 2x RTX 5060 Ti 16GB with Qwen3.6-27B MTP + tensor split.

On the pre-#22616 codebase this gets ~60 tok/s decode with draft-3. After #22616 it drops to ~40 tok/s, both f16 and q8 KV. The stc_compute rotation clears and rebuilds simple_tensors every graph_compute, and MTP triggers multiple rebuilds per token. Layer-split MoE is unaffected (actually +7% from the MTP fixes in master).

Flagging in case it's useful for review. The reshape_4d fix works, the perf issue is in how #22616 handles external view tensor lifetimes with MTP + tensor split.

@RedToasty
Author

There do indeed seem to be issues with non-matching pairs. It almost feels worthwhile to just shift the assert to catch that case, as the speedup from tensor split is huge, especially with the new memory fixes in there. I'm going to look for a proper fix today.
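A guard of the kind described above might look roughly like the following. This is a hedged sketch only: `ToyCacheType`, `ToySplitMode` and `kv_types_supported` are hypothetical names made up for illustration, not the llama.cpp or ggml API, and the rule "K and V types must match under tensor split" is an assumption drawn from the crash report in this thread.

```cpp
// Toy enums for illustration only; they do not mirror the real ggml/llama.cpp types.
enum class ToyCacheType { F16, Q8_0, Q4_0, Q5_1 };
enum class ToySplitMode { None, Layer, Row, Tensor };

// Returns true if this sketch would accept the K/V cache type pair.
// Assumption: with tensor split, only matching K and V types are accepted;
// every other split mode is left unrestricted.
bool kv_types_supported(ToySplitMode mode, ToyCacheType type_k, ToyCacheType type_v) {
    if (mode != ToySplitMode::Tensor) {
        return true;
    }
    return type_k == type_v;
}
```

Checking this once at init time would turn the runtime assert failure into an early, explicit rejection of mixed pairs such as q8_0 for K with q4_0 for V.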

@RedToasty RedToasty closed this May 31, 2026
@RedToasty
Author

This seems irrelevant now that changes like #23792 properly rework the entire area.

@RedToasty RedToasty deleted the tests/fix_tensor_split_quants branch May 31, 2026 08:46