ggml: allow split-mode tensor to use different quantization types. #23225
RedToasty wants to merge 1 commit into
KV cache quantization previously flattened the rotation matmul input to 2D, which lost the split-axis metadata needed for tensor parallelism. Keep the rotation path in 4D form and teach the meta backend to propagate the split state for mirrored rotation matrices applied to tensors split on axis 2/3, then remove the init-time guard.
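As a rough illustration of why a mirrored rotation can keep the split state: if the rotation only contracts over axis 0, then rotating each shard of a tensor split on axis 2 gives the same result as rotating the whole tensor and splitting afterwards. The sketch below is a toy model in pure Python, not ggml code. The helper names (`hadamard`, `rotate`, `split_axis2`), the nested-list layout, and the unnormalized Hadamard matrix are all assumptions made for the example.

```python
# Toy model, not ggml code. Tensors are nested lists indexed [i2][i1][i0],
# so the innermost list is axis 0, the axis the rotation contracts over.

def hadamard(n):
    """Sylvester construction of an unnormalized n x n Hadamard matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def rotate(tensor, rot):
    """Multiply every axis-0 vector by rot; axes 1 and 2 are left untouched."""
    return [[[sum(r * x for r, x in zip(rot_row, vec)) for rot_row in rot]
             for vec in plane]
            for plane in tensor]

def split_axis2(tensor, n_dev):
    """Cut the tensor into n_dev equal contiguous shards along axis 2."""
    step = len(tensor) // n_dev
    return [tensor[d * step:(d + 1) * step] for d in range(n_dev)]

ne0, ne1, ne2 = 4, 3, 2
full = [[[i2 * 100 + i1 * 10 + i0 for i0 in range(ne0)]
         for i1 in range(ne1)]
        for i2 in range(ne2)]
rot = hadamard(ne0)  # the same ("mirrored") matrix is used on every device

rotate_then_split = split_axis2(rotate(full, rot), 2)
split_then_rotate = [rotate(shard, rot) for shard in split_axis2(full, 2)]

# Each device can rotate its own shard and the output is still split on axis 2.
assert rotate_then_split == split_then_rotate
```

The same argument would apply to a split on axis 3, since the rotation never mixes elements across that axis either.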
Crashes with combinations other than q8_0/q8_0 & q4_0/q4_0:
I'm running a similar local patch (same reshape_4d + mirrored/axis-2/3 handler) on 2x RTX 5060 Ti 16GB with Qwen3.6-27B MTP and tensor split. On the pre-#22616 codebase this gets ~60 tok/s decode with draft-3. After #22616 it drops to ~40 tok/s, with both f16 and q8 KV. The stc_compute rotation clears and rebuilds simple_tensors on every graph_compute, and MTP triggers multiple rebuilds per token. Layer-split MoE is unaffected (actually +7% from the MTP fixes in master). Flagging this in case it's useful for review: the reshape_4d fix works, and the perf issue lies in how #22616 handles external view tensor lifetimes with MTP + tensor split.
There do indeed seem to be issues with non-matching pairs. It almost feels worthwhile to just shift the assert to catch that case, as the speedup from tensor split is huge, especially with the new memory fixes in there. I'm going to look for a proper fix today.
This seems irrelevant given changes like #23792, which properly rework the entire area.
KV cache quantization previously flattened the Hadamard rotation input to 2D, which loses the split-axis metadata needed by tensor parallelism. This keeps the rotation path in 4D form so the meta backend can continue to track the split axis.
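A minimal sketch of the 2D-versus-4D point, using toy shape bookkeeping rather than the real meta backend: once axes 1 to 3 are merged into a single axis, a split recorded on axis 2 no longer names an axis that exists, whereas the 4D form carries it through unchanged. `TensorMeta`, `flatten_2d`, `keep_4d`, the example shape, and the rule "drop the split if its axis was merged" are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional, Tuple

@dataclass(frozen=True)
class TensorMeta:
    ne: Tuple[int, ...]        # extents per axis, axis 0 first
    split_axis: Optional[int]  # axis the tensor is sharded on, or None if unknown

def flatten_2d(t: TensorMeta) -> TensorMeta:
    """Collapse axes 1..3 into one axis, as a 2D reshape would."""
    merged = prod(t.ne[1:])
    # Assumed toy rule: only a split on axis 0 survives, because every
    # other axis has been merged away and can no longer be named.
    kept = t.split_axis if t.split_axis == 0 else None
    return TensorMeta((t.ne[0], merged), kept)

def keep_4d(t: TensorMeta) -> TensorMeta:
    """Leave the shape in 4D, so the split axis is carried through as-is."""
    return TensorMeta(t.ne, t.split_axis)

k = TensorMeta(ne=(128, 8, 512, 1), split_axis=2)  # hypothetical shape
flat = flatten_2d(k)
kept = keep_4d(k)

assert flat.ne == (128, 4096) and flat.split_axis is None
assert kept.ne == k.ne and kept.split_axis == 2
```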
The asserts should never fire, as we check the alignment earlier; they are there to guard against breaking changes later on.
Overview
The main purpose of this commit is to make KV cache quantisation work when using "--split-mode tensor".
Additional information
Users with multiple GPUs currently cannot combine tensor parallelism with quantized KV cache. This change enables that combination.
In local testing I see a boost of roughly 40% in token generation with tensor parallelism, regardless of KV cache quantization level. The branch has been shared on Reddit for slightly wider testing, so far with only positive responses.
Requirements