TP: fix delayed AllReduce + zero-sized slices#22489

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:tp-fix-delayed-reduce
Apr 29, 2026
@JohannesGaessler
Contributor

Fixes #22391.

The problem is that k-quants have a block size of 256, while a single expert has a size of 512, so an expert spans only two blocks. With 3+ GPUs, at least one of them therefore ends up with a zero-sized slice. This would normally not be an issue since zero-sized slices are supported: the corresponding nodes are disabled, and the backend participates in the following AllReduce with a zeroed-out buffer in order to receive the results of the other backends. However, the interaction between a zero-sized slice and a delayed AllReduce (used for better MoE performance) does not work correctly. For a delayed AllReduce, the range of disabled nodes needs to be extended; otherwise one of the backends will have garbage data prior to the AllReduce.
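To make the arithmetic concrete, here is a minimal Python sketch of the failure mode. It is not the ggml implementation: the helper names `split_block_aligned` and `all_reduce_sum` are hypothetical, and the even, block-aligned split policy is an assumption chosen only to reproduce the 512 / 256 case described above.

```python
def split_block_aligned(n_elements, block_size, n_devices):
    """Split n_elements across devices in whole quantization blocks.

    Assumed policy (not taken from ggml): blocks are distributed as evenly
    as possible, with the first devices receiving the remainder.
    """
    assert n_elements % block_size == 0
    n_blocks = n_elements // block_size
    base, extra = divmod(n_blocks, n_devices)
    return [(base + (1 if i < extra else 0)) * block_size
            for i in range(n_devices)]


def all_reduce_sum(buffers):
    """Toy AllReduce: every device receives the element-wise sum."""
    total = [sum(values) for values in zip(*buffers)]
    return [list(total) for _ in buffers]


# One expert of size 512 with k-quant blocks of 256 gives only two blocks,
# so the third GPU receives a zero-sized slice.
sizes = split_block_aligned(512, 256, 3)

# Stand-in partial results: devices with a real slice contribute 1.0.
partial = [1.0, 1.0]
zeroed = [0.0, 0.0]    # what the zero-sized backend should contribute
stale = [7.0, -3.0]    # made-up garbage left in an unzeroed buffer

# Zero-sized backend contributes zeros, so the sum is unaffected.
correct = all_reduce_sum([partial, partial, zeroed])

# Zero-sized backend contributes stale data, so every device is corrupted.
broken = all_reduce_sum([partial, partial, stale])

print(sizes, correct[0], broken[0])
```

Because an AllReduce sums the buffers of all participants, a single backend holding garbage is enough to corrupt the result on every GPU, which is consistent with the degraded perplexity reported for `-sm tensor` on master.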

Using 3x RTX 4090, the Qwen 3.6 q4_K_M perplexity (PPL) on the first 512 tokens of Wikitext is 4.1590 with -sm layer, 8.3604 with -sm tensor on master, and 4.1554 with -sm tensor with this PR.

@rankaiyx
Contributor

Tested and verified. The issue is resolved.

@JohannesGaessler JohannesGaessler merged commit 739393b into ggml-org:master Apr 29, 2026
45 of 46 checks passed
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
carlosfundora pushed a commit to carlosfundora/llama.cpp-1-bit-turbo that referenced this pull request May 24, 2026
winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
Labels

ggml changes relating to the ggml tensor library for machine learning

Development

Successfully merging this pull request may close these issues.

Misc. bug: Tensor parallelism on 3+ GPUs causes infinite output (////////////////////////...) with Qwen3.6-35B-A3B models

4 participants