TP: fix delayed AllReduce + zero-sized slices by JohannesGaessler · Pull Request #22489 · ggml-org/llama.cpp

JohannesGaessler · 2026-04-28T22:20:15Z

The problem is that k-quants have a block size of 256, while a single expert has a size of 512. With 3 or more GPUs, at least one of them therefore ends up with a zero-sized slice. Normally this is not an issue because zero-sized slices are supported: the corresponding nodes are disabled and the backend participates in the following AllReduce with a zeroed-out buffer in order to receive the results of the other backends. However, the combination of a zero-sized slice and the delayed AllReduce used for better MoE performance does not work correctly. In that case the range of disabled nodes needs to be extended; otherwise one of the backends will hold garbage data prior to the AllReduce.
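
To illustrate the arithmetic and the failure mode described above, here is a minimal sketch in plain Python. It is not the ggml implementation: `split_blocks` and `all_reduce_sum` are hypothetical helpers, the even block distribution is an assumption, and the buffer values are made up. It only shows that 512 elements in blocks of 256 cannot be split across 3 backends without an empty slice, and that an AllReduce is only correct if the backend with the empty slice contributes zeros.

```python
def split_blocks(n_elements, block_size, n_backends):
    """Hypothetical helper: split n_elements into per-backend slice sizes
    that are whole multiples of block_size (assumed even distribution)."""
    assert n_elements % block_size == 0
    n_blocks = n_elements // block_size
    base, extra = divmod(n_blocks, n_backends)
    return [(base + (1 if i < extra else 0)) * block_size
            for i in range(n_backends)]


def all_reduce_sum(buffers):
    """Hypothetical helper: element-wise sum across backends.
    Every backend receives a copy of the same reduced result."""
    total = [sum(values) for values in zip(*buffers)]
    return [list(total) for _ in buffers]


# An expert of size 512 with a quantization block size of 256 has only 2 blocks.
slices_2_gpus = split_blocks(512, 256, 2)  # both backends get one block
slices_3_gpus = split_blocks(512, 256, 3)  # the third backend gets nothing

# Partial results of the two backends that own a non-empty slice (made-up values).
partials = [[1.0, 2.0], [3.0, 4.0]]

# Backend with the zero-sized slice: nodes disabled, buffer zeroed out.
zeroed = [0.0, 0.0]
# Same backend if its buffer was written by nodes that should have been disabled.
garbage = [99.0, -7.0]

reduced_ok = all_reduce_sum(partials + [zeroed])
reduced_bad = all_reduce_sum(partials + [garbage])

print(slices_2_gpus, slices_3_gpus)
print(reduced_ok[2], reduced_bad[2])
```

In this sketch the backend with the empty slice still receives the full result as long as its own contribution is zero; any leftover data in its buffer corrupts the sum on every backend, which is consistent with the PPL regression reported below for -sm tensor on master.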

With 3x RTX 4090, the Qwen 3.6 q4_K_M PPL on the first 512 tokens of Wikitext is 4.1590 for -sm layer, 8.3604 for -sm tensor on master, and 4.1554 for -sm tensor with this PR.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: No

rankaiyx · 2026-04-29T00:55:30Z

Tested and verified. The issue is resolved.

(cherry picked from commit 739393b)

TP: fix delayed AllReduce + zero-sized slices

dce2acf

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Apr 28, 2026

JohannesGaessler mentioned this pull request Apr 28, 2026

Misc. bug: Tensor parallelism on 3+ GPUs causes infinite output (////////////////////////...) with Qwen3.6-35B-A3B models #22391

Closed

am17an approved these changes Apr 28, 2026

gaugarg-nv approved these changes Apr 29, 2026

JohannesGaessler merged commit 739393b into ggml-org:master Apr 29, 2026
45 of 46 checks passed

rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026

TP: fix delayed AllReduce + zero-sized slices (ggml-org#22489)

2e67fee

samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026

TP: fix delayed AllReduce + zero-sized slices (ggml-org#22489)

79c2881

ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026

TP: fix delayed AllReduce + zero-sized slices (ggml-org#22489)

c10c0ec

meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026

TP: fix delayed AllReduce + zero-sized slices (ggml-org#22489)

a3743b8

edmcman mentioned this pull request May 19, 2026

Eval bug: Qwen 3.5 weird behavior... #21239

Closed

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026

TP: fix delayed AllReduce + zero-sized slices (ggml-org#22489)

b3ddd82

carlosfundora pushed a commit to carlosfundora/llama.cpp-1-bit-turbo that referenced this pull request May 24, 2026

TP: fix delayed AllReduce + zero-sized slices (ggml-org#22489)

2107136

(cherry picked from commit 739393b)

winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026

TP: fix delayed AllReduce + zero-sized slices (ggml-org#22489)

bada981

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

TP: fix delayed AllReduce + zero-sized slices (ggml-org#22489)

586d73b

JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:tp-fix-delayed-reduce

4 participants
