llama : disable graph reuse with pipeline parallelism#20463
Merged
Conversation
Force-pushed from 7bc73ae to dfa3ad1
JohannesGaessler approved these changes on Mar 12, 2026
ORippler approved these changes on Mar 12, 2026
ORippler (Collaborator) left a comment:

Let's take the time to understand what's going on here.
Hi, starting from this version there is a substantial regression, so I am currently sticking with the last known functional version prior to this change. I have not yet tested other models to see whether this issue is isolated to this specific architecture or configuration.
The following repro demonstrates the issue:
```shell
make -j && ./bin/llama-perplexity -hf ggml-org/Qwen3-0.6B-GGUF -f wiki.test.raw --chunks 16 -ngl 99 -ub 512 -b 2048
```

PPL = 2382.4719 +/- 246.20903

The problem seems to occur when 2 consecutive pp ubatches both output logits, which happens in perplexity runs with the above parameters: 4 ubatches of size 512, where the second 2 ubatches output logits. I think the graph reuse logic somehow conflicts with the scheduler's logic for tracking the current copy:
llama.cpp/ggml/src/ggml-backend.cpp, lines 1775 to 1780 at 557fe2d
For now, graph reuse is disabled when pipeline parallelism is active as a workaround. Proper investigation is necessary.
Additionally, after #17795, disabling graph reuse alone is no longer enough to fix the issue, so that change is also reverted for now.
Note that both commits in this PR are needed. Neither one fixes the issue alone.
cc @aendk @gaugarg-nv