
llama : disable graph reuse with pipeline parallelism #20463

Merged
ggerganov merged 2 commits into master from gg/llama-disable-graph-reuse-with-pp
Mar 12, 2026

Conversation

ggerganov (Member) commented on Mar 12, 2026

The following repro demonstrates the issue:

make -j && ./bin/llama-perplexity -hf ggml-org/Qwen3-0.6B-GGUF -f wiki.test.raw --chunks 16 -ngl 99 -ub 512 -b 2048

PPL = 2382.4719 +/- 246.20903

The problem seems to occur when two consecutive prompt-processing (pp) ubatches both output logits, which happens in perplexity runs with the above parameters: 4 ubatches of size 512, where the second 2 ubatches output logits. I think the graph reuse logic somehow conflicts with the scheduler's logic for tracking the current copy:

GGML_ASSERT(!sched->is_alloc);
// rotate to the next input buffer copy (pipeline parallelism keeps n_copies of the inputs)
sched->cur_copy  = sched->next_copy;
sched->next_copy = (sched->next_copy + 1) % sched->n_copies;
ggml_backend_sched_split_graph(sched, graph);

For now, graph reuse is disabled when pipeline parallelism is active as a workaround. Proper investigation is necessary.


Additionally, after #17795, disabling graph reuse alone is no longer enough to fix the issue, so for now that change is also reverted.

Note that both commits in this PR are needed. Neither one fixes the issue alone.

cc @aendk @gaugarg-nv

ggerganov force-pushed the gg/llama-disable-graph-reuse-with-pp branch from 7bc73ae to dfa3ad1 on March 12, 2026 16:05
github-actions bot added labels on Mar 12, 2026: Nvidia GPU (issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)
ORippler (Collaborator) left a comment:


Let's take the time to understand what's going on here.

@ggerganov ggerganov merged commit 57819b8 into master Mar 12, 2026
48 of 76 checks passed
@ggerganov ggerganov deleted the gg/llama-disable-graph-reuse-with-pp branch March 12, 2026 19:50
Superbobo75 commented:

hi,

Starting from build c8323 of llama.cpp, where graph reuse was disabled, the inference speed of the model Unsloth/Qwen3.5-35B-A3B-Q5_K_M.gguf dropped significantly. On my hardware setup (2x RTX 5060 Ti with 16 GB VRAM, and 64 GB DDR4 RAM), performance fell from approximately 85 tokens per second (t/s) to around 50 t/s.

Due to this substantial regression, I am currently sticking with the last known working version prior to this change. I have not yet tested other models to see whether the issue is isolated to this specific architecture or configuration.
