
llama : disable graph reuse with pipeline parallelism #20463

Merged
ggerganov merged 2 commits into master from gg/llama-disable-graph-reuse-with-pp
Mar 12, 2026

Conversation

ggerganov (Member) commented on Mar 12, 2026

The following repro demonstrates the issue:

make -j && ./bin/llama-perplexity -hf ggml-org/Qwen3-0.6B-GGUF -f wiki.test.raw --chunks 16 -ngl 99 -ub 512 -b 2048

PPL = 2382.4719 +/- 246.20903

The problem seems to occur when two consecutive prompt-processing (pp) ubatches both output logits, which happens in perplexity runs with the above parameters: 4 ubatches of size 512, where the second 2 ubatches output logits. I think the graph reuse logic somehow conflicts with the scheduler's logic for tracking the current copy:

GGML_ASSERT(!sched->is_alloc);
// rotate to the next input buffer copy (pipeline parallelism keeps n_copies of the inputs)
sched->cur_copy  = sched->next_copy;
sched->next_copy = (sched->next_copy + 1) % sched->n_copies;
ggml_backend_sched_split_graph(sched, graph);

For now, graph reuse is disabled when pipeline parallelism is active as a workaround. Proper investigation is necessary.


Additionally, after #17795, disabling graph reuse alone is no longer enough to fix the issue, so for now that change is also reverted.

Note that both commits in this PR are needed. Neither one fixes the issue alone.

cc @aendk @gaugarg-nv

ggerganov force-pushed the gg/llama-disable-graph-reuse-with-pp branch from 7bc73ae to dfa3ad1 on March 12, 2026 16:05
github-actions bot added labels on Mar 12, 2026: Nvidia GPU (issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)
ORippler (Collaborator) left a comment:


Let's take the time to understand what's going on here.

@ggerganov ggerganov merged commit 57819b8 into master Mar 12, 2026
48 of 76 checks passed
@ggerganov ggerganov deleted the gg/llama-disable-graph-reuse-with-pp branch March 12, 2026 19:50
Superbobo75 commented:

hi,

Starting from build c8323 of llama.cpp, where graph reuse was disabled, the inference speed of the model Unsloth/Qwen3.5-35B-A3B-Q5_K_M.gguf dropped significantly. On my hardware setup (2x RTX 5060 Ti with 16 GB VRAM, and 64 GB DDR4 RAM), performance fell from approximately 85 tokens per second (t/s) to around 50 t/s.

Due to this substantial regression, I am currently sticking with the last known working version prior to this change. I have not yet tested other models to see whether the issue is isolated to this specific architecture or configuration.
