
UPSTREAM PR #19042: [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full#1006

Open
loci-dev wants to merge 3 commits into main from upstream-PR19042-branch_gaugarg-nv-pp_perf_improve

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#19042

With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer fills up, stalling the CPU. As a result, not enough work is submitted to the GPU, leaving bubbles in the GPU timeline. This PR fixes the issue by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x, which increases the command buffer size.

The NSight profile below shows the issue in more detail:

[NSight profile: GPU timeline with bubbles while the CPU stalls on a full command buffer]

After setting the environment variable, this is how the new profile looks:

[NSight profile: GPU timeline after setting CUDA_SCALE_LAUNCH_QUEUES=4x]

GPU 0 is busy most of the time, but there are small bubbles on GPU 1. I think the reason is that, for a constant batch size, batch n+1 takes longer than batch n due to causal attention, so GPU 0 working on batch n+1 has more to do than GPU 1 working on batch n. This could be fixed by setting a non-uniform tensor split between the GPUs.

Performance Gains

  • Significant performance improvement of up to 25% in PP throughput for larger models with pipeline parallelism.
  • Smaller but measurable improvement on a single GPU for larger models.
  • The smaller the GPU and the larger the model, the greater the expected benefit from this environment variable.
  • No change in performance for smaller models.
  • No change in decode-phase throughput.
  • No change in VRAM usage.
  • ~120 MB higher system RAM usage per GPU; with two GPUs, system RAM usage increases by ~240 MB.

Pipeline parallelism with 2x RTX Pro 6000 Blackwell GPUs.

Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
|----:|-------------------:|------------------:|---------:|
| 512 | 1682.79 | 1678.15 | 1.00 |
| 1024 | 1884.01 | 2064.24 | 1.10 |
| 2048 | 1948.14 | 2289.02 | 1.17 |
| 4096 | 1841.07 | 2266.42 | 1.23 |
| 8192 | 1563.33 | 1959.12 | 1.25 |

Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
|----:|-------------------:|------------------:|---------:|
| 512 | 11467.79 | 11597.3 | 1.01 |
| 1024 | 14371.86 | 14381.97 | 1.00 |
| 2048 | 15551.36 | 15537.17 | 1.00 |
| 4096 | 14545.61 | 14522.35 | 1.00 |
| 8192 | 11896.39 | 11874.36 | 1.00 |

Single GPU: RTX Pro 6000 Blackwell

Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
|----:|-------------------:|------------------:|---------:|
| 512 | 1656.24 | 1686.89 | 1.02 |
| 1024 | 1567.71 | 1597.92 | 1.02 |
| 2048 | 1455.61 | 1484.2 | 1.02 |
| 4096 | 1235.89 | 1314.54 | 1.06 |
| 8192 | 953.8 | 976.03 | 1.02 |

Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
|----:|-------------------:|------------------:|---------:|
| 512 | 12109.8 | 12031.88 | 0.99 |
| 1024 | 11426.8 | 11426.9 | 1.00 |
| 2048 | 10589.84 | 10594.17 | 1.00 |
| 4096 | 8902.05 | 8909.34 | 1.00 |
| 8192 | 6874.66 | 6872.47 | 1.00 |

@loci-review

loci-review bot commented Jan 23, 2026

Based on the analysis, no functions were identified with meaningful performance changes between the base and target versions. The static analysis and AI-based performance predictions indicate that the code modifications in this version do not measurably affect response time or throughput for any functions in the llama.cpp binaries.

This suggests that the changes between versions are either:

  • Non-performance-affecting code modifications (e.g., refactoring, code cleanup, documentation)
  • Changes to non-critical code paths with negligible execution time
  • Modifications that maintain performance parity with the previous version

No further performance review is warranted for this comparison.

See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

loci-dev force-pushed the main branch 24 times, most recently from edd4e32 to d549af4 on January 27, 2026 06:14
loci-dev force-pushed the main branch 30 times, most recently from 96d29ac to dbad616 on January 31, 2026 05:22