
UPSTREAM PR #19042: [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full#1006

Open
loci-dev wants to merge 3 commits into main from upstream-PR19042-branch_gaugarg-nv-pp_perf_improve

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#19042

With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer fills up, stalling the CPU. As a result, not enough work is submitted to the GPU, leaving bubbles in the GPU timeline. This PR fixes the issue by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x, which increases the command buffer size.

The NSight profile below shows the issue in more detail:

[NSight profile: GPU timeline with bubbles while the CPU stalls on a full command buffer]

After setting the environment variable, this is how the new profile looks:

[NSight profile: GPU timeline after setting CUDA_SCALE_LAUNCH_QUEUES=4x]

GPU 0 is busy most of the time, but there are small bubbles on GPU 1. I think the reason is that, for a constant batch size, batch n+1 takes longer than batch n due to causal attention, so GPU 0 working on batch n+1 has more to do than GPU 1 working on batch n. This could be fixed by setting a non-uniform tensor split between the GPUs.

Performance Gains

  • Significant performance improvement of up to 25% in PP throughput for larger models with pipeline parallelism.
  • Smaller but measurable improvement on a single GPU for larger models.
  • The smaller the GPU and the larger the model, the greater the expected benefit from this environment variable.
  • No change in performance for smaller models.
  • No change in decode-phase throughput.
  • No change in VRAM usage.
  • ~120 MB higher system RAM usage per GPU; with two GPUs, system RAM usage increases by ~240 MB.

Pipeline parallelism with 2x RTX Pro 6000 Blackwell GPUs.

Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
|----:|-------------------:|------------------:|---------:|
| 512 | 1682.79 | 1678.15 | 1.00 |
| 1024 | 1884.01 | 2064.24 | 1.10 |
| 2048 | 1948.14 | 2289.02 | 1.17 |
| 4096 | 1841.07 | 2266.42 | 1.23 |
| 8192 | 1563.33 | 1959.12 | 1.25 |

Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
|----:|-------------------:|------------------:|---------:|
| 512 | 11467.79 | 11597.3 | 1.01 |
| 1024 | 14371.86 | 14381.97 | 1.00 |
| 2048 | 15551.36 | 15537.17 | 1.00 |
| 4096 | 14545.61 | 14522.35 | 1.00 |
| 8192 | 11896.39 | 11874.36 | 1.00 |

Single GPU: RTX Pro 6000 Blackwell

Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
|----:|-------------------:|------------------:|---------:|
| 512 | 1656.24 | 1686.89 | 1.02 |
| 1024 | 1567.71 | 1597.92 | 1.02 |
| 2048 | 1455.61 | 1484.2 | 1.02 |
| 4096 | 1235.89 | 1314.54 | 1.06 |
| 8192 | 953.8 | 976.03 | 1.02 |

Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
|----:|-------------------:|------------------:|---------:|
| 512 | 12109.8 | 12031.88 | 0.99 |
| 1024 | 11426.8 | 11426.9 | 1.00 |
| 2048 | 10589.84 | 10594.17 | 1.00 |
| 4096 | 8902.05 | 8909.34 | 1.00 |
| 8192 | 6874.66 | 6872.47 | 1.00 |

@loci-review

loci-review bot commented Jan 23, 2026

Based on the analysis, no functions were identified with meaningful performance changes between the base and target versions. The static analysis and AI-based performance predictions indicate that the code modifications in this version do not measurably affect response time or throughput for any functions in the llama.cpp binaries.

This suggests that the changes between versions are either:

  • Non-performance-affecting code modifications (e.g., refactoring, code cleanup, documentation)
  • Changes to non-critical code paths with negligible execution time
  • Modifications that maintain performance parity with the previous version

No further performance review is warranted for this comparison.

See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

loci-dev force-pushed the main branch 24 times, most recently from edd4e32 to d549af4 on January 27, 2026 06:14
loci-dev force-pushed the main branch 30 times, most recently from 96d29ac to dbad616 on January 31, 2026 05:22