UPSTREAM PR #19042: [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full#1006
Conversation
With pipeline parallelism, the CPU-side CUDA command buffer fills up during prompt processing and stalls the CPU. As a result, not enough work is submitted to the GPUs, causing bubbles in the GPU timeline. This PR fixes the stall by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x, which increases the command buffer size.
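The variable only takes effect if it is in the environment before the CUDA context is created. Below is a minimal sketch of one way to do that from C++; the helper name is hypothetical and the snippet assumes a POSIX `setenv` (Windows would need `_putenv_s`). It illustrates the idea, not necessarily how this PR implements it:

```cpp
// Sketch only: make sure CUDA_SCALE_LAUNCH_QUEUES is set before the first
// CUDA runtime call, since the value is read at context creation.
#include <cstdlib>

// Hypothetical helper; call before any cudaSetDevice/cudaMalloc/etc.
static void set_cuda_launch_queue_scale() {
    // overwrite = 0 preserves a value the user has already exported,
    // so an explicit user setting still wins over this default.
    setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", /*overwrite=*/0);
}
```

Equivalently, the variable can simply be exported in the shell before launching the binary.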
Based on the analysis, no functions were identified with meaningful performance changes between the base and target versions. The static analysis and AI-based performance predictions indicate that the code modifications in this version do not introduce measurable impacts on response time or throughput for any functions in the llama.cpp binaries.
No further performance review is warranted for this comparison. See the complete breakdown in Version Insights.
Mirrored from ggml-org/llama.cpp#19042
The Nsight profile below shows the issue in more detail:
After setting the environment variable, the new profile looks like this:
GPU 0 is busy for the most part, but there are small bubbles on GPU 1. I think the reason is that, for a constant batch size, batch n+1 takes more time than batch n due to causal attention (the context grows with each batch), so GPU 0 working on batch n+1 has more to do than GPU 1 working on batch n. This could be fixed by setting a non-uniform tensor split between the GPUs, as sketched below.
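For illustration, here is a sketch of biasing the split through the llama.h C API (the same effect is available on the command line via `--tensor-split`). The 55/45 ratio and the helper name are assumptions for this example, not values from the PR:

```cpp
// Sketch, not the PR's code: give GPU 0 a slightly larger share of the
// layers so both GPUs finish their pipeline stages at about the same time.
// The 55/45 ratio is a hypothetical starting point to tune empirically.
#include "llama.h"

llama_model * load_with_uneven_split(const char * model_path) {
    // Per-GPU proportions; llama.cpp normalizes these internally.
    static const float split[2] = {0.55f, 0.45f};

    llama_model_params params = llama_model_default_params();
    params.split_mode   = LLAMA_SPLIT_MODE_LAYER; // layer split enables pipeline parallelism
    params.tensor_split = split;
    params.n_gpu_layers = 999;                    // offload all layers

    return llama_model_load_from_file(model_path, params);
}
```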
Performance Gains

Pipeline parallelism with 2x RTX Pro 6000 Blackwell GPUs:
- Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
- Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

Single GPU (RTX Pro 6000 Blackwell):
- Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
- Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf