
CUDA: better coalesce data-access for contiguous concat #22330

Merged
JohannesGaessler merged 1 commit into ggml-org:master from ORippler:osimons/CUDA_faster_cont_concat on Apr 26, 2026
Conversation

@ORippler ORippler (Collaborator) commented Apr 24, 2026

Overview

Stumbled upon this when looking at an Nsight trace for a hybrid Mamba model.
Nsight Compute revealed uncoalesced data access (especially stores) for typical LLM token-generation workloads, in addition to the kernel launching a lot of CTAs. This PR addresses both points, yielding a 1-3% end-to-end perf gain.

I did not find perf tests for this op in test-backend-ops, so I had no other workloads to check for perf regressions (prompt processing hits the non-contiguous concat_f32_non_cont kernel instead).
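
For context, the two changes are (1) arranging the work so consecutive threads touch consecutive addresses, letting loads and especially stores coalesce, and (2) distributing all output elements evenly over a fixed grid via a grid-stride loop instead of launching one CTA per dim. Below is a minimal CUDA sketch of that pattern; it is not the actual ggml kernel, and the name `concat_f32_cont_sketch` plus the flattened [outer, inner] view of the tensors are assumptions for illustration only.

```cuda
#include <cstdint>

// Sketch only (not the ggml-org/llama.cpp kernel): contiguous concat of two
// f32 tensors viewed as [outer, nelem0] and [outer, nelem1], producing
// [outer, nelem0 + nelem1]. Consecutive threads write consecutive dst
// addresses (coalesced stores), and the grid-stride loop spreads all
// ne_total elements evenly across however many CTAs were launched.
__global__ void concat_f32_cont_sketch(
        const float * __restrict__ src0, const float * __restrict__ src1,
        float * __restrict__ dst,
        const int64_t nelem0,    // inner elements per outer slice from src0
        const int64_t nelem1,    // inner elements per outer slice from src1
        const int64_t ne_total)  // total number of elements in dst
{
    const int64_t slice  = nelem0 + nelem1;
    const int64_t stride = (int64_t) blockDim.x * gridDim.x;

    for (int64_t i = (int64_t) blockIdx.x*blockDim.x + threadIdx.x; i < ne_total; i += stride) {
        const int64_t s = i / slice; // outer slice index
        const int64_t o = i % slice; // offset within the concatenated slice
        dst[i] = o < nelem0 ? src0[s*nelem0 + o] : src1[s*nelem1 + (o - nelem0)];
    }
}

// Hypothetical launch: a fixed, occupancy-sized grid rather than one CTA per dim, e.g.
//   concat_f32_cont_sketch<<<num_sms*cta_per_sm, 256, 0, stream>>>(src0, src1, dst, nelem0, nelem1, ne_total);
```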

Additional information

  • ./scripts/compare-llama-bench.py -b 15fa3c4 -c osimons/CUDA_faster_cont_concat --tool llama-bench -i llama-bench.sqlite
| Model | Test | t/s master | t/s osimons/CUDA_faster_cont_concat | Speedup |
| --- | --- | --- | --- | --- |
| nemotron_h_moe 31B.A3.5B NVFP4 | tg128 | 188.30 | 192.70 | 1.02 |
| nemotron_h_moe 31B.A3.5B NVFP4 | tg128@d32768 | 173.95 | 175.01 | 1.01 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | tg128 | 211.40 | 218.11 | 1.03 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | tg128@d32768 | 199.50 | 204.58 | 1.03 |
| qwen35 27B Q4_K_M | tg128 | 39.42 | 40.08 | 1.02 |
| qwen35 27B Q4_K_M | tg128@d32768 | 35.65 | 36.21 | 1.02 |
| qwen35moe 35B.A3B Q4_K_M | tg128 | 164.40 | 169.44 | 1.03 |
| qwen35moe 35B.A3B Q4_K_M | tg128@d32768 | 143.53 | 147.15 | 1.03 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES; scoped the task for it to solve, then reviewed, verified, and cleaned up the code afterwards

Also, distribute all elements across CTAs evenly instead of launching
one CTA per dim
@ORippler ORippler requested a review from a team as a code owner April 24, 2026 19:20
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 24, 2026
@JohannesGaessler JohannesGaessler merged commit b1a5bd4 into ggml-org:master Apr 26, 2026
45 of 47 checks passed
@ORippler ORippler deleted the osimons/CUDA_faster_cont_concat branch April 27, 2026 09:04
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)

3 participants