
CUDA: better coalesce data-access for contiguous concat #22330

Merged
JohannesGaessler merged 1 commit into ggml-org:master from ORippler:osimons/CUDA_faster_cont_concat on Apr 26, 2026
Conversation

@ORippler ORippler (Collaborator) commented Apr 24, 2026

Overview

Stumbled upon this when looking at an Nsight trace for a hybrid Mamba model.
Nsight Compute revealed uncoalesced data access (especially stores) for typical LLM token-generation workloads, in addition to the kernel launching a lot of CTAs. This PR addresses both points, yielding a 1-3% end-to-end perf gain.

I did not find perf tests for this op in test-backend-ops, so I had no other workloads to check for perf regressions (prompt processing hits the non-contiguous concat_f32_non_cont kernel instead).
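
For context, the two changes are (1) arranging the work so consecutive threads touch consecutive addresses, letting loads and especially stores coalesce, and (2) distributing all output elements evenly over a fixed grid via a grid-stride loop instead of launching one CTA per dim. Below is a minimal CUDA sketch of that pattern; it is not the actual ggml kernel, and the name `concat_f32_cont_sketch` plus the flattened [outer, inner] view of the tensors are assumptions for illustration only.

```cuda
#include <cstdint>

// Sketch only (not the ggml-org/llama.cpp kernel): contiguous concat of two
// f32 tensors viewed as [outer, nelem0] and [outer, nelem1], producing
// [outer, nelem0 + nelem1]. Consecutive threads write consecutive dst
// addresses (coalesced stores), and the grid-stride loop spreads all
// ne_total elements evenly across however many CTAs were launched.
__global__ void concat_f32_cont_sketch(
        const float * __restrict__ src0, const float * __restrict__ src1,
        float * __restrict__ dst,
        const int64_t nelem0,    // inner elements per outer slice from src0
        const int64_t nelem1,    // inner elements per outer slice from src1
        const int64_t ne_total)  // total number of elements in dst
{
    const int64_t slice  = nelem0 + nelem1;
    const int64_t stride = (int64_t) blockDim.x * gridDim.x;

    for (int64_t i = (int64_t) blockIdx.x*blockDim.x + threadIdx.x; i < ne_total; i += stride) {
        const int64_t s = i / slice; // outer slice index
        const int64_t o = i % slice; // offset within the concatenated slice
        dst[i] = o < nelem0 ? src0[s*nelem0 + o] : src1[s*nelem1 + (o - nelem0)];
    }
}

// Hypothetical launch: a fixed, occupancy-sized grid rather than one CTA per dim, e.g.
//   concat_f32_cont_sketch<<<num_sms*cta_per_sm, 256, 0, stream>>>(src0, src1, dst, nelem0, nelem1, ne_total);
```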

Additional information

  • ./scripts/compare-llama-bench.py -b 15fa3c4 -c osimons/CUDA_faster_cont_concat --tool llama-bench -i llama-bench.sqlite
| Model | Test | t/s master | t/s osimons/CUDA_faster_cont_concat | Speedup |
| --- | --- | --- | --- | --- |
| nemotron_h_moe 31B.A3.5B NVFP4 | tg128 | 188.30 | 192.70 | 1.02 |
| nemotron_h_moe 31B.A3.5B NVFP4 | tg128@d32768 | 173.95 | 175.01 | 1.01 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | tg128 | 211.40 | 218.11 | 1.03 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | tg128@d32768 | 199.50 | 204.58 | 1.03 |
| qwen35 27B Q4_K_M | tg128 | 39.42 | 40.08 | 1.02 |
| qwen35 27B Q4_K_M | tg128@d32768 | 35.65 | 36.21 | 1.02 |
| qwen35moe 35B.A3B Q4_K_M | tg128 | 164.40 | 169.44 | 1.03 |
| qwen35moe 35B.A3B Q4_K_M | tg128@d32768 | 143.53 | 147.15 | 1.03 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES; scoped the task for it to solve, then reviewed, verified, and cleaned up the code afterwards

Also, distribute all elements across CTAs evenly instead of launching
one CTA per dim
@ORippler ORippler requested a review from a team as a code owner April 24, 2026 19:20
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 24, 2026
@JohannesGaessler JohannesGaessler merged commit b1a5bd4 into ggml-org:master Apr 26, 2026
45 of 47 checks passed
@ORippler ORippler deleted the osimons/CUDA_faster_cont_concat branch April 27, 2026 09:04
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)

3 participants