
CUDA: Factor out and re-use block_reduce function#18785

Merged
am17an merged 17 commits into ggml-org:master from ORippler:osimons/factor_out_two_stage_warp_reductions
Jan 15, 2026
Conversation

@ORippler
Collaborator

This was an open TODO from #17004 on the CUDA side.

Moving the shared-memory allocation out of the `__device__` function into the `__global__` function allows for explicit smem reuse, as either the compiler or the CUDA runtime does not seem to free it afterwards (`cudaFuncSetAttribute` fails unless the allocation is accounted for once per call to `two_stage_warp_reduce`).
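The pattern described above can be sketched as follows. This is an illustrative, hedged sketch rather than the actual llama.cpp code: `my_kernel` and the exact guards are mine, and a 1D block of at most `WARP_SIZE * WARP_SIZE` threads is assumed. The point is that the `__shared__` buffer is declared once at kernel scope and passed into the `__device__` helper, so repeated reductions reuse the same smem instead of each call carrying its own allocation.

```cuda
#define WARP_SIZE 32

// Illustrative warp-level sum via shuffles; llama.cpp has its own warp_reduce_* helpers.
template <typename T>
static __device__ T warp_reduce_sum(T val) {
#pragma unroll
    for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset, WARP_SIZE);
    }
    return val;
}

// Two-stage block reduction: smem is a parameter, not a local __shared__ array.
template <typename T>
static __device__ T block_reduce_sum(T val, T * shared_vals) {
    const int lane    = threadIdx.x % WARP_SIZE;
    const int warp_id = threadIdx.x / WARP_SIZE;

    val = warp_reduce_sum(val);      // stage 1: reduce within each warp
    if (lane == 0) {
        shared_vals[warp_id] = val;  // publish one partial per warp
    }
    __syncthreads();

    // stage 2: the first warp reduces the per-warp partials
    const int n_warps = blockDim.x / WARP_SIZE;
    val = lane < n_warps ? shared_vals[lane] : T(0);
    return warp_reduce_sum(val);
}

static __global__ void my_kernel(const float * x, float * dst, int n) {
    // Declared once at __global__ scope and reused by both reductions below,
    // so the kernel's smem footprint is accounted for exactly once.
    __shared__ float smem[WARP_SIZE];

    const float v   = (int) threadIdx.x < n ? x[threadIdx.x] : 0.0f;
    const float sum = block_reduce_sum(v,   smem);
    const float sq  = block_reduce_sum(v*v, smem);

    if (threadIdx.x == 0) {
        dst[0] = sum;
        dst[1] = sq;
    }
}
```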
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Jan 12, 2026
@ORippler ORippler requested a review from ggerganov as a code owner January 13, 2026 14:21
@ORippler
Collaborator Author

Added type traits and expanded to cover all types supported for warp_reduce_sum/warp_reduce_max.

@JohannesGaessler Do you know if there was a reason we kept rms_norm_back disabled for ncols % WARP_SIZE != 0? With 8bc326a, tests pass for me locally on CUDA even for that case, so I enabled support to see how it behaves in CI.

@ORippler ORippler changed the title CUDA: Factor out and re-use two_stage_warp_reduce function CUDA: Factor out and re-use block_reduce function Jan 13, 2026
};

template <block_reduce_method reduce_method_t, const unsigned int block_size_template = 0, typename T>
static __device__ T block_reduce(T val, T * shared_vals) {
Contributor
Maybe this should be called `block_reduce_1d`; users might expect `block_reduce` to reduce over any dimension of the block. Or perhaps we can add an assert that `blockDim.y == 1`.

Contributor

@JohannesGaessler JohannesGaessler left a comment


I think it would be good to have a more templated approach to reductions like this in the CUDA backend, but we should aim to do it consistently for both warp-wise and block-wise reductions.

I don't remember why I put in the restriction for the backwards pass; if the corresponding `test-backend-ops` grad test passes, it is fine to remove.

@github-actions github-actions bot added the testing (Everything test related) label Jan 13, 2026
This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
@am17an am17an merged commit 36f0132 into ggml-org:master Jan 15, 2026
74 of 76 checks passed
@ORippler ORippler deleted the osimons/factor_out_two_stage_warp_reductions branch March 13, 2026 19:04
MaheshJakkala pushed a commit to MaheshJakkala/llama.cpp that referenced this pull request Mar 15, 2026
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of `__device__` function to `__global__` function
allows for explicit smem reuse, as either compiler or cuda rt seem to not
free it afterwards (`cudaFuncSetAttribute` fails when not accounting for
it once for each call to two_stage_warp_reduce)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an, such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>