
CUDA: Factor out and re-use block_reduce function#18785

Merged
am17an merged 17 commits into ggml-org:master from ORippler:osimons/factor_out_two_stage_warp_reductions
Jan 15, 2026
Conversation

@ORippler
Collaborator

This was an open TODO from #17004 on the CUDA side.

Moving the shared-memory allocation out of the `__device__` function into the `__global__` function allows for explicit smem reuse, as either the compiler or the CUDA runtime does not seem to free it afterwards (`cudaFuncSetAttribute` fails unless the allocation is accounted for once per call to `two_stage_warp_reduce`).
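The pattern described above can be sketched as follows. This is an illustrative, hedged sketch rather than the actual llama.cpp code: `my_kernel` and the exact guards are mine, and a 1D block of at most `WARP_SIZE * WARP_SIZE` threads is assumed. The point is that the `__shared__` buffer is declared once at kernel scope and passed into the `__device__` helper, so repeated reductions reuse the same smem instead of each call carrying its own allocation.

```cuda
#define WARP_SIZE 32

// Illustrative warp-level sum via shuffles; llama.cpp has its own warp_reduce_* helpers.
template <typename T>
static __device__ T warp_reduce_sum(T val) {
#pragma unroll
    for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset, WARP_SIZE);
    }
    return val;
}

// Two-stage block reduction: smem is a parameter, not a local __shared__ array.
template <typename T>
static __device__ T block_reduce_sum(T val, T * shared_vals) {
    const int lane    = threadIdx.x % WARP_SIZE;
    const int warp_id = threadIdx.x / WARP_SIZE;

    val = warp_reduce_sum(val);      // stage 1: reduce within each warp
    if (lane == 0) {
        shared_vals[warp_id] = val;  // publish one partial per warp
    }
    __syncthreads();

    // stage 2: the first warp reduces the per-warp partials
    const int n_warps = blockDim.x / WARP_SIZE;
    val = lane < n_warps ? shared_vals[lane] : T(0);
    return warp_reduce_sum(val);
}

static __global__ void my_kernel(const float * x, float * dst, int n) {
    // Declared once at __global__ scope and reused by both reductions below,
    // so the kernel's smem footprint is accounted for exactly once.
    __shared__ float smem[WARP_SIZE];

    const float v   = (int) threadIdx.x < n ? x[threadIdx.x] : 0.0f;
    const float sum = block_reduce_sum(v,   smem);
    const float sq  = block_reduce_sum(v*v, smem);

    if (threadIdx.x == 0) {
        dst[0] = sum;
        dst[1] = sq;
    }
}
```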
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Jan 12, 2026
@ORippler ORippler requested a review from ggerganov as a code owner January 13, 2026 14:21
@ORippler
Collaborator Author

Added type traits and expanded to cover all types supported for warp_reduce_sum/warp_reduce_max.

@JohannesGaessler Do you know if there was a reason we kept rms_norm_back disabled for ncols % WARP_SIZE != 0? With 8bc326a, tests pass for me locally on CUDA even for that case, so I enabled support to see how it behaves in CI.

@ORippler ORippler changed the title CUDA: Factor out and re-use two_stage_warp_reduce function CUDA: Factor out and re-use block_reduce function Jan 13, 2026
};

template <block_reduce_method reduce_method_t, const unsigned int block_size_template = 0, typename T>
static __device__ T block_reduce(T val, T * shared_vals) {
Contributor
Maybe this should be called `block_reduce_1d`; users might expect `block_reduce` to reduce over any dimension of the block. Or perhaps we can add an assert that `blockDim.y == 1`.

Contributor

@JohannesGaessler JohannesGaessler left a comment


I think it would be good to have a more templated approach to reductions like this in the CUDA backend, but we should aim to do it consistently for both warp-wise and block-wise reductions.

I don't remember why I put in the restriction for the backwards pass; if the corresponding `test-backend-ops` grad test passes, it is fine to remove.

@github-actions github-actions bot added the testing (Everything test related) label Jan 13, 2026
This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
@am17an am17an merged commit 36f0132 into ggml-org:master Jan 15, 2026
74 of 76 checks passed
@ORippler ORippler deleted the osimons/factor_out_two_stage_warp_reductions branch March 13, 2026 19:04
MaheshJakkala pushed a commit to MaheshJakkala/llama.cpp that referenced this pull request Mar 15, 2026
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of `__device__` function to `__global__` function
allows for explicit smem reuse, as either compiler or cuda rt seem to not
free it afterwards (`cudaFuncSetAttribute` fails when not accounting for
it once for each call to two_stage_warp_reduce)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an, such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>