CUDA: Factor out and re-use block_reduce function#18785
am17an merged 17 commits into ggml-org:master
Moving smem out of the `__device__` function into the `__global__` function allows for explicit smem reuse, as either the compiler or the CUDA runtime does not seem to free it afterwards (`cudaFuncSetAttribute` fails when the shared memory is not accounted for once per call to two_stage_warp_reduce)
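As a hedged sketch of that pattern (illustrative names, not the exact PR code): the shared-memory buffer is declared in the `__global__` kernel and passed into each reduction call, so the same bytes are explicitly reused across both reductions instead of each `__device__` call site implicitly carrying its own allocation.

```cuda
#include <cuda_runtime.h>

// Sketch only: assumes blockDim.x is a multiple of warpSize and at most
// 32 warps per block; names are illustrative, not the PR's exact code.
template <typename T>
static __device__ T block_reduce_sum(T val, T * shared_vals) {
    // stage 1: shuffle-based reduction within each warp
    for (int offset = warpSize/2; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xFFFFFFFF, val, offset);
    }
    // stage 2: one partial result per warp, combined via the caller's buffer
    const int warp_id = threadIdx.x / warpSize;
    const int lane_id = threadIdx.x % warpSize;
    if (lane_id == 0) {
        shared_vals[warp_id] = val;
    }
    __syncthreads();
    val = lane_id < blockDim.x / warpSize ? shared_vals[lane_id] : T(0);
    for (int offset = warpSize/2; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}

__global__ void two_reductions_kernel(const float * x, float * dst) {
    __shared__ float buf[32];                       // owned by the __global__ function
    const float v   = x[blockIdx.x * blockDim.x + threadIdx.x];
    const float sum = block_reduce_sum(v, buf);     // first use of buf
    __syncthreads();                                // buf was read above; sync before reuse
    const float sqs = block_reduce_sum(v * v, buf); // explicit reuse of the same smem
    if (threadIdx.x == 0) {
        dst[2*blockIdx.x + 0] = sum;
        dst[2*blockIdx.x + 1] = sqs;
    }
}
```

With the buffer owned by the kernel, the launch only has to budget shared memory once, which matches the observed `cudaFuncSetAttribute` behavior described above.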
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Also integrate it into norm_f32 function
Also address other requests by @am17an, such as variable renaming
Unit tests passed locally; let's see if they pass in the CI as well
This is more type-safe than a plain enum
Added type traits and expanded to cover all types supported for warp_reduce. @JohannesGaessler Do you know if there was a reason we kept rms_norm_back disabled?
Rename: two_stage_warp_reduce function -> block_reduce function
```cuda
template <block_reduce_method reduce_method_t, const unsigned int block_size_template = 0, typename T>
static __device__ T block_reduce(T val, T * shared_vals) {
```
Maybe this should be called block_reduce_1d; users might expect block_reduce to reduce over any dimension of the block. Or perhaps we can add an assert that blockDim.y == 1.
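The suggested guard could look like this (a hypothetical sketch of the reviewer's idea, not code from the PR):

```cuda
// Hypothetical guard: block_reduce only reduces across threadIdx.x,
// so reject 2D/3D thread blocks at runtime in debug builds.
template <block_reduce_method reduce_method_t, const unsigned int block_size_template = 0, typename T>
static __device__ T block_reduce(T val, T * shared_vals) {
    assert(blockDim.y == 1 && blockDim.z == 1); // 1D blocks only
    // ... reduction over threadIdx.x as before ...
}
```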
JohannesGaessler left a comment:
I think it would be good to have a more templated approach to reduction like this in the CUDA backend, but we should aim to do this consistently for both warp-wise and block-wise reductions.
I don't remember why I put in the restriction for the backwards pass; if the corresponding test-backend-ops grad test passes, it is fine to remove.
This delays evaluation until the template is actually instantiated. Otherwise, some compilers may evaluate the assert when parsing the template, resulting in build errors as observed here: https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
Squash-merge commit message:

* CUDA: Refactor and expose two_stage_warp_reduce_* function
* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it. Moving smem out of the `__device__` function into the `__global__` function allows for explicit smem reuse, as either the compiler or the CUDA runtime does not seem to free it afterwards (`cudaFuncSetAttribute` fails when not accounting for it once for each call to two_stage_warp_reduce)
* Update ggml/src/ggml-cuda/common.cuh (co-authored-by: Aman Gupta <amangupta052@gmail.com>)
* Use two_stage_warp_reduce in group_norm_f32
* Use two_stage_warp_reduce in rms_norm_f32
* Fix smem calculation, which expects bytes
* Make `two_stage_warp_reduce` accept all values warp_reduce accepts; also integrate it into the norm_f32 function
* Use two_stage_warp_reduce in l2_norm_f32
* Use type traits for block reduction for better legibility; also address other requests by @am17an such as variable renaming
* Make norm tests cover all CUDA paths
* Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK (unit tests passed locally)
* Use `enum class` for `block_reduce_method` (more type-safe than a plain enum)
* Rename variables as suggested in code review by @am17an
* Rename two_stage_warp_reduce -> block_reduce
* Fix trailing whitespace in common.cuh
* Make condition of static_assert type-dependent. This delays evaluation until the template is actually instantiated; otherwise, some compilers may evaluate the assert when parsing the template, resulting in build errors as observed here: https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
* Inline definitions

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
This was an open TODO from #17004 on the CUDA side.