[CUDA] Increase number of output elements per thread block if the K-dimension is small #20635
gaugarg-nv wants to merge 3 commits into ggml-org:master
Conversation
This also helps with Qwen3.5, which has a down shape of 512.
Adding Qwen3.5-35B-A3B data on RTX Pro 6000 BW:
ggml/src/ggml-cuda/mmvq.cu
Outdated
```cpp
    { \
        constexpr int c_ncols_dst = C_NCOLS_DST; \
        const int nwarps = calc_nwarps(type, c_ncols_dst, table_id); \
        const bool use_small_k = nwarps > 1 && blocks_per_row_x < nwarps * blocks_per_iter_1warp; \
```
can this be done inside the cuda kernel using multiplications by re-ordering this expression? It would simplify the code quite a bit
To be clear, are you suggesting to remove small_k template parameter from the kernel? I think that should be doable.
But we will still need this code on the host as we modify rows_per_block for small_k, which in turn modifies the grid dimensions.
Yes exactly. On the host side we can create a function to return the correct dims instead of doing the if/else and the macro
I will try to simplify the host code.
But I think removing small_k as a template parameter from the kernel would mean rows_per_cuda_block can no longer be constexpr. And some of the local register and shared memory allocation depend on this value to be constexpr.
I see. Then it makes sense to leave the template parameter as it is. My only worry was the compile times/binary sizes. Do you notice any difference in them? If you're using ninja build it should be pretty easy to see via .ninja_log
@am17an Sorry for the late follow-up on this. I was busy with some other work.
Regarding build times, I am seeing an increase of 12 seconds and an increase of 2MB in libggml-cuda.so.
IMO, given that most SOTA models are MOE, it is worth taking this hit.
Regarding host-side code simplification, I don't think we can avoid the if/else, since small_k is a template parameter.
Let me know if there are any specific ideas you have regarding code simplification.
Are you compiling only for Blackwell? I think this change will slow down the CI. We can limit this change to ncols_dst = 1, since that covers 99% of the use cases.
228.646s ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmvq.cu.o
169.613s ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q2_k.cu.o
163.399s ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-tile-instance-dkq256-dv256.cu.o
153.896s ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q6_k.cu.o
147.442s ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_1.cu.o
144.375s ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q3_k.cu.o
144.037s ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq2_s.cu.o
139.735s ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q5_0.cu.o
138.991s ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq2_xs.cu.o
132.618s ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-mma-f16-instance-ncols1_64-ncols2_1.cu.o
Ok, I have limited the change to ncols_dst = 1 for now. This would obviously mean no scaling for BS > 1. I will spend more time on this kernel later and see how we can add specialization for small-k without adding too much compilation time and library size.
The typical way is to separate ncols into separate template files. But I think it might be overkill for now
ggml/src/ggml-cuda/mmvq.cu
Outdated
```cpp
    // When K is small, increase rows_per_block to match nwarps so each warp
    // processes a different row. This amortizes y-vector reads and reduces block count.
    // Trigger when the full thread block covers all K blocks in a single loop iteration.
```
If you increase rows_per_block that will not result in different warps working on different src0 rows. Each thread will still work on every row/column assigned to a CUDA block, all that changes is that the inner loop is over more rows so the compiler should in principle be able to recognize that the same data is being loaded multiple times. To avoid wasted work I would suggest you reduce the number of warps per CUDA block instead.
Regardless of the above, it may very well be that increasing rows_per_block is beneficial on Blackwell in general (I did not test this). Did you test the impact of your small_k config for larger matrices?
Generally speaking, MMVQ is of comparatively poor code quality for historical reasons. It's among the first kernels that I wrote so I was less experienced (llama.cpp/ggml was my first contact with CUDA) and I didn't yet find the time to circle back to it to look into how it could be improved. It may be worthwhile for you to look for optimization opportunities beyond applications for TP.
If you increase rows_per_block that will not result in different warps working on different src0 rows. Each thread will still work on every row/column assigned to a CUDA block, all that changes is that the inner loop is over more rows so the compiler should in principle be able to recognize that the same data is being loaded multiple times.
Thanks @JohannesGaessler for the pointers. You're right about the kernel. I will update the comments.
To avoid wasted work I would suggest you reduce the number of warps per CUDA block instead.
I tried setting n_warps to 1 or 2 for small_k without modifying rows_per_block. Neither showed any performance improvement. In general, I think very small CTAs are not very efficient.
Regardless of the above, it may very well be that increasing rows_per_block is beneficial on Blackwell (in general, I did not test this). Did you test the impact of your small_k config for larger matrices.
Yes, I tried setting small_k to true for all cases, which would increase rows_per_block in all cases. This gave me a regression of 4-6% for a few models in the BS=1 case, so I rejected the idea.
It may be worthwhile for you to look for optimization opportunities beyond applications for TP.
Yes, the main motivation of this work was Tensor parallelism. I will explore more ideas.
In general, for BS=1, the performance of the kernels looks good except for some corner cases like small K. I think one idea worth pursuing is using one warp per output element for small K, so that we avoid shared memory and use a lightweight warp reduce instead of a block-level reduce. I can try experimenting with this idea.
Would you like to proceed with this PR first, or should we do more exploration around these ideas first?
I would say that if this PR empirically improves performance for some cases we should keep it, just change the rationale since it is misleading.
@JohannesGaessler updated the comments and PR description.
With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536. The current heuristic uses a group of 4 warps irrespective of the K-dimension size, leaving some of the threads idle. This results in poor performance for these matrices. This change increases the number of output elements per block for such cases.
Force-pushed from 4f20a44 to cfbbfb2
The K-dimension (inner dot product dimension) of the FFN-down matrices can be quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536. The current heuristic uses a group of 4 warps irrespective of the K-dimension size, leaving some of the threads idle. This results in poor performance for these matrices.
This change increases the number of output elements per block for such matrices.
This change is also helpful for Tensor parallelism (PR #19378), where FFN-down is split along the K dimension.
Single GPU Performance on 1x RTX Pro 6000 Blackwell
Tensor Parallelism Performance on 2x RTX Pro 6000 Blackwell with PR 19378