Use a single warp per element instead of a single block per element if the K-dimension is small#13

Closed
gaugarg-nv wants to merge 2 commits into JohannesGaessler:ggml-meta-backend-8 from gaugarg-nv:small_k_opt
Conversation

@gaugarg-nv
With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of the K-dimension size, leaving some of the threads idle and resulting in poor performance for these matrices.

This change instead uses a single warp for the inner dot product and increases the number of output elements computed per block.

For the Qwen3-235B-A22B FFN-down GEMV kernel, this change yields a 1.7x speedup (18.88 microseconds down to 11.088 microseconds).
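The idea can be sketched as a warp-per-row GEMV kernel: each warp strides one 32-lane pass over the (small) K-dimension and reduces with warp shuffles, so a block of 4 warps produces 4 output elements instead of all 4 warps sharing a single dot product. This is an illustrative sketch, not the actual ggml kernel; the kernel name, float types, and the `dim3 block(32, 4)` launch shape are assumptions.

```cuda
// Hypothetical sketch (not the real ggml-cuda code): one warp per output
// element. Launch with e.g. dim3 block(32, 4) so each block covers 4 rows.
__global__ void gemv_warp_per_row(const float * __restrict__ A, // M x K matrix
                                  const float * __restrict__ x, // K-vector
                                  float * __restrict__ y,       // M-vector out
                                  const int K, const int M) {
    const int lane = threadIdx.x;                    // lane within warp, 0..31
    const int warp = threadIdx.y;                    // warp index within block
    const int row  = blockIdx.x*blockDim.y + warp;   // one output row per warp
    if (row >= M) {
        return;
    }

    // Strided partial dot product: cheap when K is small (e.g. 768 or 1536).
    float sum = 0.0f;
    for (int k = lane; k < K; k += 32) {
        sum += A[(size_t) row*K + k] * x[k];
    }

    // Butterfly reduction across the 32 lanes of this warp.
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_xor_sync(0xffffffff, sum, offset);
    }

    if (lane == 0) {
        y[row] = sum;
    }
}
```

With the previous block-per-element scheme, a K-dimension of 768 leaves most of a 4-warp group idle; here each warp of 32 threads makes 24 strided passes over K=768 with no idle lanes.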

…f the K-dimension is small

With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of the K-dimension size, leaving some of the threads idle and resulting in poor performance for these matrices.

This change instead uses a single warp for the inner dot product and increases the number of output elements computed per block.
@gaugarg-nv gaugarg-nv changed the base branch from master to ggml-meta-backend-8 March 15, 2026 11:43
@gaugarg-nv
Author

Closing this PR, will file one on master.

@gaugarg-nv gaugarg-nv closed this Mar 15, 2026
1 participant