Use a single warp per element instead of a single block per element if the K-dimension is small by gaugarg-nv · Pull Request #13 · JohannesGaessler/llama.cpp

gaugarg-nv · 2026-03-15T11:43:18Z

With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MoE models. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536.
The current heuristic uses a group of 4 warps regardless of the K-dimension size, which leaves some of the threads idle and results in poor performance for these matrices.
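
To make the idle-thread effect concrete, here is a minimal host-side sketch in C++ (not the kernel code). It assumes each thread consumes a fixed chunk of the row per pass; the chunk size `elems_per_thread` and the function name `active_threads` are hypothetical and not taken from this PR. With a hypothetical chunk of 16 elements, a 4-warp group (128 threads) has work for only 96 threads at K = 1536 and 48 threads at K = 768, whereas a single 32-thread warp is fully occupied in both cases.

```cpp
#include <algorithm>

// Number of threads that receive any work in one pass over a row of length K.
// threads_per_group: threads cooperating on one output element (e.g. 4 warps = 128).
// elems_per_thread:  hypothetical number of row elements one thread consumes per pass.
int active_threads(int K, int threads_per_group, int elems_per_thread) {
    // Threads needed to cover the whole row in a single pass (rounded up).
    const int needed = (K + elems_per_thread - 1) / elems_per_thread;
    // A group can never use more threads than it has.
    return std::min(needed, threads_per_group);
}
```

The real per-thread granularity depends on the data type and kernel, so the absolute numbers are only illustrative; the point is that utilization drops as K shrinks while the group size stays fixed.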

When the K-dimension is small, this change uses a single warp for the inner dot product and increases the number of output elements computed per block instead.
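
The warp-per-element layout can be sketched on the CPU as follows. This is an emulation in plain C++ under stated assumptions, not the CUDA kernel from this PR: `warp_dot`, `gemv_warp_per_row`, and `rows_per_block` are hypothetical names, the matrix is assumed to be dense row-major `float`, and the butterfly loop stands in for what `__shfl_xor_sync` would do across the lanes of a warp on the GPU.

```cpp
#include <cstddef>
#include <vector>

constexpr int WARP_SIZE = 32;

// Emulates one warp computing one output element: each of the 32 lanes
// accumulates a strided slice of the row, then the partial sums are combined
// with a butterfly (XOR) reduction.
float warp_dot(const float* row, const float* x, int K) {
    float partial[WARP_SIZE] = {0.0f};
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        for (int k = lane; k < K; k += WARP_SIZE) {
            partial[lane] += row[k] * x[k];
        }
    }
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        float next[WARP_SIZE];
        for (int lane = 0; lane < WARP_SIZE; ++lane) {
            next[lane] = partial[lane] + partial[lane ^ offset];
        }
        for (int lane = 0; lane < WARP_SIZE; ++lane) {
            partial[lane] = next[lane];
        }
    }
    return partial[0];  // after the reduction every lane holds the full sum
}

// Emulates a launch in which each block holds rows_per_block warps and every
// warp owns one output row, so a block produces several output elements.
std::vector<float> gemv_warp_per_row(const std::vector<float>& W,
                                     const std::vector<float>& x,
                                     int N, int K, int rows_per_block) {
    std::vector<float> y(N, 0.0f);
    const int num_blocks = (N + rows_per_block - 1) / rows_per_block;
    for (int block = 0; block < num_blocks; ++block) {
        for (int warp = 0; warp < rows_per_block; ++warp) {
            const int row = block * rows_per_block + warp;
            if (row >= N) continue;  // the last block may be partially filled
            y[row] = warp_dot(&W[static_cast<std::size_t>(row) * K], x.data(), K);
        }
    }
    return y;
}
```

Calling `gemv_warp_per_row(W, x, N, K, 4)` keeps 4 warps per block, as the existing heuristic does, but assigns each warp its own output row, so no cross-warp reduction is needed.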

For the Qwen3-235B-A22B FFN-down GEMV kernel, this gives a 1.7x speedup (18.88 microseconds down to 11.088 microseconds).

gaugarg-nv · 2026-03-15T15:38:00Z

Closing this PR, will file one on master.

gaugarg-nv added 2 commits March 15, 2026 17:00

Use small_k only for a strict less value

610318d

gaugarg-nv requested a review from JohannesGaessler as a code owner March 15, 2026 11:43

gaugarg-nv changed the base branch from master to ggml-meta-backend-8 March 15, 2026 11:43

gaugarg-nv mentioned this pull request Mar 15, 2026

ggml: backend-agnostic tensor parallelism ggml-org/llama.cpp#19378

Open

gaugarg-nv closed this Mar 15, 2026

gaugarg-nv wants to merge 2 commits into JohannesGaessler:ggml-meta-backend-8 from gaugarg-nv:small_k_opt
