Conversation

@CISC
Collaborator

@CISC CISC commented Oct 20, 2025

Enable expert group selection for all models with it (requires reconversion or metadata editing).

Specifically, these models (maybe more):

* [DeepSeekV2.5 and onwards](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/deepseek_v3/modeling_deepseek_v3.py#L209-L222) (DeepSeekV2.5 only used the [single top score](https://huggingface.co/deepseek-ai/DeepSeek-V2.5/blob/c85b5ede86f2a598af339624cac5723861e557ed/modeling_deepseek.py#L440-L442) per group instead of the sum of the top two, but the rest is the same)

* [Dots1](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/dots1/modeling_dots1.py#L365-L376)

* [Glm4Moe](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/glm4_moe/modeling_glm4_moe.py#L389-L403)

@CISC CISC requested a review from ggerganov October 20, 2025 20:03
@CISC
Collaborator Author

CISC commented Oct 20, 2025

@jeffbolznv @am17an It would be great to have fusion for this as well. :)

    // select top n_group_used expert groups
    // https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/e815299b0bcbac849fa540c768ef21845365c9eb/modeling_deepseek.py#L440-L457
    if (hparams.n_expert_groups > 1 && n_tokens > 0) {
        const int64_t n_exp_per_group = n_expert / hparams.n_expert_groups;

        // organize experts into n_expert_groups
        ggml_tensor * selection_groups = ggml_reshape_3d(ctx0, selection_probs, n_exp_per_group, hparams.n_expert_groups, n_tokens); // [n_exp_per_group, n_expert_groups, n_tokens]

        ggml_tensor * group_scores = ggml_top_k(ctx0, selection_groups, 2); // [2, n_expert_groups, n_tokens]
        group_scores = ggml_get_rows(ctx0, ggml_reshape_4d(ctx0, selection_groups, 1, selection_groups->ne[0], selection_groups->ne[1], selection_groups->ne[2]), group_scores); // [1, 2, n_expert_groups, n_tokens]

        // get top n_group_used expert groups
        group_scores = ggml_sum_rows(ctx0, ggml_reshape_3d(ctx0, group_scores, group_scores->ne[1], group_scores->ne[2], group_scores->ne[3])); // [1, n_expert_groups, n_tokens]
        group_scores = ggml_reshape_2d(ctx0, group_scores, group_scores->ne[1], group_scores->ne[2]); // [n_expert_groups, n_tokens]

        ggml_tensor * expert_groups = ggml_top_k(ctx0, group_scores, hparams.n_group_used); // [n_group_used, n_tokens]
        cb(expert_groups, "ffn_moe_group_topk", il);

        // mask out the other groups
        selection_probs = ggml_get_rows(ctx0, selection_groups, expert_groups); // [n_exp_per_group, n_group_used, n_tokens]
        selection_probs = ggml_set_rows(ctx0, ggml_scale_bias(ctx0, selection_groups, 0.0f, -INFINITY), selection_probs, expert_groups); // [n_exp_per_group, n_expert_groups, n_tokens]
        selection_probs = ggml_reshape_2d(ctx0, selection_probs, n_expert, n_tokens); // [n_expert, n_tokens]
        cb(selection_probs, "ffn_moe_probs_masked", il);
    }
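
For reference, here is a minimal scalar C++ sketch (my own illustration, not llama.cpp code) of what the graph above computes for a single token, assuming n_exp_per_group >= 2: score each group by the sum of its two best experts, keep the n_group_used best groups, and push every other group's experts to -INF. The ggml version does the same thing batched over all n_tokens.

// Reference sketch only; names and layout are illustrative.
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

void mask_unused_expert_groups(std::vector<float> & probs, int n_expert_groups, int n_group_used) {
    const int n_exp_per_group = (int) probs.size() / n_expert_groups;

    // score each group by the sum of its top-2 expert selection probabilities
    std::vector<std::pair<float, int>> group_scores; // (score, group index)
    for (int g = 0; g < n_expert_groups; ++g) {
        std::vector<float> grp(probs.begin() + g*n_exp_per_group, probs.begin() + (g + 1)*n_exp_per_group);
        std::partial_sort(grp.begin(), grp.begin() + 2, grp.end(), std::greater<float>());
        group_scores.emplace_back(grp[0] + grp[1], g);
    }

    // keep the n_group_used highest-scoring groups
    std::sort(group_scores.begin(), group_scores.end(), std::greater<>());
    std::vector<bool> keep(n_expert_groups, false);
    for (int i = 0; i < n_group_used; ++i) {
        keep[group_scores[i].second] = true;
    }

    // mask out the experts of every other group so they can never be selected
    for (int g = 0; g < n_expert_groups; ++g) {
        if (!keep[g]) {
            std::fill(probs.begin() + g*n_exp_per_group, probs.begin() + (g + 1)*n_exp_per_group, -INFINITY);
        }
    }
}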

@jeffbolznv
Collaborator

@jeffbolznv @am17an It would be great to have fusion for this as well. :)

Agreed. Let's catch up with the clamping change and get these common code utilities in first, then we should do this.

@github-actions github-actions bot added the python (python script changes) label Oct 20, 2025
@CISC
Collaborator Author

CISC commented Oct 26, 2025

@ggerganov gentle ping

@CISC CISC merged commit 73a48c9 into master Oct 26, 2025
76 of 77 checks passed
@CISC CISC deleted the cisc/enable-expert-group-selection branch October 26, 2025 16:21
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 27, 2025
theo77186 pushed a commit to theo77186/llama.cpp that referenced this pull request Oct 28, 2025
@jukofyork
Collaborator

Enable expert group selection for all models with it (requires reconversion or metadata editing).

Specifically, these models (maybe more):

* [DeepSeekV2.5 and onwards](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/deepseek_v3/modeling_deepseek_v3.py#L209-L222) (DeepSeekV2.5 only used the [single top score](https://huggingface.co/deepseek-ai/DeepSeek-V2.5/blob/c85b5ede86f2a598af339624cac5723861e557ed/modeling_deepseek.py#L440-L442) per group instead of the sum of the top two, but the rest is the same)

* [Dots1](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/dots1/modeling_dots1.py#L365-L376)

* [Glm4Moe](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/glm4_moe/modeling_glm4_moe.py#L389-L403)

Kimi-K2 also has it, but it's just 1 group:

  "n_group": 1,
  "topk_group": 1,

I've patched my deepseek-r1 GGUF to use it:

llama_model_loader: - kv  46:               deepseek2.expert_group_count u32              = 8
llama_model_loader: - kv  47:          deepseek2.expert_group_used_count u32              = 4
print_info: n_expert_groups  = 8
print_info: n_group_used     = 4

and it's reduced tg (token generation) by around 5-6% and pp (prompt processing) by around 10%.

@am17an
Collaborator

am17an commented Oct 29, 2025

My guess is that this doesn't take the topk-moe path because of this change

@CISC
Collaborator Author

CISC commented Oct 29, 2025

Kimi-K2 also has it, but it's just 1 group:

Yes, effectively disabling it.

I've patched my deepseek-r1 GGUF to use it:
and it's reduced the tg by around 5-6% and pp by around 10%.

To be expected, should be recoverable by fusion.

@am17an
Collaborator

am17an commented Oct 29, 2025

Also I didn't expect such a drastic drop in PP due to missing fusion. @jukofyork what is your setup?

@jukofyork
Collaborator

Also I didn't expect such a drastic drop in PP due to missing fusion. @jukofyork what is your setup?

Yeah, I think I may have been wrong about this, as it's been a while since I compiled a new version of llama.cpp on this machine. I just noticed kimi-k2 (which I haven't touched) got a similar drop in PP, so it is probably just the TG that has dropped.

@jukofyork
Collaborator

jukofyork commented Oct 29, 2025

slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 23573
slot update_slots: id  0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 12288, n_tokens = 12288, progress = 0.521274
/home/juk/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:89: CUDA error
ggml_cuda_compute_forward: ARGSORT failed
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_compute_forward at /home/juk/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2692
  err
/home/juk/llama.cpp/build/bin/libggml-base.so(+0x16298)[0x7fbd5e26e298]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x1e4)[0x7fbd5e26e664]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x11e)[0x7fbd5e26e7ee]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(+0x1225e3)[0x7fbd56d225e3]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(+0x13077f)[0x7fbd56d3077f]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(+0x1337ef)[0x7fbd56d337ef]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x807)[0x7fbd5e288a67]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x7fbd5e099411]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe1)[0x7fbd5e09aa81]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x291)[0x7fbd5e0a0271]
/home/juk/llama.cpp/build/bin/libllama.so(llama_decode+0xb)[0x7fbd5e0a116b]
/home/juk/llama.cpp/build/bin/llama-server(+0xdbdb2)[0x558fb2853db2]
/home/juk/llama.cpp/build/bin/llama-server(+0xa1fec)[0x558fb2819fec]
/home/juk/llama.cpp/build/bin/llama-server(+0x6196f)[0x558fb27d996f]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7fbd5da4624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fbd5da46305]
/home/juk/llama.cpp/build/bin/llama-server(+0x636b1)[0x558fb27db6b1]

I'm getting the crash above, which doesn't occur when I override the keys:

print_info: n_expert_groups  = 1
print_info: n_group_used     = 1

This version of llama.cpp has a few patches applied (increasing the minimum offload size to 2560 tokens, custom speculative decoding code, etc.), but the crash seemed to happen as soon as it was given a very large prompt to process, so I doubt the custom speculative decoding code is at fault.

@jukofyork
Collaborator

It seems that the crash occurs when ubatch >= 8192. It works fine when ubatch < 8192.

The only limit I can see to do with GGML_OP_ARGSORT is this:

        case GGML_OP_ARGSORT:
#ifndef GGML_CUDA_USE_CUB
            return op->src[0]->ne[0] <= 1024;
#else   
            return true;
#endif

so I wonder if it has to do with n_expert_groups = 8 causing 8x this value?

@CISC
Collaborator Author

CISC commented Oct 29, 2025

This version of llama.cpp has a few patches applied (increasing the minimum offload size to 2560 tokens, custom speculative decoding code, etc.), but the crash seemed to happen as soon as it was given a very large prompt to process, so I doubt the custom speculative decoding code is at fault.

Provided this version has the recent ARGSORT changes (CPU fallback or CUB implementation) from master, I think this can only happen if nrows is larger than 64k, i.e. n_expert_groups * n_tokens.
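
To illustrate (a standalone CUDA sketch of my own, not the actual ggml-cuda argsort launch code): CUDA caps gridDim.y and gridDim.z at 65535, while gridDim.x goes up to 2^31-1, so a kernel that launches one block per row in the y dimension fails as soon as nrows passes that limit. With n_expert_groups = 8 and the 12288-token ubatch from the log above, nrows = 8 * 12288 = 98304.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    const int nrows = 8 * 12288; // n_expert_groups * n_tokens

    noop<<<dim3(1, nrows, 1), 1>>>(); // gridDim.y = 98304 > 65535 -> "invalid configuration argument"
    printf("rows in y: %s\n", cudaGetErrorString(cudaGetLastError()));

    noop<<<dim3(nrows, 1, 1), 1>>>(); // gridDim.x limit is 2^31-1, so this launch is accepted
    printf("rows in x: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaDeviceSynchronize();
    return 0;
}

This also matches the ubatch >= 8192 observation above: 8 * 8192 = 65536 is just past the limit.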

@CISC
Collaborator Author

CISC commented Oct 29, 2025

so wonder if it is to do with n_expert_groups = 8 causing 8x this value?

Ah, yes, that would be it.

@CISC
Collaborator Author

CISC commented Oct 29, 2025

I guess we need a check and fallback to CUB.

@jukofyork
Collaborator

so wonder if it is to do with n_expert_groups = 8 causing 8x this value?

Ah, yes, that would be it.

There is another test against 1024 here:

if (shared_mem > max_shared_mem || ncols > 1024) {

@CISC
Collaborator Author

CISC commented Oct 29, 2025

There is another test against 1024 here:

Sure, but that's ncols (used in x), which is not the issue; it's nrows (used in the 16-bit y grid dimension) that causes the error.

@CISC
Collaborator Author

CISC commented Oct 29, 2025

I think we can swap the parameters instead of falling back to CUB.

@CISC
Collaborator Author

CISC commented Oct 29, 2025

Yep, worked, making a PR.

@jukofyork
Collaborator

My issue was fixed by #16849.
