Conversation

@CISC
Collaborator

@CISC CISC commented Oct 20, 2025

Enable expert group selection for all models with it (requires reconversion or metadata editing).

Specifically, these models (maybe more):

* [DeepSeekV2.5 and onwards](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/deepseek_v3/modeling_deepseek_v3.py#L209-L222) (DeepSeekV2.5 only used the [single top score](https://huggingface.co/deepseek-ai/DeepSeek-V2.5/blob/c85b5ede86f2a598af339624cac5723861e557ed/modeling_deepseek.py#L440-L442) per group instead of the sum of the top two, but the rest is the same)

* [Dots1](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/dots1/modeling_dots1.py#L365-L376)

* [Glm4Moe](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/glm4_moe/modeling_glm4_moe.py#L389-L403)

@CISC CISC requested a review from ggerganov October 20, 2025 20:03
@CISC
Collaborator Author

CISC commented Oct 20, 2025

@jeffbolznv @am17an It would be great to have fusion for this as well. :)

    // select top n_group_used expert groups
    // https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/e815299b0bcbac849fa540c768ef21845365c9eb/modeling_deepseek.py#L440-L457
    if (hparams.n_expert_groups > 1 && n_tokens > 0) {
        const int64_t n_exp_per_group = n_expert / hparams.n_expert_groups;

        // organize experts into n_expert_groups
        ggml_tensor * selection_groups = ggml_reshape_3d(ctx0, selection_probs, n_exp_per_group, hparams.n_expert_groups, n_tokens); // [n_exp_per_group, n_expert_groups, n_tokens]

        ggml_tensor * group_scores = ggml_top_k(ctx0, selection_groups, 2); // [2, n_expert_groups, n_tokens]
        group_scores = ggml_get_rows(ctx0, ggml_reshape_4d(ctx0, selection_groups, 1, selection_groups->ne[0], selection_groups->ne[1], selection_groups->ne[2]), group_scores); // [1, 2, n_expert_groups, n_tokens]

        // get top n_group_used expert groups
        group_scores = ggml_sum_rows(ctx0, ggml_reshape_3d(ctx0, group_scores, group_scores->ne[1], group_scores->ne[2], group_scores->ne[3])); // [1, n_expert_groups, n_tokens]
        group_scores = ggml_reshape_2d(ctx0, group_scores, group_scores->ne[1], group_scores->ne[2]); // [n_expert_groups, n_tokens]

        ggml_tensor * expert_groups = ggml_top_k(ctx0, group_scores, hparams.n_group_used); // [n_group_used, n_tokens]
        cb(expert_groups, "ffn_moe_group_topk", il);

        // mask out the other groups
        selection_probs = ggml_get_rows(ctx0, selection_groups, expert_groups); // [n_exp_per_group, n_group_used, n_tokens]
        selection_probs = ggml_set_rows(ctx0, ggml_scale_bias(ctx0, selection_groups, 0.0f, -INFINITY), selection_probs, expert_groups); // [n_exp_per_group, n_expert_groups, n_tokens]
        selection_probs = ggml_reshape_2d(ctx0, selection_probs, n_expert, n_tokens); // [n_expert, n_tokens]
        cb(selection_probs, "ffn_moe_probs_masked", il);
    }
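
For reference, here is a minimal scalar C++ sketch (my own illustration, not llama.cpp code) of what the graph above computes for a single token, assuming n_exp_per_group >= 2: score each group by the sum of its two best experts, keep the n_group_used best groups, and push every other group's experts to -INF. The ggml version does the same thing batched over all n_tokens.

// Reference sketch only; names and layout are illustrative.
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

void mask_unused_expert_groups(std::vector<float> & probs, int n_expert_groups, int n_group_used) {
    const int n_exp_per_group = (int) probs.size() / n_expert_groups;

    // score each group by the sum of its top-2 expert selection probabilities
    std::vector<std::pair<float, int>> group_scores; // (score, group index)
    for (int g = 0; g < n_expert_groups; ++g) {
        std::vector<float> grp(probs.begin() + g*n_exp_per_group, probs.begin() + (g + 1)*n_exp_per_group);
        std::partial_sort(grp.begin(), grp.begin() + 2, grp.end(), std::greater<float>());
        group_scores.emplace_back(grp[0] + grp[1], g);
    }

    // keep the n_group_used highest-scoring groups
    std::sort(group_scores.begin(), group_scores.end(), std::greater<>());
    std::vector<bool> keep(n_expert_groups, false);
    for (int i = 0; i < n_group_used; ++i) {
        keep[group_scores[i].second] = true;
    }

    // mask out the experts of every other group so they can never be selected
    for (int g = 0; g < n_expert_groups; ++g) {
        if (!keep[g]) {
            std::fill(probs.begin() + g*n_exp_per_group, probs.begin() + (g + 1)*n_exp_per_group, -INFINITY);
        }
    }
}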

@jeffbolznv
Collaborator

@jeffbolznv @am17an It would be great to have fusion for this as well. :)

Agreed. Let's catch up with the clamping change and get these common code utilities in first, then we should do this.

@github-actions github-actions bot added the python (python script changes) label Oct 20, 2025
@CISC
Collaborator Author

CISC commented Oct 26, 2025

@ggerganov gentle ping

@CISC CISC merged commit 73a48c9 into master Oct 26, 2025
76 of 77 checks passed
@CISC CISC deleted the cisc/enable-expert-group-selection branch October 26, 2025 16:21
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 27, 2025
theo77186 pushed a commit to theo77186/llama.cpp that referenced this pull request Oct 28, 2025
@jukofyork
Collaborator

Enable expert group selection for all models with it (requires reconversion or metadata editing).

Specifically, these models (maybe more):

* [DeepSeekV2.5 and onwards](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/deepseek_v3/modeling_deepseek_v3.py#L209-L222) (DeepSeekV2.5 only used the [single top score](https://huggingface.co/deepseek-ai/DeepSeek-V2.5/blob/c85b5ede86f2a598af339624cac5723861e557ed/modeling_deepseek.py#L440-L442) per group instead of the sum of the top two, but the rest is the same)

* [Dots1](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/dots1/modeling_dots1.py#L365-L376)

* [Glm4Moe](https://github.com/huggingface/transformers/blob/12a50f294d50e3d0e124511f2b6f43625f73ffce/src/transformers/models/glm4_moe/modeling_glm4_moe.py#L389-L403)

Kimi-K2 also has it, but it's just 1 group:

  "n_group": 1,
  "topk_group": 1,

I've patched my deepseek-r1 GGUF to use it:

llama_model_loader: - kv  46:               deepseek2.expert_group_count u32              = 8
llama_model_loader: - kv  47:          deepseek2.expert_group_used_count u32              = 4
print_info: n_expert_groups  = 8
print_info: n_group_used     = 4

and it's reduced tg (token generation) by around 5-6% and pp (prompt processing) by around 10%.

@am17an
Collaborator

am17an commented Oct 29, 2025

My guess is that this doesn't take the topk-moe path because of this change

@CISC
Collaborator Author

CISC commented Oct 29, 2025

Kimi-K2 also has it, but it's just 1 group:

Yes, effectively disabling it.

I've patched my deepseek-r1 GGUF to use it:
and it's reduced the tg by around 5-6% and pp by around 10%.

To be expected, should be recoverable by fusion.

@am17an
Collaborator

am17an commented Oct 29, 2025

Also I didn't expect such a drastic drop in PP due to missing fusion. @jukofyork what is your setup?

@jukofyork
Collaborator

Also I didn't expect such a drastic drop in PP due to missing fusion. @jukofyork what is your setup?

Yeah, I think I may have been wrong about this, as it's been a while since I compiled a new version of llama.cpp on this machine. I just noticed kimi-k2 (which I haven't touched) got a similar drop in PP, so it is probably just the TG that has dropped.

@jukofyork
Collaborator

jukofyork commented Oct 29, 2025

slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 23573
slot update_slots: id  0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 12288, n_tokens = 12288, progress = 0.521274
/home/juk/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:89: CUDA error
ggml_cuda_compute_forward: ARGSORT failed
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_compute_forward at /home/juk/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2692
  err
/home/juk/llama.cpp/build/bin/libggml-base.so(+0x16298)[0x7fbd5e26e298]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x1e4)[0x7fbd5e26e664]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x11e)[0x7fbd5e26e7ee]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(+0x1225e3)[0x7fbd56d225e3]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(+0x13077f)[0x7fbd56d3077f]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(+0x1337ef)[0x7fbd56d337ef]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x807)[0x7fbd5e288a67]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x7fbd5e099411]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe1)[0x7fbd5e09aa81]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x291)[0x7fbd5e0a0271]
/home/juk/llama.cpp/build/bin/libllama.so(llama_decode+0xb)[0x7fbd5e0a116b]
/home/juk/llama.cpp/build/bin/llama-server(+0xdbdb2)[0x558fb2853db2]
/home/juk/llama.cpp/build/bin/llama-server(+0xa1fec)[0x558fb2819fec]
/home/juk/llama.cpp/build/bin/llama-server(+0x6196f)[0x558fb27d996f]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7fbd5da4624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fbd5da46305]
/home/juk/llama.cpp/build/bin/llama-server(+0x636b1)[0x558fb27db6b1]

I'm getting the crash above, which doesn't occur when I override the keys:

print_info: n_expert_groups  = 1
print_info: n_group_used     = 1

This version of llama.cpp has a few patches applied (increasing the minimum offload size to 2560 tokens, custom speculative decoding code, etc.), but the crash seemed to happen as soon as it was given a very large prompt to process, so I doubt the custom speculative decoding code is at fault.

@jukofyork
Collaborator

It seems that the crash occurs when ubatch >= 8192. It works fine when ubatch < 8192.

The only limit I can see to do with GGML_OP_ARGSORT is this:

        case GGML_OP_ARGSORT:
#ifndef GGML_CUDA_USE_CUB
            return op->src[0]->ne[0] <= 1024;
#else   
            return true;
#endif

so I wonder if it has to do with n_expert_groups = 8 causing 8x this value?

@CISC
Collaborator Author

CISC commented Oct 29, 2025

This version of llama.cpp has a few patches applied (increasing the minimum offload size to 2560 tokens, custom speculative decoding code, etc.), but the crash seemed to happen as soon as it was given a very large prompt to process, so I doubt the custom speculative decoding code is at fault.

Provided this version has the recent ARGSORT changes (CPU fallback or CUB implementation) from master, I think this can only happen if nrows is larger than 64k, i.e. n_expert_groups * n_tokens.
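
To illustrate (a standalone CUDA sketch of my own, not the actual ggml-cuda argsort launch code): CUDA caps gridDim.y and gridDim.z at 65535, while gridDim.x goes up to 2^31-1, so a kernel that launches one block per row in the y dimension fails as soon as nrows passes that limit. With n_expert_groups = 8 and the 12288-token ubatch from the log above, nrows = 8 * 12288 = 98304.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    const int nrows = 8 * 12288; // n_expert_groups * n_tokens

    noop<<<dim3(1, nrows, 1), 1>>>(); // gridDim.y = 98304 > 65535 -> "invalid configuration argument"
    printf("rows in y: %s\n", cudaGetErrorString(cudaGetLastError()));

    noop<<<dim3(nrows, 1, 1), 1>>>(); // gridDim.x limit is 2^31-1, so this launch is accepted
    printf("rows in x: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaDeviceSynchronize();
    return 0;
}

This also matches the ubatch >= 8192 observation above: 8 * 8192 = 65536 is just past the limit.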

@CISC
Collaborator Author

CISC commented Oct 29, 2025

so wonder if it is to do with n_expert_groups = 8 causing 8x this value?

Ah, yes, that would be it.

@CISC
Collaborator Author

CISC commented Oct 29, 2025

I guess we need a check and fallback to CUB.

@jukofyork
Collaborator

so wonder if it is to do with n_expert_groups = 8 causing 8x this value?

Ah, yes, that would be it.

There is another test against 1024 here:

if (shared_mem > max_shared_mem || ncols > 1024) {

@CISC
Collaborator Author

CISC commented Oct 29, 2025

There is another test against 1024 here:

Sure, but that's ncols (used in x), which is not the issue; it's nrows (used in the 16-bit y grid dimension) that causes the error.

@CISC
Collaborator Author

CISC commented Oct 29, 2025

I think we can swap the parameters instead of falling back to CUB.

@CISC
Collaborator Author

CISC commented Oct 29, 2025

Yep, worked, making a PR.

@jukofyork
Collaborator

My issue was fixed by #16849.
