
[Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels.#33858

Merged
simon-mo merged 1 commit intovllm-project:mainfrom
pavanimajety:fix-kimi-k2p5
Feb 5, 2026

Conversation

@pavanimajety pavanimajety (Collaborator) commented Feb 5, 2026

Purpose

This PR fixes a bug introduced in PR #33174, which set n_group and topk_group to None when they were (1, 1). While this change fixes Kimi-K2, it may introduce an error with Mistral. @dbari, please confirm whether this fix is correct or whether the values need to be passed differently.

The Marlin path works because it doesn't have a monolithic kernel for routing + MoE, unlike the INT4 TRTLLM MoE kernels.

Test Plan

GSM8K accuracy before and after this change.

Test Result

Main Kimi-K2-Thinking (Buggy)

Marlin:
Accuracy: 0.914
Invalid responses: 0.000
Total latency: 78.938 s
Questions per second: 16.709
Total output tokens: 132921
Output tokens per second: 1683.870

Flashinfer
Accuracy: 0.299
Invalid responses: 0.008
Total latency: 81.379 s
Questions per second: 16.208
Total output tokens: 131721
Output tokens per second: 1618.620

With this PR (Fixed)

Marlin: 
Results:
Accuracy: 0.909
Invalid responses: 0.001
Total latency: 81.228 s
Questions per second: 16.238
Total output tokens: 134196
Output tokens per second: 1652.097

Flashinfer:
Results:
Accuracy: 0.917
Invalid responses: 0.000
Total latency: 78.991 s
Questions per second: 16.698
Total output tokens: 130950
Output tokens per second: 1657.787

Kimi-K2.5 + Flashinfer

Accuracy: 0.945
Invalid responses: 0.000
Total latency: 77.760 s
Questions per second: 16.962
Total output tokens: 131352
Output tokens per second: 1689.195

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Pavani Majety <pmajety@nvidia.com>
@mergify mergify bot added the `deepseek` (Related to DeepSeek models) and `bug` (Something isn't working) labels Feb 5, 2026
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request addresses a bug in the DeepseekV2MoE layer where grouped_topk routing was incorrectly disabled for the specific case of (n_group, topk_group) == (1, 1). This caused issues for models like Kimi-K2 that rely on this configuration, particularly when an e_score_correction_bias is used, as the fallback routing mechanism did not account for it. The fix removes this special condition, ensuring that GroupedTopKRouter is consistently used, which correctly handles all configurations, including the (1, 1) case. The resulting code is cleaner and more robust. The significant improvement in accuracy demonstrated in the test results validates this change. The concern regarding Mistral appears to be related to model configurations rather than a direct issue with this code modification, as Mistral models are handled by a separate implementation.
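As a rough illustration of why the (1, 1) case still needs the grouped path, here is a minimal NumPy sketch of DeepSeek-style grouped top-k with a correction bias. The function name and scoring details (top-2-per-group scoring) are assumptions and do not mirror vLLM's kernel:

```python
import numpy as np

def grouped_topk(scores, bias, n_group, topk_group, top_k):
    """Minimal grouped top-k sketch (DeepSeek-V3 style), assuming the
    correction bias is applied for expert selection only."""
    n_experts = scores.shape[-1]
    biased = scores + bias
    # Score each group by the sum of its top-2 biased expert scores.
    groups = biased.reshape(n_group, n_experts // n_group)
    group_scores = np.sort(groups, axis=-1)[:, -2:].sum(axis=-1)
    # Keep only the topk_group best groups; mask the rest to -inf.
    keep = np.argsort(group_scores)[-topk_group:]
    mask = np.full(n_group, -np.inf)
    mask[keep] = 0.0
    masked = (groups + mask[:, None]).reshape(-1)
    # Pick the top_k experts among the surviving groups.
    return np.sort(np.argsort(masked)[-top_k:])
```

With n_group=1 and topk_group=1 no experts are masked, but the bias still changes which experts win, so a plain top-k fallback that drops the bias can select a different expert set — consistent with the accuracy regression in the results above.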

@pavanimajety pavanimajety requested a review from ywang96 February 5, 2026 05:46
@zhewenl zhewenl (Collaborator) commented Feb 5, 2026

verified AIME and GSM8K passed, thanks for the fix!

lm_eval --model local-completions \
  --model_args "model=moonshotai/Kimi-K2.5,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=64,timeout=5000,max_length=131072" \
  --tasks gsm8k \
  --num_fewshot 5
  
  |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9416|±  |0.0065|
|     |       |strict-match    |     5|exact_match|↑  |0.9409|±  |0.0065|
lm_eval --model local-chat-completions \
  --model_args "model=moonshotai/Kimi-K2.5,base_url=http://0.0.0.0:8000/v1/chat/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=20,timeout=5000,max_length=72768" \
  --tasks aime25 \
  --apply_chat_template \
  --gen_kwargs '{"temperature":1.0,"max_gen_toks":72768,"top_p":0.95,"chat_template_kwargs":{"thinking":true}}' \
  --log_samples \
  --output_path "aime25_ds32"
|Tasks |Version|Filter|n-shot|  Metric   |   |Value|   |Stderr|
|------|------:|------|-----:|-----------|---|----:|---|-----:|
|aime25|      0|none  |     0|exact_match|↑  |  0.9|±  |0.0557|

@zhewenl zhewenl added the `ready` label Feb 5, 2026
@simon-mo simon-mo enabled auto-merge (squash) February 5, 2026 07:46
@simon-mo simon-mo merged commit d2f4a71 into vllm-project:main Feb 5, 2026
57 of 58 checks passed
@dbari dbari mentioned this pull request Feb 5, 2026
koush pushed a commit to koush/vllm that referenced this pull request Feb 5, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026