
[Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels.#33858

Merged
simon-mo merged 1 commit intovllm-project:mainfrom
pavanimajety:fix-kimi-k2p5
Feb 5, 2026

Conversation

@pavanimajety pavanimajety (Collaborator) commented Feb 5, 2026

Purpose

This PR fixes a bug introduced in PR #33174, which set n_group and topk_group to None when they were (1, 1). While this change fixes Kimi-K2, it may introduce an error with Mistral. @dbari, please confirm whether this fix is correct or whether the values need to be passed differently.

The Marlin path works because it doesn't have a monolithic kernel for routing + MoE, unlike the INT4 TRTLLM MoE kernels.

Test Plan

GSM8K accuracy before and after this change.

Test Result

Main Kimi-K2-Thinking (Buggy)

Marlin:
Accuracy: 0.914
Invalid responses: 0.000
Total latency: 78.938 s
Questions per second: 16.709
Total output tokens: 132921
Output tokens per second: 1683.870

Flashinfer
Accuracy: 0.299
Invalid responses: 0.008
Total latency: 81.379 s
Questions per second: 16.208
Total output tokens: 131721
Output tokens per second: 1618.620

With this PR (Fixed)

Marlin: 
Results:
Accuracy: 0.909
Invalid responses: 0.001
Total latency: 81.228 s
Questions per second: 16.238
Total output tokens: 134196
Output tokens per second: 1652.097

Flashinfer:
Results:
Accuracy: 0.917
Invalid responses: 0.000
Total latency: 78.991 s
Questions per second: 16.698
Total output tokens: 130950
Output tokens per second: 1657.787

Kimi-K2.5 + Flashinfer

Accuracy: 0.945
Invalid responses: 0.000
Total latency: 77.760 s
Questions per second: 16.962
Total output tokens: 131352
Output tokens per second: 1689.195

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Pavani Majety <pmajety@nvidia.com>
@mergify mergify bot added the `deepseek` (Related to DeepSeek models) and `bug` (Something isn't working) labels Feb 5, 2026
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request addresses a bug in the DeepseekV2MoE layer where grouped_topk routing was incorrectly disabled for the specific case of (n_group, topk_group) == (1, 1). This caused issues for models like Kimi-K2 that rely on this configuration, particularly when an e_score_correction_bias is used, as the fallback routing mechanism did not account for it. The fix removes this special condition, ensuring that GroupedTopKRouter is consistently used, which correctly handles all configurations, including the (1, 1) case. The resulting code is cleaner and more robust. The significant improvement in accuracy demonstrated in the test results validates this change. The concern regarding Mistral appears to be related to model configurations rather than a direct issue with this code modification, as Mistral models are handled by a separate implementation.
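As a rough illustration of why the (1, 1) case still needs the grouped path, here is a minimal NumPy sketch of DeepSeek-style grouped top-k with a correction bias. The function name and scoring details (top-2-per-group scoring) are assumptions and do not mirror vLLM's kernel:

```python
import numpy as np

def grouped_topk(scores, bias, n_group, topk_group, top_k):
    """Minimal grouped top-k sketch (DeepSeek-V3 style), assuming the
    correction bias is applied for expert selection only."""
    n_experts = scores.shape[-1]
    biased = scores + bias
    # Score each group by the sum of its top-2 biased expert scores.
    groups = biased.reshape(n_group, n_experts // n_group)
    group_scores = np.sort(groups, axis=-1)[:, -2:].sum(axis=-1)
    # Keep only the topk_group best groups; mask the rest to -inf.
    keep = np.argsort(group_scores)[-topk_group:]
    mask = np.full(n_group, -np.inf)
    mask[keep] = 0.0
    masked = (groups + mask[:, None]).reshape(-1)
    # Pick the top_k experts among the surviving groups.
    return np.sort(np.argsort(masked)[-top_k:])
```

With n_group=1 and topk_group=1 no experts are masked, but the bias still changes which experts win, so a plain top-k fallback that drops the bias can select a different expert set — consistent with the accuracy regression in the results above.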

@pavanimajety pavanimajety requested a review from ywang96 February 5, 2026 05:46
@zhewenl zhewenl (Collaborator) commented Feb 5, 2026

verified AIME and GSM8K passed, thanks for the fix!

lm_eval --model local-completions \
  --model_args "model=moonshotai/Kimi-K2.5,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=64,timeout=5000,max_length=131072" \
  --tasks gsm8k \
  --num_fewshot 5
  
  |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9416|±  |0.0065|
|     |       |strict-match    |     5|exact_match|↑  |0.9409|±  |0.0065|
lm_eval --model local-chat-completions \
  --model_args "model=moonshotai/Kimi-K2.5,base_url=http://0.0.0.0:8000/v1/chat/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=20,timeout=5000,max_length=72768" \
  --tasks aime25 \
  --apply_chat_template \
  --gen_kwargs '{"temperature":1.0,"max_gen_toks":72768,"top_p":0.95,"chat_template_kwargs":{"thinking":true}}' \
  --log_samples \
  --output_path "aime25_ds32"
|Tasks |Version|Filter|n-shot|  Metric   |   |Value|   |Stderr|
|------|------:|------|-----:|-----------|---|----:|---|-----:|
|aime25|      0|none  |     0|exact_match|↑  |  0.9|±  |0.0557|

@zhewenl zhewenl added the `ready` label Feb 5, 2026
@simon-mo simon-mo enabled auto-merge (squash) February 5, 2026 07:46
@simon-mo simon-mo merged commit d2f4a71 into vllm-project:main Feb 5, 2026
57 of 58 checks passed
@dbari dbari mentioned this pull request Feb 5, 2026
koush pushed a commit to koush/vllm that referenced this pull request Feb 5, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026