
Kimi k2.5 MLA based eagle3#36361

Merged
benchislett merged 6 commits into vllm-project:main from jhaotingc:kimi_k2_eagle3 on Mar 11, 2026

Conversation

@jhaotingc (Contributor) commented Mar 7, 2026

Purpose

@IzzyPutterman is the original author.

This adds support for Eagle3 draft models that share MLA instead of GQA for attention, so one can train Eagle3 heads for Kimi and Deepseek and use them across TRTLLM, SGL, and vLLM.

Test Plan

Acc benchmark:

lm_eval \
  --model local-completions \
  --model_args base_url=http://my_server:8001/v1/completions,model=/trt_llm_ci/data/llm-models/Kimi-K2.5-NVFP4,num_concurrent=16,tokenized_requests=False,trust_remote_code=True \
  --tasks gsm8k \
  --batch_size 16
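The lm_eval command above assumes a vLLM server is already listening on port 8001 with the Eagle3 draft head enabled. A minimal launch sketch (the draft-head path is a placeholder, and the exact speculative-config keys follow vLLM's JSON `--speculative-config` convention; adjust for your build):

```shell
# Hypothetical draft-head path; substitute your own Eagle3 checkpoint.
# num_speculative_tokens=3 matches the three per-position accept rates
# reported in the results below.
vllm serve /trt_llm_ci/data/llm-models/Kimi-K2.5-NVFP4 \
  --port 8001 \
  --trust-remote-code \
  --speculative-config '{"method": "eagle3", "model": "<path-to-eagle3-draft-head>", "num_speculative_tokens": 3}'
```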

Test Result

without Eagle3

local-completions ({'base_url': 'http://my_server:8001/v1/completions', 'model': '/trt_llm_ci/data/llm-models/Kimi-K2.5-NVFP4', 'num_concurrent': 16, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 16
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9295|±  |0.0071|
|     |       |strict-match    |     5|exact_match|↑  |0.9295|±  |0.0071|


with Eagle3

local-completions ({'base_url': 'http://my_server:8001/v1/completions', 'model': '/trt_llm_ci/data/llm-models/Kimi-K2.5-NVFP4', 'num_concurrent': 16, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 16
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9265|±  |0.0072|
|     |       |strict-match    |     5|exact_match|↑  |0.9257|±  |0.0072|


Acceptance:

  --- Weighted averages (by accepted tokens) ---
  Mean acceptance length:   2.785
  Avg draft accept rate:    59.5%
  Per-position accept rate: [0.826, 0.594, 0.365]
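As a sanity check on these numbers: if the per-position figures are marginal accept rates (the probability that draft position i is accepted, averaged over requests), then the mean acceptance length is one token from the target model's own verification step plus the expected number of accepted draft tokens. A small sketch of that relation (an interpretation of the printout above, not vLLM's actual accounting code):

```python
# Per-position draft accept rates from the benchmark printout above.
per_position_accept = [0.826, 0.594, 0.365]

# One token is always produced by the target model each step; each
# accepted draft token adds one more, so the expected (mean) acceptance
# length is 1 plus the sum of the marginal accept rates.
mean_acceptance_length = 1 + sum(per_position_accept)

print(round(mean_acceptance_length, 3))  # 2.785, matching the reported value
```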


@mergify mergify Bot added deepseek Related to DeepSeek models new-model Requests to new models speculative-decoding v1 labels Mar 7, 2026
mergify Bot (Contributor) commented Mar 7, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jhaotingc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 7, 2026

@gemini-code-assist (Bot) left a comment


Code Review

This pull request adds support for Eagle3 speculative decoding for Deepseek and Kimi models. The changes include a new deepseek_eagle3.py model implementation, and modifications to deepseek_v2.py to support auxiliary hidden state extraction. The configuration and model registry are also updated. My review focuses on a logical issue in the new deepseek_eagle3.py file where a condition is always false, leading to dead code and a bypassed assertion. I've provided a suggestion to fix this.

Note: Security Review did not run due to the size of the PR.

Comment thread vllm/model_executor/models/deepseek_eagle3.py
Izzy Putterman and others added 4 commits March 10, 2026 09:17
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
@benchislett (Collaborator) commented:

See #35966: why does this PR need a new model implementation and the other one doesn't?

@IzzyPutterman (Contributor) commented:

> See #35966: why does this PR need a new model implementation and the other one doesn't?

This one implements the MLA-based eagle3, not GQA like the other PR.
This is a rebase of my ancient PR: #30574

Comment thread vllm/model_executor/models/deepseek_eagle3.py
@jhaotingc jhaotingc changed the title Kimi k25 eagle3 Kimi k2.5 MLA based eagle3 Mar 10, 2026
@benchislett (Collaborator) commented:

I was able to get this working locally. Seems fine but https://huggingface.co/lightseekorg/kimi-k2.5-eagle3 does not run. Two other EAGLE heads, one internal and one public, work fine. Assuming there's an unrelated config issue with the broken one.

@benchislett benchislett added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 10, 2026
@benchislett benchislett enabled auto-merge (squash) March 10, 2026 21:15
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
auto-merge was automatically disabled March 10, 2026 23:38

Head branch was pushed to by a user without write access

@benchislett benchislett enabled auto-merge (squash) March 10, 2026 23:40
@jhaotingc (Contributor, PR author) commented:

The failed tests seem unrelated to our change, @benchislett.

@benchislett (Collaborator) commented:

Will rerun the failed tests a couple of times, and update from main once again if that doesn't work. Failing that, we'll force-merge tomorrow.

@leihuang-sketch commented:

@jhaotingc In my tests I found no performance improvement: throughput remained almost the same as without the eagle3 parameters. Could it be because the outputs for the gsm8k dataset are too short?

@leihuang-sketch commented:

LGTM

@benchislett benchislett merged commit 5573894 into vllm-project:main Mar 11, 2026
55 checks passed
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

Labels

deepseek (Related to DeepSeek models), new-model (Requests to new models), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding, v1


4 participants