
[feat] Kimi K2/DeepSeek Support eagle3#35966

Closed
leihuang-sketch wants to merge 3 commits into vllm-project:main from novitalabs:feature/kimi-deepseek-eagle3-support

Conversation


@leihuang-sketch leihuang-sketch commented Mar 4, 2026

Hi from novita.ai team 👋

Purpose

Add Eagle3 speculative decoding support for Kimi K2/K2.5 and DeepSeek models to improve decode throughput.

Test Plan

Test Result

I just used the draft model

serve args:

vllm serve /to/path/Kimi-K2.5/ \
  --speculative_config '{"model": "/models/Kimi-K25-eagle3", "num_speculative_tokens": 4, "method": "eagle3", "draft_tensor_parallel_size": 8}' \
  --port 3333 \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice \
  --served-model-name moonshotai/kimi-k2.5 

Tested with GSM8K:

python tests/evals/gsm8k/gsm8k_eval.py \

From metrics, the acceptance rate is 51.78%.

curl http://127.0.0.1:3333/metrics | grep spec_decode
# HELP vllm:spec_decode_num_drafts_total Number of spec decoding drafts.
# TYPE vllm:spec_decode_num_drafts_total counter
vllm:spec_decode_num_drafts_total{engine="0",model_name="moonshotai/kimi-k2.5"} 42668.0
# HELP vllm:spec_decode_num_drafts_created Number of spec decoding drafts.
# TYPE vllm:spec_decode_num_drafts_created gauge
vllm:spec_decode_num_drafts_created{engine="0",model_name="moonshotai/kimi-k2.5"} 1.7730446216168053e+09
# HELP vllm:spec_decode_num_draft_tokens_total Number of draft tokens.
# TYPE vllm:spec_decode_num_draft_tokens_total counter
vllm:spec_decode_num_draft_tokens_total{engine="0",model_name="moonshotai/kimi-k2.5"} 170672.0
# HELP vllm:spec_decode_num_draft_tokens_created Number of draft tokens.
# TYPE vllm:spec_decode_num_draft_tokens_created gauge
vllm:spec_decode_num_draft_tokens_created{engine="0",model_name="moonshotai/kimi-k2.5"} 1.7730446216168284e+09
# HELP vllm:spec_decode_num_accepted_tokens_total Number of accepted tokens.
# TYPE vllm:spec_decode_num_accepted_tokens_total counter
vllm:spec_decode_num_accepted_tokens_total{engine="0",model_name="moonshotai/kimi-k2.5"} 88381.0
# HELP vllm:spec_decode_num_accepted_tokens_created Number of accepted tokens.
# TYPE vllm:spec_decode_num_accepted_tokens_created gauge
vllm:spec_decode_num_accepted_tokens_created{engine="0",model_name="moonshotai/kimi-k2.5"} 1.7730446216168435e+09
# HELP vllm:spec_decode_num_accepted_tokens_per_pos_total Accepted tokens per draft position.
# TYPE vllm:spec_decode_num_accepted_tokens_per_pos_total counter
vllm:spec_decode_num_accepted_tokens_per_pos_total{engine="0",model_name="moonshotai/kimi-k2.5",position="0"} 32894.0
vllm:spec_decode_num_accepted_tokens_per_pos_total{engine="0",model_name="moonshotai/kimi-k2.5",position="1"} 24858.0
vllm:spec_decode_num_accepted_tokens_per_pos_total{engine="0",model_name="moonshotai/kimi-k2.5",position="2"} 17916.0
vllm:spec_decode_num_accepted_tokens_per_pos_total{engine="0",model_name="moonshotai/kimi-k2.5",position="3"} 12713.0
# HELP vllm:spec_decode_num_accepted_tokens_per_pos_created Accepted tokens per draft position.
# TYPE vllm:spec_decode_num_accepted_tokens_per_pos_created gauge
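As a sanity check, the 51.78% figure can be reproduced from the counters in the metrics dump above (the constants below are copied from that dump; this is just illustrative arithmetic, not vLLM code):

```python
# Derive acceptance statistics from the vllm:spec_decode_* Prometheus
# counters quoted above (GSM8K run, num_speculative_tokens=4).
num_drafts = 42_668            # vllm:spec_decode_num_drafts_total
num_draft_tokens = 170_672     # vllm:spec_decode_num_draft_tokens_total
num_accepted_tokens = 88_381   # vllm:spec_decode_num_accepted_tokens_total

# Fraction of drafted tokens that the target model accepted.
acceptance_rate = num_accepted_tokens / num_draft_tokens
# Mean accepted draft tokens per verification step.
accepted_per_draft = num_accepted_tokens / num_drafts

print(f"acceptance rate:    {acceptance_rate:.2%}")   # 51.78%
print(f"accepted per draft: {accepted_per_draft:.2f}")
```

The per-position counters show the expected decay (32894 accepts at position 0 down to 12713 at position 3), consistent with an Eagle3 head whose accuracy drops at deeper draft positions.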

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify
Contributor

mergify Bot commented Mar 4, 2026

Hi @leihuang-sketch, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds Eagle3 speculative decoding support for DeepSeek V2 and Kimi K2.5 models. The changes involve implementing the SupportsEagle3 interface, adding logic to extract auxiliary hidden states from specified layers, and plumbing this through the Kimi model. I've found a couple of critical issues related to pipeline parallelism in the deepseek_v2.py implementation that need to be addressed. The proposed changes to the layer iteration and index calculation should resolve these issues.

Comment thread vllm/model_executor/models/deepseek_v2.py Outdated
Comment thread vllm/model_executor/models/deepseek_v2.py Outdated
@benchislett
Collaborator

Way over-commented. Please trim down the unnecessary comments.

See also #36063 where I am refactoring how we do this, which should cut down on the amount of boilerplate needed to enable support here.

Simplify docstrings and remove redundant comments that duplicate
what the code already expresses. Keep only essential technical notes
that explain non-obvious implementation details.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@leihuang-sketch
Author

@benchislett Thanks for the feedback! I've trimmed down the comments significantly.

#36063 looks great – looking forward to seeing it merged so I can refactor accordingly.

@oreo-wjx

oreo-wjx commented Mar 10, 2026

Did you notice the accept rate you measured is way lower than the reported 2.746/3? @leihuang-sketch

I personally ran some tests, using num_speculative_tokens=3 for example. The accept rate seems reasonable (~85%) when I use Kimi K2 + the K2 eagle3 draft model, but once I change the checkpoints to K2.5 + the K2.5 eagle3 draft model, it drops to ~50%. There shouldn't be any difference here, still wondering...

cc @benchislett

@oreo-wjx

The accept rate on sglang is correct though, since the released K2.5 eagle3 draft model is measured on sglang.

@benchislett
Copy link
Copy Markdown
Collaborator

@leihuang-sketch can you try to reproduce the issue seen by @oreo-wjx?

@benchislett
Collaborator

See also #36361

@benchislett benchislett mentioned this pull request Mar 10, 2026
5 tasks
@benchislett
Collaborator

Closing in favor of #36361. I was able to get both a custom EAGLE3 head as well as AQ-MedAI/Kimi-K25-eagle3 working.

However, https://huggingface.co/lightseekorg/kimi-k2.5-eagle3 did not work. I will assume there is a configuration issue with that model and move forward with the other PR.

@oreo-wjx

@leihuang-sketch can you try to reproduce the issue seen by @oreo-wjx?

I think I've figured it out. SGLang's reported number looks higher because its acc_len includes the bonus token. So the actual accept rate for this Eagle3 draft model is ~1.746/3 on GSM8K.
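The bonus-token accounting above can be checked with quick arithmetic (2.746 is the SGLang-reported acc_len mentioned earlier in the thread; num_speculative_tokens=3 as in those tests):

```python
# SGLang's acc_len counts the bonus token emitted during the target model's
# verification pass, so subtract 1 to get accepted *draft* tokens per step.
sglang_acc_len = 2.746
num_speculative_tokens = 3

accepted_draft_tokens = sglang_acc_len - 1          # 1.746
acceptance_rate = accepted_draft_tokens / num_speculative_tokens

print(f"accepted draft tokens per step: {accepted_draft_tokens:.3f}")
print(f"per-token acceptance rate:      {acceptance_rate:.1%}")  # 58.2%
```

That ~58% is in the same ballpark as the ~50-52% measured with vLLM in this thread, which supports the explanation that the two frameworks report consistent behavior under different accounting, rather than a bug in this PR.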

@leihuang-sketch leihuang-sketch deleted the feature/kimi-deepseek-eagle3-support branch March 16, 2026 09:41

Labels

deepseek Related to DeepSeek models speculative-decoding v1
