
[feat] Kimi K2/DeepSeek Support eagle3#35966

Closed
leihuang-sketch wants to merge 3 commits into vllm-project:main from novitalabs:feature/kimi-deepseek-eagle3-support

Conversation


@leihuang-sketch leihuang-sketch commented Mar 4, 2026

Hi from novita.ai team 👋

Purpose

Add Eagle3 speculative decoding support for Kimi K2/K2.5 and DeepSeek models to improve decode throughput.

Test Plan

Test Result

I just used the draft model

serve args:

vllm serve /to/path/Kimi-K2.5/ \
  --speculative_config '{"model": "/models/Kimi-K25-eagle3", "num_speculative_tokens": 4, "method": "eagle3", "draft_tensor_parallel_size": 8}' \
  --port 3333 \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice \
  --served-model-name moonshotai/kimi-k2.5 

Tested with GSM8K:

python tests/evals/gsm8k/gsm8k_eval.py \

From metrics, the acceptance rate is 51.78%.

curl http://127.0.0.1:3333/metrics | grep spec_decode
# HELP vllm:spec_decode_num_drafts_total Number of spec decoding drafts.
# TYPE vllm:spec_decode_num_drafts_total counter
vllm:spec_decode_num_drafts_total{engine="0",model_name="moonshotai/kimi-k2.5"} 42668.0
# HELP vllm:spec_decode_num_drafts_created Number of spec decoding drafts.
# TYPE vllm:spec_decode_num_drafts_created gauge
vllm:spec_decode_num_drafts_created{engine="0",model_name="moonshotai/kimi-k2.5"} 1.7730446216168053e+09
# HELP vllm:spec_decode_num_draft_tokens_total Number of draft tokens.
# TYPE vllm:spec_decode_num_draft_tokens_total counter
vllm:spec_decode_num_draft_tokens_total{engine="0",model_name="moonshotai/kimi-k2.5"} 170672.0
# HELP vllm:spec_decode_num_draft_tokens_created Number of draft tokens.
# TYPE vllm:spec_decode_num_draft_tokens_created gauge
vllm:spec_decode_num_draft_tokens_created{engine="0",model_name="moonshotai/kimi-k2.5"} 1.7730446216168284e+09
# HELP vllm:spec_decode_num_accepted_tokens_total Number of accepted tokens.
# TYPE vllm:spec_decode_num_accepted_tokens_total counter
vllm:spec_decode_num_accepted_tokens_total{engine="0",model_name="moonshotai/kimi-k2.5"} 88381.0
# HELP vllm:spec_decode_num_accepted_tokens_created Number of accepted tokens.
# TYPE vllm:spec_decode_num_accepted_tokens_created gauge
vllm:spec_decode_num_accepted_tokens_created{engine="0",model_name="moonshotai/kimi-k2.5"} 1.7730446216168435e+09
# HELP vllm:spec_decode_num_accepted_tokens_per_pos_total Accepted tokens per draft position.
# TYPE vllm:spec_decode_num_accepted_tokens_per_pos_total counter
vllm:spec_decode_num_accepted_tokens_per_pos_total{engine="0",model_name="moonshotai/kimi-k2.5",position="0"} 32894.0
vllm:spec_decode_num_accepted_tokens_per_pos_total{engine="0",model_name="moonshotai/kimi-k2.5",position="1"} 24858.0
vllm:spec_decode_num_accepted_tokens_per_pos_total{engine="0",model_name="moonshotai/kimi-k2.5",position="2"} 17916.0
vllm:spec_decode_num_accepted_tokens_per_pos_total{engine="0",model_name="moonshotai/kimi-k2.5",position="3"} 12713.0
# HELP vllm:spec_decode_num_accepted_tokens_per_pos_created Accepted tokens per draft position.
# TYPE vllm:spec_decode_num_accepted_tokens_per_pos_created gauge
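As a sanity check, the 51.78% figure can be reproduced from the counters in the metrics dump above (the constants below are copied from that dump; this is just illustrative arithmetic, not vLLM code):

```python
# Derive acceptance statistics from the vllm:spec_decode_* Prometheus
# counters quoted above (GSM8K run, num_speculative_tokens=4).
num_drafts = 42_668            # vllm:spec_decode_num_drafts_total
num_draft_tokens = 170_672     # vllm:spec_decode_num_draft_tokens_total
num_accepted_tokens = 88_381   # vllm:spec_decode_num_accepted_tokens_total

# Fraction of drafted tokens that the target model accepted.
acceptance_rate = num_accepted_tokens / num_draft_tokens
# Mean accepted draft tokens per verification step.
accepted_per_draft = num_accepted_tokens / num_drafts

print(f"acceptance rate:    {acceptance_rate:.2%}")   # 51.78%
print(f"accepted per draft: {accepted_per_draft:.2f}")
```

The per-position counters show the expected decay (32894 accepts at position 0 down to 12713 at position 3), consistent with an Eagle3 head whose accuracy drops at deeper draft positions.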

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify
Contributor

mergify Bot commented Mar 4, 2026

Hi @leihuang-sketch, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds Eagle3 speculative decoding support for DeepSeek V2 and Kimi K2.5 models. The changes involve implementing the SupportsEagle3 interface, adding logic to extract auxiliary hidden states from specified layers, and plumbing this through the Kimi model. I've found a couple of critical issues related to pipeline parallelism in the deepseek_v2.py implementation that need to be addressed. The proposed changes to the layer iteration and index calculation should resolve these issues.

Comment thread vllm/model_executor/models/deepseek_v2.py Outdated
Comment thread vllm/model_executor/models/deepseek_v2.py Outdated
@benchislett
Collaborator

Way over-commented. Please trim down the unnecessary comments.

See also #36063 where I am refactoring how we do this, which should cut down on the amount of boilerplate needed to enable support here.

Simplify docstrings and remove redundant comments that duplicate
what the code already expresses. Keep only essential technical notes
that explain non-obvious implementation details.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@leihuang-sketch
Author

@benchislett Thanks for the feedback! I've trimmed down the comments significantly.

#36063 looks great – looking forward to seeing it merged so I can refactor accordingly.

@oreo-wjx

oreo-wjx commented Mar 10, 2026

Did you notice the accept rate you measured is way lower than the reported 2.746/3? @leihuang-sketch

I personally ran some tests, using num_speculative_tokens=3 for example. The accept rate seems reasonable (~85%) when I use Kimi K2 + the K2 eagle3 draft model, but once I change the checkpoints to K2.5 + the K2.5 eagle3 draft model, it drops to ~50%. There shouldn't be any difference here, still wondering...

cc @benchislett

@oreo-wjx

The accept rate on sglang is correct though, since the released K2.5 eagle3 draft model is measured on sglang.

@benchislett
Copy link
Copy Markdown
Collaborator

@leihuang-sketch can you try to reproduce the issue seen by @oreo-wjx?

@benchislett
Collaborator

See also #36361

@benchislett benchislett mentioned this pull request Mar 10, 2026
5 tasks
@benchislett
Collaborator

Closing in favor of #36361. I was able to get both a custom EAGLE3 head as well as AQ-MedAI/Kimi-K25-eagle3 working.

However, https://huggingface.co/lightseekorg/kimi-k2.5-eagle3 did not work. I will assume there is a configuration issue with that model and move forward with the other PR.

@oreo-wjx

@leihuang-sketch can you try to reproduce the issue seen by @oreo-wjx?

I think I've figured it out. SGLang's reported number looks higher because its acc_len includes the bonus token. So the actual accept rate for this Eagle3 draft model is ~1.746/3 on GSM8K.
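The bonus-token accounting above can be checked with quick arithmetic (2.746 is the SGLang-reported acc_len mentioned earlier in the thread; num_speculative_tokens=3 as in those tests):

```python
# SGLang's acc_len counts the bonus token emitted during the target model's
# verification pass, so subtract 1 to get accepted *draft* tokens per step.
sglang_acc_len = 2.746
num_speculative_tokens = 3

accepted_draft_tokens = sglang_acc_len - 1          # 1.746
acceptance_rate = accepted_draft_tokens / num_speculative_tokens

print(f"accepted draft tokens per step: {accepted_draft_tokens:.3f}")
print(f"per-token acceptance rate:      {acceptance_rate:.1%}")  # 58.2%
```

That ~58% is in the same ballpark as the ~50-52% measured with vLLM in this thread, which supports the explanation that the two frameworks report consistent behavior under different accounting, rather than a bug in this PR.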

@leihuang-sketch leihuang-sketch deleted the feature/kimi-deepseek-eagle3-support branch March 16, 2026 09:41

Labels

deepseek Related to DeepSeek models speculative-decoding v1
