Kimi k2.5 MLA based eagle3 #36361
Conversation
Code Review
This pull request adds support for Eagle3 speculative decoding for Deepseek and Kimi models. The changes include a new deepseek_eagle3.py model implementation and modifications to deepseek_v2.py to support auxiliary hidden-state extraction. The configuration and model registry are also updated. My review focuses on a logical issue in the new deepseek_eagle3.py file: a condition is always false, leading to dead code and a bypassed assertion. I've provided a suggestion to fix this.
Note: Security Review did not run due to the size of the PR.
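For readers without the diff in front of them, a minimal hypothetical sketch of the bug class described above (all names below are invented for illustration, not the PR's actual code):

```python
# Hypothetical illustration of the flagged bug class; names are invented.
class Eagle3DraftModel:
    def __init__(self, config):
        self.config = config
        # Suppose the constructor pins the draft depth to a single layer:
        self.config.num_hidden_layers = 1

    def load_weights(self, num_checkpoint_layers: int) -> None:
        # Bug pattern: num_hidden_layers was just pinned to 1 above, so
        # this condition is always False -- the branch is dead code and
        # the checkpoint-shape assertion inside it is never exercised.
        if self.config.num_hidden_layers > 1:
            assert num_checkpoint_layers == self.config.num_hidden_layers, (
                "checkpoint layer count does not match the draft config"
            )
```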
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Force-pushed from 6be5049 to 9be34d6
See #35966: why does this PR need a new model implementation and the other one doesn't?
I was able to get this working locally. Seems fine, but https://huggingface.co/lightseekorg/kimi-k2.5-eagle3 does not run. Two other EAGLE heads, one internal and one public, work fine. Assuming there's an unrelated config issue with the broken one.
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Head branch was pushed to by a user without write access
The failed tests seem unrelated to our change, @benchislett
Will rerun the failed tests a couple of times and update from main once again if that doesn't work. Failing that, we'll force-merge tomorrow.
@jhaotingc In my tests, I found there was no improvement in performance; throughput remained almost the same as without the additional Eagle3 parameters. Could it be because the outputs on the gsm8k dataset are too short?
LGTM |
Signed-off-by: Izzy Putterman <iputterman@nvidia.com> Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com> Co-authored-by: Izzy Putterman <iputterman@nvidia.com>
Purpose
@IzzyPutterman is the original author.
This adds support for Eagle3 draft models that share MLA (rather than GQA) attention with the target model, so one can train Eagle3 heads for Kimi and Deepseek and use them across TRTLLM, SGL, and vLLM.
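For context, a minimal sketch of how such an Eagle3 head would typically be wired up through vLLM's speculative_config (the model paths are placeholders, and the config keys should be verified against the installed vLLM version):

```python
from vllm import LLM, SamplingParams

# Sketch only: model paths are placeholders; speculative_config keys
# follow vLLM's Eagle3 convention and may differ across versions.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",           # MLA-based target model
    speculative_config={
        "method": "eagle3",                    # Eagle3-style drafting
        "model": "path/to/eagle3-draft-head",  # placeholder draft head
        "num_speculative_tokens": 3,
    },
)
out = llm.generate(["Explain MLA attention briefly."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```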
Test Plan
Accuracy benchmark:
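The exact command isn't captured in this extract; a hedged sketch of a gsm8k accuracy run via lm-evaluation-harness (the model path and arguments are assumptions) could look like:

```python
import lm_eval

# Sketch only: model path and arguments are placeholders. The same run
# would be repeated with the Eagle3 head enabled to compare accuracy.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8",
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```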
Test Result
Accuracy without Eagle3, accuracy with Eagle3, and acceptance-length numbers (attached in the PR; not captured in this extract).
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.