[feat] add small vocab table for eagle's draft model[1].#3822
zhaochenyang20 merged 21 commits into sgl-project:main
Conversation
|
Great PR. I will ask our speculative decoding team to review this! |
Thanks! |
I've asked Weilin, the author of the paper you implemented. He will take a look. |
ok. But I don't know which paper you mentioned. Can you provide a link? |
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling https://arxiv.org/abs/2502.14856. I think we share the same idea. |
Wow! I will look at this. |
This operation is time-consuming when running at each iteration.
|
I used the top 32k frequency tokens from FR-Spec and ran experiments with your code.
Speed: it seems that the dynamic update logic has some negative effect. It requires extra time to maintain the frequencies and needs to re-copy the lm_head's weights.
Correctness: I checked the generated output of eagle_original, eagle_static and eagle_dynamic; they are the same. |
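To make the trade-off above concrete, here is a minimal sketch of the dynamic frequency-update idea under discussion: accepted-token frequencies are tracked and the reduced vocabulary (token map) is periodically rebuilt. All names here (`DynamicTokenMap`, `observe`, `update_interval`) are hypothetical and are not the PR's actual API; the point is that every rebuild implies re-copying lm_head rows, which is the memory-bound cost noted in this thread.

```python
from collections import Counter


class DynamicTokenMap:
    """Hypothetical sketch of dynamic token-map maintenance (not the PR's API)."""

    def __init__(self, initial_map, update_interval=1000):
        self.token_map = list(initial_map)  # hot full-vocab token ids
        self.freq = Counter(initial_map)    # seed counts so initial tokens stay hot
        self.steps = 0
        self.update_interval = update_interval

    def observe(self, accepted_token_ids):
        # Count tokens accepted by the target model at each iteration.
        self.freq.update(accepted_token_ids)
        self.steps += 1
        if self.steps % self.update_interval == 0:
            self._rebuild()

    def _rebuild(self):
        # Keep the k most frequent tokens. In a real implementation this is
        # where the corresponding lm_head weight rows would be re-copied,
        # which is the memory-bound overhead discussed above.
        k = len(self.token_map)
        self.token_map = [t for t, _ in self.freq.most_common(k)]
```

A static token map avoids `_rebuild` entirely, which is why splitting the static and dynamic variants into separate PRs (as proposed below in the thread) is attractive.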
|
@Zhou-sx, I think your dynamic logic would benefit certain scenarios. Would you consider formalizing your dynamic frequency updating logic as a configurable server argument? This would allow users to toggle it on/off based on their needs while inviting the team to contribute different dynamic strategies in the future. |
The top 32k frequency token is uploaded at link. |
Good idea. I agree with what you said. I will test the performance your way after a while. |
|
I did not add dynamic updates at first, but in some test cases the performance was worse than the baseline. In fact, when speculative_token_map_num_dynamic_token is set to 0, it is disabled |
Hi @Achazwl, you are right, the current dynamic update is not well optimized (lm_head is memory-bound, and copying weights hurts performance). The best way is to create a CUDA kernel that does the lm_head GEMM with zero-copy. @Zhou-sx, we can have a talk offline. |
That's a good idea. It seems that this PR can focus on the static smaller lm head, leaving the dynamic part to another PR. |
|
Yes, I think it should be divided into two PRs, as the dynamic method has some extra work to be done. |
A GEMM with A, X, and the corresponding hot indices? |
|
Please fix the documentation of speculative decoding: |
working on it |
@Zhou-sx I created a PR to your branch that adds a method to download the token map from Hugging Face, so that it can pass the docs CI. |
support download token-map from hf and fix docs
Ok. Thank you for your work. |
|
I will merge it today and fix lint on my own side. |
|
Hold on until after #3986 |
|
Nice work!! |
…#3822) Co-authored-by: Achazwl <323163497@qq.com> Co-authored-by: Chayenne <zhaochen20@outlook.com>
Motivation
In speculative decoding, while the draft model's decode layers are drastically reduced, the computational bottleneck shifts to its lm_head (whose cost is proportional to vocabulary size). In our case, the profile shows that one draft-model decode step takes approximately 1.8 ms, with the lm_head occupying 860 us, about half of the draft model's time. Because the target model's verification guarantees correctness, the draft model can safely use a truncated vocabulary that retains only high-frequency tokens. This optimization reduces the lm_head overhead, further accelerating the Eagle speculative decoding pipeline without compromising quality.
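The mechanism above can be sketched in a few lines: the draft lm_head keeps only the weight rows of the hot tokens, and the draft's argmax over the small vocabulary is mapped back to full-vocabulary token ids before verification. The shapes and the `token_map` contents below are toy values for illustration, not the actual sglang implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
full_vocab, hidden = 100, 8                # toy sizes for illustration
token_map = np.array([3, 7, 11, 42, 50])   # hypothetical hot-token ids

W_full = rng.standard_normal((full_vocab, hidden))
W_small = W_full[token_map]                # draft lm_head keeps only hot rows

h = rng.standard_normal(hidden)            # a draft hidden state
small_logits = W_small @ h                 # cost scales with len(token_map), not full_vocab
draft_id = token_map[int(np.argmax(small_logits))]  # map back to a full-vocab id

# The target model still verifies draft_id over the full vocabulary, so a
# token missing from token_map can only lower acceptance, never correctness.
assert 0 <= draft_id < full_vocab
```

With a 32k map on a 128k-vocabulary model, the draft lm_head GEMM shrinks by roughly 4x, which is where the speedup reported below comes from.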
Todo: to further enhance efficiency, I propose dynamically updating the reduced vocabulary (token_map) during inference, so that the draft model adapts to shifting token distributions in real-time workloads.
Modifications
Add speculative-token-map in ServerArgs.
Change the lm_head part in the LlamaEagle model.
Checklist
How to use
Set --speculative-token-map to use this optimization. Or you can obtain high-frequency tokens by yourself.
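A launch command might look like the following. Only --speculative-token-map is introduced by this PR; the other flags follow sglang's usual EAGLE arguments and may differ across versions, so treat this as an assumed sketch rather than a verified invocation.

```shell
# Sketch: serve Llama-3.1-8B with an EAGLE draft model and a reduced
# draft vocabulary. Flag names other than --speculative-token-map are
# assumptions based on sglang's EAGLE setup and may vary by version.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B \
  --speculative-token-map /path/to/high-frequency-token-ids
```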
Performance
Big model: meta-llama/Llama-3.1-8B-Instruct
Draft model: lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B
Device: 1*H20
I analyzed a segment from the profile.
After adopting the smaller vocabulary for the draft model, the lm_head ran 7.72x faster than before. Additionally, the time consumed by the draft decode phase was reduced by 50.42%, and the extend phase saw a 26.30% reduction in processing time.
Overall, the overhead introduced by our approach is less than the time saved through small-model speculation, making this a promising attempt at inference acceleration.
3. End-to-End Test:
base: 111.51 tokens/s
after optimization: 119.27 tokens/s
base: 84.49 tokens/s
after optimization: 100.24 tokens/s