
[feat] add small vocab table for eagle's draft model. #3822

Merged: zhaochenyang20 merged 21 commits into sgl-project:main from Zhou-sx:simplify_lm_head on Mar 3, 2025

Conversation

@Zhou-sx (Contributor) commented Feb 24, 2025

Motivation

In speculative decoding, while the draft model's decode layers are drastically reduced, the computational bottleneck shifts to its lm_head, whose cost is proportional to vocabulary size. In our profile, a draft-model decode step takes approximately 1.8 ms, of which the lm_head accounts for about 860 us, roughly half of the draft model's time. Because the target model's verification guarantees correctness, the draft model can safely use a truncated vocabulary that retains only high-frequency tokens. This optimization reduces lm_head overhead, further accelerating the EAGLE speculative decoding pipeline without compromising quality.
TODO: to further improve efficiency, I propose dynamically updating the reduced vocabulary (token_map) during inference, so that the draft model adapts to shifting token distributions in real-time workloads.

Modifications

Add --speculative-token-map to ServerArgs.
Change the lm_head part of the LlamaEagle model.
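Conceptually, the lm_head change can be sketched as below. This is a simplified illustration under assumed names (`TruncatedLMHead`, `hot_token_ids`), not the actual sglang code: the draft head scores only the retained rows of the full weight, and draft token ids are mapped back to the full vocabulary before verification.

```python
import torch


class TruncatedLMHead(torch.nn.Module):
    """Sketch: score only a hot-token subset, then map ids back to the full vocab."""

    def __init__(self, full_weight: torch.Tensor, hot_token_ids: torch.Tensor):
        super().__init__()
        # Keep only the lm_head rows for high-frequency tokens.
        self.weight = torch.nn.Parameter(full_weight[hot_token_ids])  # [num_hot, hidden]
        self.register_buffer("hot_token_ids", hot_token_ids)          # [num_hot]

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Logits over the reduced vocabulary only.
        return hidden @ self.weight.T                                 # [batch, num_hot]

    def to_full_vocab(self, small_ids: torch.Tensor) -> torch.Tensor:
        # Translate indices in the small table back to original token ids.
        return self.hot_token_ids[small_ids]


# Toy usage: vocab of 6, hidden size 4, keep tokens {0, 3, 5}.
full_weight = torch.randn(6, 4)
hot = torch.tensor([0, 3, 5])
head = TruncatedLMHead(full_weight, hot)
hidden = torch.randn(2, 4)
logits = head(hidden)                         # shape [2, 3] instead of [2, 6]
draft_ids = head.to_full_vocab(logits.argmax(dim=-1))
```

The target model still verifies over the full vocabulary, which is what keeps the truncation lossless.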


How to use

Set --speculative-token-map to enable this optimization.

Alternatively, you can collect the high-frequency tokens yourself:

  1. Run inference on your dataset using sglang's standard inference mode and persist the outputs.
  2. Extract the top-k high-frequency tokens from the saved file. A reference implementation is available at https://gist.github.com/Zhou-sx/71a9196d2f324c93f79016579fdf57da.
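Step 2 can be sketched as follows. This is a minimal example, not the reference gist; it assumes the saved outputs are available as lists of token ids, and the `top_k` value is illustrative:

```python
from collections import Counter


def build_hot_token_ids(token_id_lists, top_k=32000):
    """Count token frequencies over saved outputs and keep the top_k ids."""
    counts = Counter()
    for ids in token_id_lists:
        counts.update(ids)
    # Sort the surviving ids so the token map is deterministic.
    return sorted(tok for tok, _ in counts.most_common(top_k))


# Toy usage: three short saved outputs, keep the 3 most frequent token ids.
hot_ids = build_hot_token_ids([[1, 2, 2], [2, 3, 3], [4]], top_k=3)
# torch.save(torch.tensor(hot_ids), "hot_token_ids.pt")  # file for --speculative-token-map
```

The resulting ids would then be saved (e.g. with `torch.save`) and passed via `--speculative-token-map`.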
```bash
# Use small vocab table for draft model
python3 -m sglang.launch_server \
  --model-path "meta-llama/Llama-3.1-8B-Instruct" \
  --speculative-algorithm "EAGLE" \
  --speculative-draft-model-path "lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B" \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 16 \
  --cuda-graph-max-bs 8 \
  --dtype "bfloat16" \
  --speculative-token-map {hot_token_ids.pt}
```

Performance

  1. Environment
     Target model: meta-llama/Llama-3.1-8B-Instruct
     Draft model: lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B
     Device: 1 × H20
  2. Detailed cost breakdown for each part of EAGLE
     I analyzed a segment from the profile:
| Stage | Part | Baseline (µs) | After optimization (µs) |
| --- | --- | --- | --- |
| Draft model decode | lm_head | 482.01 | 62.40 |
| Draft model decode | others | 493.00 | 421.03 |
| Draft model decode | total | 975.01 | 483.43 |
| Draft model extend | - | 3305.40 | 2433.08 |
| Target model | - | 14233.15 | 13720.46 |
| others | - | 1473.88 | 1594.46 |
| total | - | 20962.45 | 18717.86 |

After adopting a smaller vocabulary for the draft model, lm_head inference became 7.72× faster. The draft decode phase was reduced by 50.42%, and the extend phase saw a 26.30% reduction in processing time.
Overall, the overhead introduced by our approach is less than the time saved through draft-model speculation, making this a promising approach to inference acceleration.
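As a sanity check, the ratios above follow directly from the profile table (the extend reduction computes to roughly 26.4% from the table numbers, close to the 26.30% quoted):

```python
# lm_head: 482.01 us (baseline) -> 62.40 us (optimized)
lm_head_speedup = 482.01 / 62.40           # ~7.72x

# Draft decode total: 975.01 us -> 483.43 us
decode_reduction = 1 - 483.43 / 975.01     # ~50.42%

# Draft extend: 3305.40 us -> 2433.08 us
extend_reduction = 1 - 2433.08 / 3305.40   # ~26.4%

print(round(lm_head_speedup, 2), round(decode_reduction * 100, 2), round(extend_reduction * 100, 1))
```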

  3. End-to-end test
     • mtbench:
       base: 111.51 tokens/s
       after optimization: 119.27 tokens/s
     • private dataset (A100):
       base: 84.49 tokens/s
       after optimization: 100.24 tokens/s

@Zhou-sx Zhou-sx marked this pull request as ready for review February 25, 2025 02:44
@Zhou-sx Zhou-sx changed the title [Eagle] small vocab table for draft model. [feat] add small vocab table for eagle's draft model. Feb 26, 2025
@zhaochenyang20 (Collaborator)

Great PR. I will ask our speculative decoding team to review this!

@Zhou-sx (Contributor, Author) commented Feb 26, 2025

> Great PR. I will ask our speculative decoding team to review this!

Thanks!

@zhaochenyang20 (Collaborator)

> Thanks!

I've asked weilin, the author of the paper you implemented. He will take a look.

@Zhou-sx (Contributor, Author) commented Feb 26, 2025

> I've asked weilin, the author of the paper you implemented. He will take a look.

OK, but I don't know which paper you mean. Can you provide a link?

@Achazwl (Contributor) commented Feb 26, 2025

> OK, but I don't know which paper you mean. Can you provide a link?

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling (https://arxiv.org/abs/2502.14856). I think we share the same idea.

@Zhou-sx (Contributor, Author) commented Feb 26, 2025

> FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling (https://arxiv.org/abs/2502.14856). I think we share the same idea.

Wow! I will take a look at this.

@zhaochenyang20 (Collaborator)

@Achazwl Hey, we are reworking our EAGLE code these days, so we will ask @Zhou-sx to put this on hold for a while. But the code looks good to us. Wait for our update. Thanks!

Contributor (review comment on the diff):
This operation is time-consuming when running at each iteration.

@Achazwl (Contributor) commented Feb 26, 2025

I used the top 32k frequency tokens from FR-Spec and ran experiments with your code.
Model: Llama-3-8B-Instruct
Data: SpecBench (a benchmark covering 6 tasks: Conversation, Translation, RAG, Summarization, QA, Math)
Device: 1 × A800

Speed performance

  1. eagle_original: the vanilla EAGLE-2.
  2. eagle_static: uses a smaller lm_head based on the static frequencies from FR-Spec (with the dynamic logic in your code turned off).
  3. eagle_dynamic: uses the static FR-Spec frequencies as initialization plus @Zhou-sx's dynamic frequency update logic based on the input.

```
eagle_original vs baseline
============================== Task:  overall ==============================
Tokens per second:  152.14912293271973
Tokens per second for the baseline:  83.68608980929442
Speedup ratio:  1.8180933447773733

eagle_static vs baseline
============================== Task:  overall ==============================
Tokens per second:  168.124455586193
Tokens per second for the baseline:  83.68608980929442
Speedup ratio:  2.008989259377735

eagle_dynamic vs baseline
============================== Task:  overall ==============================
Tokens per second:  162.4540419109253
Tokens per second for the baseline:  83.68608980929442
Speedup ratio:  1.9412311207409607
```

It seems that the dynamic update logic has some negative effect: it requires extra time to maintain the frequencies and needs to re-copy the lm_head's weights.

Correctness

I checked the generated outputs of eagle_original, eagle_static, and eagle_dynamic; they are identical.
The correctness of the code is verified.

@Achazwl (Contributor) commented Feb 26, 2025

@Zhou-sx, I think your dynamic logic would benefit certain scenarios. Would you consider formalizing your dynamic frequency updating logic as a configurable server argument? This would allow users to toggle it on/off based on their needs while inviting the team to contribute different dynamic strategies in the future.

@Achazwl (Contributor) commented Feb 26, 2025


The top 32k frequency tokens are uploaded at link.

@Zhou-sx (Contributor, Author) commented Feb 26, 2025

> @Zhou-sx, I think your dynamic logic would benefit certain scenarios. Would you consider formalizing your dynamic frequency updating logic as a configurable server argument? This would allow users to toggle it on/off based on their needs while inviting the team to contribute different dynamic strategies in the future.

Good idea, I agree with what you said. I will test the performance your way after a while.

@Zhou-sx (Contributor, Author) commented Feb 26, 2025

I did not add dynamic updates at first, but in some test cases the performance was worse than the baseline. In fact, when speculative_token_map_num_dynamic_token is set to 0, the dynamic update is disabled.

@ZhaiFeiyue (Contributor)

> In fact, when speculative_token_map_num_dynamic_token is set to 0, the dynamic update is disabled.

Hi @Achazwl, you are right: the current dynamic update is not well optimized (lm_head is memory-bound, and copying weights hurts performance). The best approach is a CUDA kernel that does the lm_head GEMM with zero copy. @Zhou-sx, we can talk offline.

@Achazwl (Contributor) commented Feb 28, 2025

> ...the best approach is a CUDA kernel that does the lm_head GEMM with zero copy.

That's a good idea. It seems this PR can focus on the static smaller lm_head, leaving the dynamic part to another PR.

@Zhou-sx (Contributor, Author) commented Feb 28, 2025

Yes, I think it should be split into two PRs, as the dynamic method has some extra work to be done.

@Zhou-sx (Contributor, Author) commented Feb 28, 2025

> ...the best approach is a CUDA kernel that does the lm_head GEMM with zero copy.

A GEMM with A, X, and the corresponding hot index?
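Presumably yes, i.e. a kernel computing X @ W[hot_index]^T that gathers rows of W inside the GEMM instead of materializing a copy first. Below is a plain-PyTorch reference for the intended semantics only (the fused zero-copy kernel itself is future work; the names here are illustrative):

```python
import torch


def indexed_lm_head_ref(x: torch.Tensor, weight: torch.Tensor,
                        hot_index: torch.Tensor) -> torch.Tensor:
    """Reference semantics for the proposed zero-copy indexed GEMM.

    A fused kernel would read rows of `weight` through `hot_index` on the
    fly; here the gather is materialized only to define the expected output.
    """
    return x @ weight[hot_index].T            # [batch, num_hot]


x = torch.randn(2, 4)
w = torch.randn(6, 4)                         # full lm_head weight [vocab, hidden]
idx = torch.tensor([0, 3, 5])                 # hot token ids
out = indexed_lm_head_ref(x, w, idx)
# Equivalent to computing full logits and slicing: (x @ w.T)[:, idx]
```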

@Zhou-sx Zhou-sx force-pushed the simplify_lm_head branch from db612f6 to 21e93c2 Compare March 1, 2025 03:57

@Achazwl (Contributor) commented Mar 2, 2025

> please fix the document of speculative decoding:
> https://github.com/sgl-project/sglang/actions/runs/13611969961/job/38050464032?pr=3822
> @Zhou-sx @Achazwl

> working on it

@Zhou-sx I created a PR to your branch and added a method to download the token map from HF, so that it can pass the docs CI.

support download token-map from hf and fix docs
@Zhou-sx (Contributor, Author) commented Mar 2, 2025

> @Zhou-sx I created a PR to your branch and added a method to download the token map from HF, so that it can pass the docs CI.

OK. Thank you for your work.

@zhaochenyang20 (Collaborator)

@Zhou-sx @Achazwl please fix the lint.

@zhaochenyang20 (Collaborator)

I will merge it today and fix lint on my own side.


@zhyncs (Collaborator) commented Mar 2, 2025

Hold on until after #3986.

@zhaochenyang20 (Collaborator)

@zhyncs @Ying1123 I checked with Ying last night. I think we can move forward directly? Am I misunderstanding? We don't need to hold.

@zhyncs (Collaborator) commented Mar 2, 2025

> I checked with Ying last night. I think we can move forward directly? We don't need to hold.

Ah, it's ready to merge once the CIs have passed.

@zhaochenyang20 zhaochenyang20 mentioned this pull request Mar 3, 2025
@zhaochenyang20 (Collaborator)

In #4002 I fixed the errors here and gave credit to @Zhou-sx @Achazwl. Thanks!

@zhyncs zhyncs reopened this Mar 3, 2025
@zhaochenyang20 zhaochenyang20 merged commit 7fbab73 into sgl-project:main Mar 3, 2025
17 checks passed
@zhyncs (Collaborator) commented Mar 3, 2025

Nice work!!

@zhaochenyang20 zhaochenyang20 mentioned this pull request Mar 3, 2025
aoshen524 pushed a commit to aoshen524/sglang that referenced this pull request Mar 10, 2025
…#3822)

Co-authored-by: Achazwl <323163497@qq.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
