[feat] add small vocab table for eagle's draft model[1].#3822
zhaochenyang20 merged 21 commits into sgl-project:main
Conversation
|
Great PR. I will ask our speculative decoding team to review this! |
Thanks! |
I've asked Weilin, the author of the paper you implemented. He will take a look. |
ok. But I don't know which paper you mentioned. Can you provide a link? |
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling https://arxiv.org/abs/2502.14856. I think we share the same idea. |
Wow! I will look at this. |
This operation is time-consuming when running at each iteration.
|
I used the top 32k frequency tokens from FR-Spec and ran experiments with your code.
Speed: it seems that the dynamic update logic has some negative effect. It requires extra time to maintain the frequencies and needs to re-copy the lm_head's weights.
Correctness: I checked the generated output of eagle_original, eagle_static and eagle_dynamic; they are the same. |
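To make the trade-off above concrete, here is a minimal sketch of the dynamic frequency-update idea under discussion: accepted-token frequencies are tracked and the reduced vocabulary (token map) is periodically rebuilt. All names here (`DynamicTokenMap`, `observe`, `update_interval`) are hypothetical and are not the PR's actual API; the point is that every rebuild implies re-copying lm_head rows, which is the memory-bound cost noted in this thread.

```python
from collections import Counter


class DynamicTokenMap:
    """Hypothetical sketch of dynamic token-map maintenance (not the PR's API)."""

    def __init__(self, initial_map, update_interval=1000):
        self.token_map = list(initial_map)  # hot full-vocab token ids
        self.freq = Counter(initial_map)    # seed counts so initial tokens stay hot
        self.steps = 0
        self.update_interval = update_interval

    def observe(self, accepted_token_ids):
        # Count tokens accepted by the target model at each iteration.
        self.freq.update(accepted_token_ids)
        self.steps += 1
        if self.steps % self.update_interval == 0:
            self._rebuild()

    def _rebuild(self):
        # Keep the k most frequent tokens. In a real implementation this is
        # where the corresponding lm_head weight rows would be re-copied,
        # which is the memory-bound overhead discussed above.
        k = len(self.token_map)
        self.token_map = [t for t, _ in self.freq.most_common(k)]
```

A static token map avoids `_rebuild` entirely, which is why splitting the static and dynamic variants into separate PRs (as proposed below in the thread) is attractive.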
|
@Zhou-sx, I think your dynamic logic would benefit certain scenarios. Would you consider formalizing your dynamic frequency updating logic as a configurable server argument? This would allow users to toggle it on/off based on their needs while inviting the team to contribute different dynamic strategies in the future. |
The top 32k frequency token is uploaded at link. |
Good idea. I agree with what you said. I will test the performance your way after a while. |
|
I did not add dynamic updates at first, but in some test cases the performance was worse than the baseline. In fact, when speculative_token_map_num_dynamic_token is set to 0, it is disabled |
Hi @Achazwl, you are right, the current dynamic update is not well optimized (lm_head is memory-bound, and copying weights hurts performance). The best way is to create a CUDA kernel that does the lm_head GEMM with zero-copy. @Zhou-sx, we can have a talk offline. |
That's a good idea. It seems that this PR can focus on the static smaller lm head, leaving the dynamic part to another PR. |
|
Yes, I think it should be divided into two PRs, as the dynamic method has some extra work to be done. |
A GEMM with A, X, and the corresponding hot indices? |
|
Please fix the documentation of speculative decoding: |
working on it |
@Zhou-sx I created a PR to your branch that adds a method to download the token map from Hugging Face, so that it can pass the docs CI. |
support download token-map from hf and fix docs
Ok. Thank you for your work. |
|
I will merge it today and fix lint on my own side. |
|
Hold on until after #3986 |
|
Nice work!! |
…#3822) Co-authored-by: Achazwl <323163497@qq.com> Co-authored-by: Chayenne <zhaochen20@outlook.com>
Motivation
In speculative decoding, while the draft model's decode layers are drastically reduced, the computational bottleneck shifts to its lm_head (whose cost is proportional to vocabulary size). In our case, the profile shows that one draft-model decode step takes approximately 1.8 ms, with the lm_head occupying 860 us, about half of the draft model's time. Because the target model's verification guarantees correctness, the draft model can safely use a truncated vocabulary that retains only high-frequency tokens. This optimization reduces the lm_head overhead, further accelerating the Eagle speculative decoding pipeline without compromising quality.
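The mechanism above can be sketched in a few lines: the draft lm_head keeps only the weight rows of the hot tokens, and the draft's argmax over the small vocabulary is mapped back to full-vocabulary token ids before verification. The shapes and the `token_map` contents below are toy values for illustration, not the actual sglang implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
full_vocab, hidden = 100, 8                # toy sizes for illustration
token_map = np.array([3, 7, 11, 42, 50])   # hypothetical hot-token ids

W_full = rng.standard_normal((full_vocab, hidden))
W_small = W_full[token_map]                # draft lm_head keeps only hot rows

h = rng.standard_normal(hidden)            # a draft hidden state
small_logits = W_small @ h                 # cost scales with len(token_map), not full_vocab
draft_id = token_map[int(np.argmax(small_logits))]  # map back to a full-vocab id

# The target model still verifies draft_id over the full vocabulary, so a
# token missing from token_map can only lower acceptance, never correctness.
assert 0 <= draft_id < full_vocab
```

With a 32k map on a 128k-vocabulary model, the draft lm_head GEMM shrinks by roughly 4x, which is where the speedup reported below comes from.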
Todo: to further enhance efficiency, I propose dynamically updating the reduced vocabulary (token_map) during inference, so that the draft model adapts to shifting token distributions in real-time workloads.
Modifications
Add speculative-token-map in ServerArgs.
Change the lm_head part in the LlamaEagle model.
Checklist
How to use
Set --speculative-token-map to use this optimization. Or you can obtain high-frequency tokens by yourself.
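A launch command might look like the following. Only --speculative-token-map is introduced by this PR; the other flags follow sglang's usual EAGLE arguments and may differ across versions, so treat this as an assumed sketch rather than a verified invocation.

```shell
# Sketch: serve Llama-3.1-8B with an EAGLE draft model and a reduced
# draft vocabulary. Flag names other than --speculative-token-map are
# assumptions based on sglang's EAGLE setup and may vary by version.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B \
  --speculative-token-map /path/to/high-frequency-token-ids
```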
Performance
Big model: meta-llama/Llama-3.1-8B-Instruct
Draft model: lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B
Device: 1*H20
I analyzed a segment from the profile.
After adopting the smaller vocabulary for the draft model, the lm_head ran 7.72x faster than before. Additionally, the time consumed by the draft decode phase was reduced by 50.42%, and the extend phase saw a 26.30% reduction in processing time.
Overall, the overhead introduced by our approach is less than the time saved through small-model speculation, making this a promising attempt at inference acceleration.
3. End-to-End Test:
base: 111.51 tokens/s
after optimization: 119.27 tokens/s
base: 84.49 tokens/s
after optimization: 100.24 tokens/s