EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 32% instead of 5% on MTBench #25916
Conversation
Code Review
This pull request introduces a --trim-special-tokens flag to handle cases where special tokens are added twice, which was impacting benchmark performance. The core logic is in the new normalize function. While the intent is correct, the implementation has critical flaws: it includes an incorrect assertion that will crash with some tokenizers and it doesn't fully implement the described suffix trimming. I've provided a refactored implementation to address these correctness issues, making the solution more robust.
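For context, here is a minimal sketch of the kind of trimming the review describes. It is illustrative only and not the PR's actual normalize implementation; the helper name, signature, and the Llama 3.1 token strings in the example are assumptions.

```python
# Illustrative sketch only -- not the PR's normalize() function.
# One way a --trim-special-tokens style option could strip special tokens
# that the chat template has already inserted, so that a later
# encode(..., add_special_tokens=True) does not duplicate them.

def trim_special_tokens(prompt: str, bos_token: str | None, eos_token: str | None) -> str:
    """Remove a leading BOS and a trailing EOS that the chat template already added."""
    if bos_token and prompt.startswith(bos_token):
        prompt = prompt[len(bos_token):]
    if eos_token and prompt.endswith(eos_token):
        prompt = prompt[: -len(eos_token)]
    return prompt


# Example with Llama 3.1 special-token strings:
print(trim_special_tokens("<|begin_of_text|>Hello", "<|begin_of_text|>", "<|end_of_text|>"))
# -> "Hello"
```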
Would a simpler solution here be to simply use the chat completions API for backends that have a conversation input? Is this possible in the vllm benchmarking framework?
vllm benchmark only supports /v1/completions as of now. I can take a look at whether it can be extended to /v1/chat/completions to avoid these changes.
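For illustration only (this is not vLLM benchmark code), the request shapes of the two endpoints differ roughly as follows, which is why the chat endpoint sidesteps the template/tokenization mismatch:

```python
# Sketch of the two OpenAI-compatible request shapes (payloads are examples).

# /v1/completions takes a pre-rendered prompt string, so the benchmark client
# must apply the chat template (and pick tokenization flags) itself:
completions_request = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>",
    "max_tokens": 256,
}

# /v1/chat/completions takes raw messages; the server applies the chat template
# exactly once, with consistent special-token handling:
chat_completions_request = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256,
}
```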
@benchislett - updated to use the /v1/chat/completions API correctly.
I much prefer this feature. I am in favor of going even further and making the chat completions endpoint the default behaviour, but that might have some consequences. Anyone else have thoughts on this?
@benchislett - can you add the ready tag and enable auto-merge on the PR so that CI can run?
Fixes #20780
E3 has 20% better AL (acceptance length) than E1, but the e2e TOPS gain was just 4%. The expectation was to see at least 20% better e2e gains.
The issue was traced to a very small difference in the benchmark data that led to this huge gap. Offline inference gave an AL of 2.79 for E3 on MTBench. However, the AL reported in online serving (after hacking it to report the overall AL rather than a snapshot, by bypassing the reset of the Prometheus metric) was ~2.2. This happened because offline and online inference share the same dataset, but offline sets add_special_tokens to False whereas online was setting it to True for the Llama 3.1 model based on the model config. This meant the prompt used in online serving had <|begin_of_text|> twice at the beginning: once from the chat template and once from tokenizer.encode with add_special_tokens set to True. This very small difference was enough to throw E3 off balance and cause the sharp drop in AL. The drop is not seen in E1, which is why this discrepancy between the online and offline data was never discovered during the E1 ablations.

The right way is to skip the chat template in the dataset builder and use the /v1/chat/completions endpoint, as done below.
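A minimal sketch of the mismatch described above (assumes access to the gated meta-llama/Llama-3.1-8B-Instruct tokenizer, whose <|begin_of_text|> token id is 128000):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": "Hello"}]

# The chat template already prepends <|begin_of_text|> ...
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# ... so encoding with add_special_tokens=True (what online serving did)
# prepends a second BOS, while add_special_tokens=False (what the offline
# benchmark did) keeps a single one.
ids_online = tok.encode(prompt, add_special_tokens=True)
ids_offline = tok.encode(prompt, add_special_tokens=False)

print(ids_online[:2])   # [128000, 128000] -> duplicated <|begin_of_text|>
print(ids_offline[:1])  # [128000]         -> single BOS
```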
cmd:

server:
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 9001 --speculative_config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}'

client:
vllm bench serve --port 9001 --save-result --save-detailed --model meta-llama/Llama-3.1-8B-Instruct --temperature=0.0 --top-p=1.0 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --max-concurrency 4 --result-dir "./throwaway" --endpoint "/v1/chat/completions" --backend openai-chat --skip-chat-template

TPOT on MTBench BS4 on H100: