EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 32% instead of 5% on MTBench #25916
Conversation
Code Review
This pull request introduces a --trim-special-tokens flag to handle cases where special tokens are added twice, which was impacting benchmark performance. The core logic is in the new normalize function. While the intent is correct, the implementation has critical flaws: it includes an incorrect assertion that will crash with some tokenizers and it doesn't fully implement the described suffix trimming. I've provided a refactored implementation to address these correctness issues, making the solution more robust.
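For context, here is a minimal sketch of the kind of trimming the review describes. It is illustrative only and not the PR's actual normalize implementation; the helper name, signature, and the Llama 3.1 token strings in the example are assumptions.

```python
# Illustrative sketch only -- not the PR's normalize() function.
# One way a --trim-special-tokens style option could strip special tokens
# that the chat template has already inserted, so that a later
# encode(..., add_special_tokens=True) does not duplicate them.

def trim_special_tokens(prompt: str, bos_token: str | None, eos_token: str | None) -> str:
    """Remove a leading BOS and a trailing EOS that the chat template already added."""
    if bos_token and prompt.startswith(bos_token):
        prompt = prompt[len(bos_token):]
    if eos_token and prompt.endswith(eos_token):
        prompt = prompt[: -len(eos_token)]
    return prompt


# Example with Llama 3.1 special-token strings:
print(trim_special_tokens("<|begin_of_text|>Hello", "<|begin_of_text|>", "<|end_of_text|>"))
# -> "Hello"
```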
Would a simpler solution here be to simply use the chat completions API for backends that have a conversation input? Is this possible in the vllm benchmarking framework?
vllm benchmark only supports /v1/completions as of now. I can take a look at whether it can be extended to /v1/chat/completions to avoid these changes.
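For illustration only (this is not vLLM benchmark code), the request shapes of the two endpoints differ roughly as follows, which is why the chat endpoint sidesteps the template/tokenization mismatch:

```python
# Sketch of the two OpenAI-compatible request shapes (payloads are examples).

# /v1/completions takes a pre-rendered prompt string, so the benchmark client
# must apply the chat template (and pick tokenization flags) itself:
completions_request = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>",
    "max_tokens": 256,
}

# /v1/chat/completions takes raw messages; the server applies the chat template
# exactly once, with consistent special-token handling:
chat_completions_request = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256,
}
```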
@benchislett - updated to use the /v1/chat/completions API correctly.
I much prefer this feature. I am in favor of going even further and making the chat completions endpoint the default behaviour, but that might have some consequences. Anyone else have thoughts on this?
@benchislett - can you add the ready tag and enable auto-merge on the PR so that CI can run?
Fixes #20780
E3 has 20% better AL (acceptance length) than E1, but the e2e TOPS gain was just 4%. The expectation was to see at least 20% better e2e gains.
The issue was traced to a very small difference in the benchmark data that led to this huge gap. Offline inference gave an AL of 2.79 for E3 on MTBench. However, the AL reported in online serving (after hacking it to report the overall AL rather than a snapshot, by bypassing the reset of the Prometheus metric) was ~2.2. This happened because offline and online inference share the same dataset, but offline sets add_special_tokens to False whereas online was setting it to True for the Llama 3.1 model based on the model config. This meant the prompt used in online serving had <|begin_of_text|> twice at the beginning: once from the chat template and once from tokenizer.encode with add_special_tokens set to True. This very small difference was enough to throw E3 off balance and cause the sharp drop in AL. The drop is not seen in E1, which is why this discrepancy between the online and offline data was never discovered during the E1 ablations.

The right way is to skip the chat template in the dataset builder and use the /v1/chat/completions endpoint, as done below.
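A minimal sketch of the mismatch described above (assumes access to the gated meta-llama/Llama-3.1-8B-Instruct tokenizer, whose <|begin_of_text|> token id is 128000):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": "Hello"}]

# The chat template already prepends <|begin_of_text|> ...
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# ... so encoding with add_special_tokens=True (what online serving did)
# prepends a second BOS, while add_special_tokens=False (what the offline
# benchmark did) keeps a single one.
ids_online = tok.encode(prompt, add_special_tokens=True)
ids_offline = tok.encode(prompt, add_special_tokens=False)

print(ids_online[:2])   # [128000, 128000] -> duplicated <|begin_of_text|>
print(ids_offline[:1])  # [128000]         -> single BOS
```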
cmd:

server:
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 9001 --speculative_config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}'

client:
vllm bench serve --port 9001 --save-result --save-detailed --model meta-llama/Llama-3.1-8B-Instruct --temperature=0.0 --top-p=1.0 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --max-concurrency 4 --result-dir "./throwaway" --endpoint "/v1/chat/completions" --backend openai-chat --skip-chat-template

TPOT on MTBench BS4 on H100: