Open
Labels
feature request: New feature or request
Description
🚀 The feature, motivation and pitch
I have implemented the FR-Spec approach at the logits processor level, using AllowedTokenIdsLogitsProcessor. This implementation does not prune the draft model itself but allows evaluating acceptance rates under different draft pruning ratios. You can find the code here.
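To illustrate what restricting the draft at the logits level means, here is a minimal sketch in plain Python. It is not vLLM's AllowedTokenIdsLogitsProcessor; the function name `mask_to_allowed` and the toy values are hypothetical. The idea is that every logit outside an allowed set is replaced with negative infinity, so those tokens can never be proposed.

```python
import math


def mask_to_allowed(logits, allowed_token_ids):
    """Return a copy of `logits` with every token outside the allowed set at -inf."""
    allowed = set(allowed_token_ids)
    return [
        logit if token_id in allowed else -math.inf
        for token_id, logit in enumerate(logits)
    ]


# Toy vocabulary of 5 tokens; only tokens 0, 2 and 3 remain candidates.
masked = mask_to_allowed([1.5, 0.2, -0.7, 3.1, 0.9], [0, 2, 3])
print(masked)
```

Because the masking happens after the draft model has already computed full-vocabulary logits, this kind of approach changes which tokens can be drafted but does not reduce the draft model's compute.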
MT-Bench results:
| pruning ratio | vanilla | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 | 0.99 |
|---|---|---|---|---|---|---|---|
| draft acceptance rate (%) | 27.8 | 28.3 | 28.6 | 28.6 | 27.2 | 25.9 | 18.8 |
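For easier reading of the table, the following short sketch computes the change in acceptance rate relative to the vanilla baseline. The numbers are copied from the table above; the variable names are only illustrative.

```python
# Acceptance rates (%) copied from the MT-Bench table above.
vanilla_rate = 27.8
rates_by_pruning_ratio = {
    0.1: 28.3,
    0.25: 28.6,
    0.5: 28.6,
    0.75: 27.2,
    0.9: 25.9,
    0.99: 18.8,
}

# Difference in percentage points against the vanilla baseline.
deltas = {
    ratio: round(rate - vanilla_rate, 1)
    for ratio, rate in rates_by_pruning_ratio.items()
}

for ratio, delta in deltas.items():
    print(f"pruning_ratio={ratio}: {delta:+.1f} points vs. vanilla")
```

The acceptance rate stays within one percentage point of vanilla up to a pruning ratio of 0.75 and drops by 9.0 points at 0.99.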
New speculative config parameters:
- token_ids_by_frequency: Path to a tensor file containing token frequencies sorted by token IDs, used for pruning-based speculative decoding.
- pruning_ratio: Fraction of tokens to prune from the draft's candidate vocabulary during speculative decoding (for example, 0.1 prunes 10% of tokens).
Example usage
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative_config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 4, "token_ids_by_frequency": "vllm/examples/target-dist-meta-llama-Llama-3.1-8B-Instruct-wikitext-wikitext-103-raw-v1-train.pt", "pruning_ratio": 0.1}'
vllm bench serve \
--model meta-llama/Llama-3.1-8B-Instruct \
--endpoint-type openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path philschmid/mt-bench \
--num-prompts 80 \
--max-concurrency 16 \
--temperature 1 \
--top-p 1.0
By selectively pruning unlikely tokens in the draft model, this feature is expected to improve speculative decoding speedups while keeping acceptance rates close to the vanilla baseline (within about 2 percentage points up to a pruning ratio of 0.9 in the MT-Bench results above), enabling faster inference for models like Llama-3.1-8B-Instruct.
@keyboardAnt @eitanturok
FR-Spec implementation #24343
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.