[Feature]: Add speculative decoding with draft model pruning #24506

@jmamou

🚀 The feature, motivation and pitch

I have implemented the FR-Spec approach at the logits processor level, using AllowedTokenIdsLogitsProcessor. This implementation does not prune the draft model itself but allows evaluating acceptance rates under different draft pruning ratios. You can find the code here.
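To illustrate what restricting the draft at the logits level means, here is a minimal pure-Python sketch. It is not the code from the linked implementation: `mask_to_allowed` and `softmax` are hypothetical helpers, and the real AllowedTokenIdsLogitsProcessor operates on batched tensors rather than Python lists.

```python
import math


def mask_to_allowed(logits, allowed_ids):
    """Return a copy of `logits` with every token outside `allowed_ids` set to -inf."""
    allowed = set(allowed_ids)
    if not allowed:
        raise ValueError("allowed_ids must contain at least one token ID")
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Toy 4-token vocabulary: only tokens 0 and 2 survive pruning (illustrative values).
draft_logits = [2.0, 1.0, 0.5, -1.0]
masked = mask_to_allowed(draft_logits, allowed_ids=[0, 2])
probs = softmax(masked)
```

Because the masked tokens receive zero probability, the draft can never propose them, which is how acceptance rates under a given pruning ratio can be measured without modifying the draft model's weights.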

MT-Bench results:

| pruning ratio | vanilla | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 | 0.99 |
|---|---|---|---|---|---|---|---|
| draft acceptance rate (%) | 27.8 | 28.3 | 28.6 | 28.6 | 27.2 | 25.9 | 18.8 |
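For easier reading, the change in acceptance rate relative to the vanilla (unpruned) run can be computed directly from the numbers reported above. This is only arithmetic on those values; the variable names are illustrative.

```python
# Acceptance rates (%) copied from the MT-Bench results in this issue.
vanilla = 27.8
acceptance_by_ratio = {0.1: 28.3, 0.25: 28.6, 0.5: 28.6, 0.75: 27.2, 0.9: 25.9, 0.99: 18.8}

# Difference to the vanilla run, in percentage points (pp), rounded to one decimal.
delta_vs_vanilla = {
    ratio: round(rate - vanilla, 1) for ratio, rate in acceptance_by_ratio.items()
}

for ratio, delta in delta_vs_vanilla.items():
    print(f"pruning_ratio={ratio:<4}  delta={delta:+.1f} pp")
```

The deltas stay within one percentage point of vanilla up to a pruning ratio of 0.75 and drop by 9.0 points at 0.99.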

New speculative config parameters:

  • token_ids_by_frequency: Path to a tensor file containing token frequencies sorted by token IDs, used for pruning-based speculative decoding.
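  • pruning_ratio: Fraction of the vocabulary (between 0 and 1) to prune from the draft model's candidate tokens during speculative decoding; for example, 0.1 prunes 10% of the tokens.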
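As a rough sketch of how these two parameters could interact, the snippet below derives the set of token IDs that remain available to the draft. It rests on assumptions not confirmed by this issue: that the file holds one frequency per token ID, that the least frequent tokens are the ones pruned, and that the pruned count is truncated to an integer. `kept_token_ids` is a hypothetical helper, and a plain list stands in for the tensor loaded from the `.pt` file.

```python
def kept_token_ids(frequencies, pruning_ratio):
    """Return the sorted token IDs that survive pruning.

    frequencies[t] is the observed frequency of token ID t (assumed layout).
    The int(vocab_size * pruning_ratio) least frequent tokens are dropped (assumed semantics).
    """
    if not 0.0 <= pruning_ratio < 1.0:
        raise ValueError("pruning_ratio must be in [0, 1)")
    vocab_size = len(frequencies)
    num_pruned = int(vocab_size * pruning_ratio)
    # Rank token IDs from most to least frequent; ties keep ascending ID order.
    ranked = sorted(range(vocab_size), key=lambda t: frequencies[t], reverse=True)
    return sorted(ranked[: vocab_size - num_pruned])


# Toy 10-token vocabulary with made-up frequencies.
toy_frequencies = [50, 3, 0, 120, 7, 0, 31, 1, 12, 90]
kept_half = kept_token_ids(toy_frequencies, pruning_ratio=0.5)
kept_most = kept_token_ids(toy_frequencies, pruning_ratio=0.1)
```

The resulting ID list is what would be handed to an allowed-token-IDs logits processor, so the draft only proposes tokens from the retained part of the vocabulary.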

Example usage

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative_config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 4, "token_ids_by_frequency": "vllm/examples/target-dist-meta-llama-Llama-3.1-8B-Instruct-wikitext-wikitext-103-raw-v1-train.pt", "pruning_ratio": 0.1}'

vllm bench serve \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint-type openai-chat \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num-prompts 80 \
  --max-concurrency 16 \
  --temperature 1 \
  --top-p 1.0

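By pruning low-frequency tokens from the draft model's vocabulary, this feature is expected to improve speculative decoding speedups while keeping acceptance rates close to the unpruned baseline: in the MT-Bench results above, the acceptance rate stays within one percentage point of vanilla for pruning ratios up to 0.75. This should enable faster inference for models such as Llama-3.1-8B-Instruct.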

@keyboardAnt @eitanturok
FR-Spec implementation #24343

Alternatives

No response

Additional context

No response
