diff --git a/docs/features/speculative_decoding/README.md b/docs/features/speculative_decoding/README.md
index ee6e0c895d43..9793de3f4c35 100644
--- a/docs/features/speculative_decoding/README.md
+++ b/docs/features/speculative_decoding/README.md
@@ -6,11 +6,12 @@ To train your own draft models for optimized speculative decoding, see [vllm-pro
 
 ## vLLM Speculation Methods
 
-vLLM supports a variety of methods of speculative decoding. Model-based methods such as EAGLE, MTP, draft models, and MLP provide the best latency reduction, while simpler methods such as n-gram and suffix decoding provide modest speedups without increasing workload during peak traffic.
+vLLM supports a variety of methods of speculative decoding. Model-based methods such as EAGLE, MTP, draft models, PARD, and MLP provide the best latency reduction, while simpler methods such as n-gram and suffix decoding provide modest speedups without increasing workload during peak traffic.
 
 - [EAGLE](eagle.md)
 - [Multi-Token Prediction (MTP)](mtp.md)
 - [Draft Model](draft_model.md)
+- [Parallel Draft Model (PARD)](parallel_draft_model.md)
 - [Multi-Layer Perceptron](mlp.md)
 - [N-Gram](n_gram.md)
 - [Suffix Decoding](suffix.md)
@@ -25,6 +26,7 @@ depend on your model family, traffic pattern, hardware, and sampling settings.
 | EAGLE | High gain | Medium to high gain | Strong general-purpose model-based method. |
 | MTP | High gain | Medium to high gain | Best when the target model has native MTP support. |
 | Draft model | High gain | Medium gain | Needs a separate draft model. |
+| Parallel Draft Model | High gain | Medium to high gain | Low draft model latency. |
 | MLP speculator | Medium to high gain | Medium gain | Good when compatible MLP speculators are available. |
 | N-gram | Low to medium gain | Medium gain | Lightweight and easy to enable. |
 | Suffix decoding | Low to medium gain | Medium gain | No extra draft model; dynamic speculation depth. |
diff --git a/docs/features/speculative_decoding/parallel_draft_model.md b/docs/features/speculative_decoding/parallel_draft_model.md
new file mode 100644
index 000000000000..2a3f11a302d3
--- /dev/null
+++ b/docs/features/speculative_decoding/parallel_draft_model.md
@@ -0,0 +1,46 @@
+# Parallel Draft Models
+
+The following code configures vLLM to use speculative decoding where proposals are generated by [PARD](https://arxiv.org/pdf/2504.18583) (Parallel Draft Models).
+
+## PARD Offline Mode Example
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = ["The future of AI is"]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+llm = LLM(
+    model="Qwen/Qwen3-8B",
+    tensor_parallel_size=1,
+    speculative_config={
+        "model": "amd/PARD-Qwen3-0.6B",
+        "num_speculative_tokens": 12,
+        "method": "draft_model",
+        "parallel_drafting": True,
+    },
+)
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+## PARD Online Mode Example
+
+```bash
+vllm serve Qwen/Qwen3-4B \
+    --host 0.0.0.0 \
+    --port 8000 \
+    --seed 42 \
+    -tp 1 \
+    --max_model_len 2048 \
+    --gpu_memory_utilization 0.8 \
+    --speculative_config '{"model": "amd/PARD-Qwen3-0.6B", "num_speculative_tokens": 12, "method": "draft_model", "parallel_drafting": true}'
+```
+
+## Pre-trained PARD weights
+
+- [amd/pard](https://huggingface.co/collections/amd/pard)
diff --git a/pyproject.toml b/pyproject.toml
index cc8f53036723..b786f0d5985e 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -175,6 +175,8 @@ tme = "tme"
 dout = "dout"
 Pn = "Pn"
 arange = "arange"
+PARD = "PARD"
+pard = "pard"
 
 [tool.typos.type.py]
 extend-glob = []