feat(cpu): add CPU support for draft model speculative decoding #32662

bigPYJ1151 merged 7 commits into vllm-project:main
Conversation

👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of checks runs automatically. You can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces CPU support for speculative decoding with draft models by implementing PyTorch fallbacks for Triton-specific kernels. The changes are well-structured, adding HAS_TRITON checks to conditionally execute either the optimized Triton kernels on CUDA devices or the new PyTorch-based CPU implementations. The CPU model runner is also appropriately updated to handle speculative decoding logic without relying on CUDA-specific features. I've identified a performance optimization opportunity in one of the new PyTorch fallback functions.
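The conditional-dispatch pattern the review describes can be sketched as below. This is an illustrative stand-in, not the actual vLLM code: the HAS_TRITON detection and the count_accepted helper are hypothetical names, and the Triton path is elided.

```python
# Illustrative sketch (not the actual vLLM code) of the dispatch pattern:
# use the Triton kernel when Triton and a CUDA device are available,
# otherwise run a pure-PyTorch CPU fallback. All names are stand-ins.
import torch

try:
    import triton  # noqa: F401
    HAS_TRITON = torch.cuda.is_available()
except ImportError:
    HAS_TRITON = False

def count_accepted_pytorch(draft_ids: torch.Tensor,
                           target_ids: torch.Tensor) -> torch.Tensor:
    """CPU fallback: number of leading draft tokens matching the target."""
    mismatch = draft_ids != target_ids               # [batch, k] bool
    k = draft_ids.shape[1]
    # Index of the first mismatch per row; k when the whole row matches.
    return torch.where(
        mismatch.any(dim=1),
        mismatch.int().argmax(dim=1),
        torch.full((draft_ids.shape[0],), k),
    )

def count_accepted(draft_ids, target_ids):
    if HAS_TRITON:
        # The optimized Triton kernel would be launched here (elided).
        pass
    return count_accepted_pytorch(draft_ids, target_ids)
```

The fallback stays numerically equivalent to the kernel it replaces, which is why the runner-level dispatch can be a single boolean check.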
Hi @ganeshr10, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
Force-pushed from 3111ae8 to 5a5248f.
benchislett left a comment:

I am in the process of reworking merge_toks_kernel into a new kernel that can handle parallel drafting. Stay tuned for updates, but in the meantime (a few days at most) there's not much value in maintaining this.
@benchislett I tested your #32887 (Unified Parallel Drafting) on CPU. I updated my code locally to add CPU fallbacks for the new kernels.

Test setup:

I obtained the same acceptance length (3.56) as mentioned in your benchmark comment.
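For context on the 3.56 figure: mean acceptance length is conventionally the average number of tokens emitted per verification step, i.e. the accepted draft tokens plus the one token the target model always produces. A minimal sketch of that bookkeeping, with a hypothetical helper name:

```python
# Hypothetical helper showing how a mean acceptance length (such as the
# 3.56 reported above) is typically computed from per-step accept counts.
def mean_acceptance_length(accepted_per_step: list[int]) -> float:
    """accepted_per_step[i] = draft tokens accepted at verification step i."""
    if not accepted_per_step:
        return 0.0
    # +1 for the token the target model emits at every step.
    return sum(a + 1 for a in accepted_per_step) / len(accepted_per_step)
```

Matching this metric between the CPU fallback and the Triton path is a useful correctness signal, since the two implementations should make identical accept/reject decisions.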
Force-pushed from 5a5248f to 800cb0e.
Force-pushed from 800cb0e to e5fc9e0.
Force-pushed from c3f1b62 to 8cef624.
Force-pushed from 8cef624 to fc1cad0.
Review comment on this hunk:

        is_greedy,
        max_spec_len,
    )
    if HAS_TRITON and device.type == "cuda":

Suggested change:

    - if HAS_TRITON and device.type == "cuda":
    + if HAS_TRITON:

Please remove device.type == "cuda".
Review comment on this hunk:

        vocab_size,
        NO_DRAFT_PROBS=draft_probs is None,
    )
    if HAS_TRITON and device.type == "cuda":

Suggested change:

    - if HAS_TRITON and device.type == "cuda":
    + if HAS_TRITON:
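The NO_DRAFT_PROBS flag visible in the hunk distinguishes the case where no draft distribution is stored. A hedged per-token sketch of the underlying rejection rule (not the vLLM kernel, which operates batched over the whole vocabulary):

```python
# Hypothetical single-token sketch of speculative rejection sampling:
# accept draft token t with probability min(1, p_target(t) / p_draft(t)).
# When draft_probs is None (the NO_DRAFT_PROBS case), the draft is treated
# as deterministic, so acceptance reduces to matching the target argmax.
import torch

def accept_draft_token(target_probs, draft_probs, token_id, u):
    if draft_probs is None:
        return int(target_probs.argmax()) == token_id
    ratio = target_probs[token_id] / draft_probs[token_id].clamp_min(1e-10)
    return u < float(ratio.clamp(max=1.0))
```

A CPU fallback only needs tensor indexing and clamping here, which is why a PyTorch implementation can mirror the Triton kernel's decisions exactly.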
Force-pushed from a3c77c3 to 5946f6f.
@bigPYJ1151 Refactored the code to follow the pattern mentioned in #37987.
bigPYJ1151 left a comment:

Thanks @ganeshr10! It looks good now. Just have some nits, please check :)
Review comment on this hunk:

    #include <torch/extension.h>
    #include <omp.h>

There are some compilation errors in the build image with these headers. Please use cpu_types.hpp, which already includes the required headers.
@ganeshr10 looks like there are some format issues, please check :)
Commits:

- …uniform parallel drafting (Change-Id: I12fd564ddb73a5a6008f21e9161e52f728d45353)
- Use centralized HAS_TRITON from vllm.triton_utils.importing; remove redundant device.type == "cuda" checks; refactor PyTorch fallbacks to use tensor operations instead of for-loops (Change-Id: Ia767bb908b60bde35038c241867f077b08b1ae9a)
- …etadata: add eagle_step_update_slot_mapping_and_metadata_pytorch fallback (Change-Id: I131801d1fda35b990eee1e9b9b228ca54bd56e17)
- Move all PyTorch fallback implementations to a dedicated file; update imports in eagle.py, utils.py, and rejection_sampler.py; addresses the review comment to separate CPU fallback code (Change-Id: If7197381462b2b39958faab644f23cc42bfa9a5a)
- Add C++ implementations with OpenMP for all 8 spec decode kernels in csrc/cpu/spec_decode_utils.cpp; monkey-patch the kernels in CPUModelRunner._postprocess_triton(); follows the pattern from PR vllm-project#37987 as suggested by @bigPYJ1151 (Change-Id: Ia1794e9f04447f23d104a623906cda4ce098468b)
- (no subject shown; Change-Id: I965aac3d579660c5e8b6ee949201654c1fa8ac9c)
- (no subject shown; Change-Id: Ib4918c756fd4d46d52cf24b905a453cea1e2eb63)

All commits are Signed-off-by: R <Ganesh.R@amd.com>.
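The monkey-patch approach from the commits can be sketched roughly as follows. CPUModelRunner._postprocess_triton and the kernel names come from the commit messages; the module layout and helper functions below are hypothetical illustrations, not the PR's actual code:

```python
# Hypothetical sketch of the plugin/monkey-patch pattern the commits
# describe: at CPU-runner setup time, Triton kernel symbols in the shared
# spec-decode module are replaced with CPU implementations, so the common
# code path needs no device branches.
import types

def make_cpu_ops():
    """Stand-ins for the C++/OpenMP CPU kernels registered by the PR."""
    def merge_tokens(draft_tokens, bonus_tokens):
        # Trivial placeholder: concatenate accepted draft and bonus tokens.
        return draft_tokens + bonus_tokens
    return {"merge_tokens": merge_tokens}

def patch_spec_decode(module):
    """Swap the Triton kernels for CPU fallbacks, in place."""
    for name, fn in make_cpu_ops().items():
        setattr(module, name, fn)

# Usage with a fake module standing in for the real spec-decode utils:
spec_decode_utils = types.ModuleType("spec_decode_utils")
spec_decode_utils.merge_tokens = None  # pretend this is the Triton kernel
patch_spec_decode(spec_decode_utils)
```

The appeal of this design, as discussed below, is that the Triton implementation itself stays untouched: the CPU backend only rebinds symbols at startup.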
Deprecation notice: This pull request comes from a fork and was rebased using
Force-pushed from bfc60e3 to 406c20f.
After the refactor, this PR implements SD on CPU via a plugin pattern without heavy changes to the Triton implementation, and it will not increase the maintenance effort of the Triton SD path. The concern should be resolved.
Hi @benchislett, I think the new implementation has resolved your concern, so I would like to move the PR forward. Please let me know if you have further thoughts, thanks! :)
…-project#32662) Signed-off-by: R <Ganesh.R@amd.com>
Purpose
This PR enables speculative decoding with draft models on CPU by adding PyTorch fallbacks for Triton kernels.
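One of the commits replaces Python for-loops in the fallbacks with tensor operations. A small hypothetical example of that kind of refactor (not code from this PR): expanding per-request draft-token counts into a flat request-index map.

```python
# Hypothetical illustration of replacing a Python loop with a tensor op,
# as one commit in this PR does for its CPU fallbacks.
import torch

def expand_loop(counts: torch.Tensor) -> torch.Tensor:
    out = []
    for req, n in enumerate(counts.tolist()):
        out.extend([req] * n)
    return torch.tensor(out, dtype=torch.long)

def expand_vectorized(counts: torch.Tensor) -> torch.Tensor:
    # Same result with a single tensor op; typically much faster on CPU.
    return torch.repeat_interleave(torch.arange(len(counts)), counts)
```

Both functions map counts [2, 0, 3] to request indices [0, 0, 2, 2, 2]; the vectorized form avoids Python-level iteration entirely.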
Benchmark Test

Commands:

    vllm serve Qwen/Qwen3-32B --dtype=bfloat16 --trust_remote_code --host 0.0.0.0 --port 8008 --max-model-len 20000 --speculative_config '{"model": "Qwen/Qwen3-1.7B", "num_speculative_tokens": 3, "method": "draft_model", "dtype": "bfloat16", "max_model_len": 20000}'

    vllm bench serve --dataset-name hf --dataset-path philschmid/mt-bench --model Qwen/Qwen3-8B --host 0.0.0.0 --port 8008 --num-prompts 80 --max-concurrency (1/100) --temperature 0.0 --top-p 1.0

Results - Qwen 3 metrics
Performance Highlights:
Essential Elements of an Effective PR Description Checklist

- Update supported_models.md and examples for a new model.