feat(cpu): add CPU support for draft model speculative decoding #32662

bigPYJ1151 merged 7 commits into vllm-project:main
Conversation

👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of checks runs automatically. You can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces CPU support for speculative decoding with draft models by implementing PyTorch fallbacks for Triton-specific kernels. The changes are well-structured, adding HAS_TRITON checks to conditionally execute either the optimized Triton kernels on CUDA devices or the new PyTorch-based CPU implementations. The CPU model runner is also appropriately updated to handle speculative decoding logic without relying on CUDA-specific features. I've identified a performance optimization opportunity in one of the new PyTorch fallback functions.
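The conditional-dispatch pattern the review describes can be sketched as below. This is an illustrative stand-in, not the actual vLLM code: the HAS_TRITON detection and the count_accepted helper are hypothetical names, and the Triton path is elided.

```python
# Illustrative sketch (not the actual vLLM code) of the dispatch pattern:
# use the Triton kernel when Triton and a CUDA device are available,
# otherwise run a pure-PyTorch CPU fallback. All names are stand-ins.
import torch

try:
    import triton  # noqa: F401
    HAS_TRITON = torch.cuda.is_available()
except ImportError:
    HAS_TRITON = False

def count_accepted_pytorch(draft_ids: torch.Tensor,
                           target_ids: torch.Tensor) -> torch.Tensor:
    """CPU fallback: number of leading draft tokens matching the target."""
    mismatch = draft_ids != target_ids               # [batch, k] bool
    k = draft_ids.shape[1]
    # Index of the first mismatch per row; k when the whole row matches.
    return torch.where(
        mismatch.any(dim=1),
        mismatch.int().argmax(dim=1),
        torch.full((draft_ids.shape[0],), k),
    )

def count_accepted(draft_ids, target_ids):
    if HAS_TRITON:
        # The optimized Triton kernel would be launched here (elided).
        pass
    return count_accepted_pytorch(draft_ids, target_ids)
```

The fallback stays numerically equivalent to the kernel it replaces, which is why the runner-level dispatch can be a single boolean check.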
Hi @ganeshr10, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
Force-pushed from 3111ae8 to 5a5248f.
benchislett left a comment:

I am in the process of reworking merge_toks_kernel into a new kernel that can handle parallel drafting. Stay tuned for updates, but in the meantime (a few days at most) there's not much value in maintaining this.
@benchislett I tested your #32887 (Unified Parallel Drafting) on CPU. I updated my code locally to add CPU fallbacks for the new kernels.

Test setup:

I obtained the same acceptance length (3.56) as mentioned in your benchmark comment.
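For context on the 3.56 figure: mean acceptance length is conventionally the average number of tokens emitted per verification step, i.e. the accepted draft tokens plus the one token the target model always produces. A minimal sketch of that bookkeeping, with a hypothetical helper name:

```python
# Hypothetical helper showing how a mean acceptance length (such as the
# 3.56 reported above) is typically computed from per-step accept counts.
def mean_acceptance_length(accepted_per_step: list[int]) -> float:
    """accepted_per_step[i] = draft tokens accepted at verification step i."""
    if not accepted_per_step:
        return 0.0
    # +1 for the token the target model emits at every step.
    return sum(a + 1 for a in accepted_per_step) / len(accepted_per_step)
```

Matching this metric between the CPU fallback and the Triton path is a useful correctness signal, since the two implementations should make identical accept/reject decisions.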
Force-pushed from 5a5248f to 800cb0e.
Force-pushed from 800cb0e to e5fc9e0.
Force-pushed from c3f1b62 to 8cef624.
Force-pushed from 8cef624 to fc1cad0.
Review comment on this hunk:

        is_greedy,
        max_spec_len,
    )
    if HAS_TRITON and device.type == "cuda":

Suggested change:

    - if HAS_TRITON and device.type == "cuda":
    + if HAS_TRITON:

Please remove device.type == "cuda".
Review comment on this hunk:

        vocab_size,
        NO_DRAFT_PROBS=draft_probs is None,
    )
    if HAS_TRITON and device.type == "cuda":

Suggested change:

    - if HAS_TRITON and device.type == "cuda":
    + if HAS_TRITON:
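The NO_DRAFT_PROBS flag visible in the hunk distinguishes the case where no draft distribution is stored. A hedged per-token sketch of the underlying rejection rule (not the vLLM kernel, which operates batched over the whole vocabulary):

```python
# Hypothetical single-token sketch of speculative rejection sampling:
# accept draft token t with probability min(1, p_target(t) / p_draft(t)).
# When draft_probs is None (the NO_DRAFT_PROBS case), the draft is treated
# as deterministic, so acceptance reduces to matching the target argmax.
import torch

def accept_draft_token(target_probs, draft_probs, token_id, u):
    if draft_probs is None:
        return int(target_probs.argmax()) == token_id
    ratio = target_probs[token_id] / draft_probs[token_id].clamp_min(1e-10)
    return u < float(ratio.clamp(max=1.0))
```

A CPU fallback only needs tensor indexing and clamping here, which is why a PyTorch implementation can mirror the Triton kernel's decisions exactly.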
Force-pushed from a3c77c3 to 5946f6f.
@bigPYJ1151 Refactored the code to follow the pattern mentioned in #37987.
bigPYJ1151 left a comment:

Thanks @ganeshr10! It looks good now. Just have some nits, please check :)
Review comment on this hunk:

    #include <torch/extension.h>
    #include <omp.h>

There are some compilation errors in the build image with these headers. Please use cpu_types.hpp, which already includes the required headers.
@ganeshr10 looks like there are some format issues, please check :)
Commits:

- …uniform parallel drafting (Change-Id: I12fd564ddb73a5a6008f21e9161e52f728d45353)
- Use centralized HAS_TRITON from vllm.triton_utils.importing; remove redundant device.type == "cuda" checks; refactor PyTorch fallbacks to use tensor operations instead of for-loops (Change-Id: Ia767bb908b60bde35038c241867f077b08b1ae9a)
- …etadata: add eagle_step_update_slot_mapping_and_metadata_pytorch fallback (Change-Id: I131801d1fda35b990eee1e9b9b228ca54bd56e17)
- Move all PyTorch fallback implementations to a dedicated file; update imports in eagle.py, utils.py, and rejection_sampler.py; addresses the review comment to separate CPU fallback code (Change-Id: If7197381462b2b39958faab644f23cc42bfa9a5a)
- Add C++ implementations with OpenMP for all 8 spec decode kernels in csrc/cpu/spec_decode_utils.cpp; monkey-patch the kernels in CPUModelRunner._postprocess_triton(); follows the pattern from PR vllm-project#37987 as suggested by @bigPYJ1151 (Change-Id: Ia1794e9f04447f23d104a623906cda4ce098468b)
- (no subject shown; Change-Id: I965aac3d579660c5e8b6ee949201654c1fa8ac9c)
- (no subject shown; Change-Id: Ib4918c756fd4d46d52cf24b905a453cea1e2eb63)

All commits are Signed-off-by: R <Ganesh.R@amd.com>.
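The monkey-patch approach from the commits can be sketched roughly as follows. CPUModelRunner._postprocess_triton and the kernel names come from the commit messages; the module layout and helper functions below are hypothetical illustrations, not the PR's actual code:

```python
# Hypothetical sketch of the plugin/monkey-patch pattern the commits
# describe: at CPU-runner setup time, Triton kernel symbols in the shared
# spec-decode module are replaced with CPU implementations, so the common
# code path needs no device branches.
import types

def make_cpu_ops():
    """Stand-ins for the C++/OpenMP CPU kernels registered by the PR."""
    def merge_tokens(draft_tokens, bonus_tokens):
        # Trivial placeholder: concatenate accepted draft and bonus tokens.
        return draft_tokens + bonus_tokens
    return {"merge_tokens": merge_tokens}

def patch_spec_decode(module):
    """Swap the Triton kernels for CPU fallbacks, in place."""
    for name, fn in make_cpu_ops().items():
        setattr(module, name, fn)

# Usage with a fake module standing in for the real spec-decode utils:
spec_decode_utils = types.ModuleType("spec_decode_utils")
spec_decode_utils.merge_tokens = None  # pretend this is the Triton kernel
patch_spec_decode(spec_decode_utils)
```

The appeal of this design, as discussed below, is that the Triton implementation itself stays untouched: the CPU backend only rebinds symbols at startup.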
Deprecation notice: This pull request comes from a fork and was rebased using
Force-pushed from bfc60e3 to 406c20f.
After the refactor, this PR implements SD on CPU via a plugin pattern without heavy changes to the Triton implementation, and it will not increase the maintenance effort of the Triton SD path. The concern should be resolved.
Hi @benchislett, I think the new implementation has resolved your concern, so I would like to move the PR forward. Please let me know if you have further thoughts, thanks! :)
…-project#32662) Signed-off-by: R <Ganesh.R@amd.com>
Purpose
This PR enables speculative decoding with draft models on CPU by adding PyTorch fallbacks for Triton kernels.
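One of the commits replaces Python for-loops in the fallbacks with tensor operations. A small hypothetical example of that kind of refactor (not code from this PR): expanding per-request draft-token counts into a flat request-index map.

```python
# Hypothetical illustration of replacing a Python loop with a tensor op,
# as one commit in this PR does for its CPU fallbacks.
import torch

def expand_loop(counts: torch.Tensor) -> torch.Tensor:
    out = []
    for req, n in enumerate(counts.tolist()):
        out.extend([req] * n)
    return torch.tensor(out, dtype=torch.long)

def expand_vectorized(counts: torch.Tensor) -> torch.Tensor:
    # Same result with a single tensor op; typically much faster on CPU.
    return torch.repeat_interleave(torch.arange(len(counts)), counts)
```

Both functions map counts [2, 0, 3] to request indices [0, 0, 2, 2, 2]; the vectorized form avoids Python-level iteration entirely.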
Benchmark Test

Commands:

    vllm serve Qwen/Qwen3-32B --dtype=bfloat16 --trust_remote_code --host 0.0.0.0 --port 8008 --max-model-len 20000 --speculative_config '{"model": "Qwen/Qwen3-1.7B", "num_speculative_tokens": 3, "method": "draft_model", "dtype": "bfloat16", "max_model_len": 20000}'

    vllm bench serve --dataset-name hf --dataset-path philschmid/mt-bench --model Qwen/Qwen3-8B --host 0.0.0.0 --port 8008 --num-prompts 80 --max-concurrency (1/100) --temperature 0.0 --top-p 1.0

Results - Qwen 3 metrics
Performance Highlights:
Essential Elements of an Effective PR Description Checklist

- Update supported_models.md and examples for a new model.