
feat(cpu): add CPU support for draft model speculative decoding#32662

Merged
bigPYJ1151 merged 7 commits into vllm-project:main from ganeshr10:cpu_support_spec_decode
Apr 10, 2026

Conversation

@ganeshr10
Contributor

@ganeshr10 ganeshr10 commented Jan 20, 2026

Purpose

This PR enables speculative decoding with draft models on CPU by adding PyTorch fallbacks for Triton kernels.
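The dispatch pattern this PR adds can be sketched minimally as follows. This is illustrative, not vLLM's actual code: the function names are hypothetical, and `HAS_TRITON` (which in vLLM comes from `vllm.triton_utils.importing`) is hardcoded `False` so the sketch runs standalone on a CPU-only machine and exercises the fallback path.

```python
import torch

# In vLLM this flag comes from vllm.triton_utils.importing; it is hardcoded
# False here so this standalone sketch exercises the CPU path.
HAS_TRITON = False

def _expand_tokens_pytorch(tokens: torch.Tensor, repeats: torch.Tensor) -> torch.Tensor:
    # CPU fallback: plain tensor ops instead of a Triton kernel launch.
    return torch.repeat_interleave(tokens, repeats, dim=0)

def expand_tokens(tokens: torch.Tensor, repeats: torch.Tensor) -> torch.Tensor:
    # Hypothetical entry point: Triton kernel when available, PyTorch otherwise.
    if HAS_TRITON:
        # On CUDA builds the optimized Triton kernel would be launched here.
        raise NotImplementedError("Triton path omitted in this sketch")
    return _expand_tokens_pytorch(tokens, repeats)
```

The key point is that callers see one function and the backend choice happens inside, so the shared speculative-decoding code does not need device-specific branches.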

Benchmark Test

Command:
vllm serve Qwen/Qwen3-32B --dtype=bfloat16 --trust_remote_code --host 0.0.0.0 --port 8008 --max-model-len 20000 --speculative_config '{"model": "Qwen/Qwen3-1.7B", "num_speculative_tokens": 3, "method": "draft_model", "dtype": "bfloat16", "max_model_len": 20000}'

vllm bench serve --dataset-name hf --dataset-path philschmid/mt-bench --model Qwen/Qwen3-8B --host 0.0.0.0 --port 8008 --num-prompts 80 --max-concurrency <N> --temperature 0.0 --top-p 1.0

(run repeatedly with --max-concurrency values from 1 to 100)

Results - Qwen 3 metrics

  • Draft Model : Qwen/Qwen3-1.7B
  • Target Model : Qwen/Qwen3-32B
  • Hardware Used: AMD EPYC 9654 96-Core Processor
(two benchmark result charts omitted)

Performance Highlights:

  • Equivalent Time-To-First-Token (TTFT), except at high batch sizes.
  • Better Time-Per-Output-Token (TPOT) with speculative decoding than with autoregressive decoding, even at higher concurrencies.
  • Higher output and total token throughput across all concurrency levels with speculative decoding.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify Bot added the cpu (Related to CPU backends), speculative-decoding, and v1 labels Jan 20, 2026
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small, essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces CPU support for speculative decoding with draft models by implementing PyTorch fallbacks for Triton-specific kernels. The changes are well-structured, adding HAS_TRITON checks to conditionally execute either the optimized Triton kernels on CUDA devices or the new PyTorch-based CPU implementations. The CPU model runner is also appropriately updated to handle speculative decoding logic without relying on CUDA-specific features. I've identified a performance optimization opportunity in one of the new PyTorch fallback functions.

Comment thread vllm/v1/sample/rejection_sampler.py Outdated
@ganeshr10 ganeshr10 marked this pull request as ready for review January 22, 2026 04:40
@mergify
Contributor

mergify Bot commented Jan 22, 2026

Hi @ganeshr10, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Comment thread vllm/v1/spec_decode/utils.py Outdated
Comment thread vllm/v1/worker/cpu_model_runner.py Outdated
Comment thread vllm/v1/worker/cpu_model_runner.py Outdated
@mergify
Contributor

mergify Bot commented Jan 22, 2026

(Same pre-commit failure notice and instructions as above.)

@ganeshr10 ganeshr10 force-pushed the cpu_support_spec_decode branch from 3111ae8 to 5a5248f Compare January 22, 2026 11:02
@mergify
Contributor

mergify Bot commented Jan 22, 2026

(Same pre-commit failure notice and instructions as above.)

Collaborator

@benchislett benchislett left a comment


I am in the process of reworking the way we do merge_toks_kernel into a new kernel that can handle parallel drafting. Stay tuned for updates, but in the meantime (a few days at most) there's not much value in maintaining this.

@ganeshr10
Contributor Author

@benchislett I tested your #32887 (Unified Parallel Drafting) on CPU.

I updated my code locally to add CPU fallbacks for the new kernels:

  • Removed merge_toks_kernel (no longer needed with parallel draft)
  • Added PyTorch implementation for copy_and_expand_eagle_inputs_kernel
  • Added PyTorch implementations for eagle_prepare_inputs_padded_kernel and eagle_prepare_next_token_padded_kernel

Test Setup:

  • Draft Model: amd/PARD-Llama-3.2-1B
  • Target Model: unsloth/Meta-Llama-3.1-8B-Instruct
  • Speculative tokens: 8
  • Temperature: 0

I obtained the same acceptance length (3.56) as mentioned in your benchmark comment.
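Since temperature 0 was used, the verification step in rejection_sampler.py reduces to greedy acceptance, which is straightforward to express in plain PyTorch on CPU. The sketch below is illustrative only: the function name, single-sequence shapes, and the simplified bonus-token handling are assumptions, not vLLM's actual API.

```python
import torch

def greedy_rejection_sample(draft_token_ids: torch.Tensor,
                            target_logits: torch.Tensor) -> torch.Tensor:
    """Greedy verification for one sequence (illustrative sketch).

    draft_token_ids: [num_draft] proposed tokens from the draft model.
    target_logits:   [num_draft + 1, vocab] target-model logits.
    Accepts the longest prefix of draft tokens matching the target argmax,
    then appends the target's token at the first divergence.
    """
    target_ids = target_logits.argmax(dim=-1)       # [num_draft + 1]
    matches = draft_token_ids == target_ids[:-1]    # elementwise compare
    # Length of the all-True prefix = number of accepted draft tokens.
    num_accepted = int(torch.cumprod(matches.int(), dim=0).sum())
    # Bonus token: the target's choice right after the accepted prefix.
    return torch.cat([draft_token_ids[:num_accepted],
                      target_ids[num_accepted:num_accepted + 1]])
```

With all draft tokens accepted, this yields num_draft + 1 output tokens per step, which is where the TPOT gains reported above come from.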

@ganeshr10 ganeshr10 force-pushed the cpu_support_spec_decode branch from 5a5248f to 800cb0e Compare February 10, 2026 05:26
@mergify
Contributor

mergify Bot commented Feb 10, 2026

(Same pre-commit failure notice and instructions as above.)

@ganeshr10 ganeshr10 force-pushed the cpu_support_spec_decode branch from 800cb0e to e5fc9e0 Compare February 10, 2026 09:56
@mergify
Contributor

mergify Bot commented Feb 10, 2026

(Same pre-commit failure notice and instructions as above.)

@ganeshr10 ganeshr10 force-pushed the cpu_support_spec_decode branch 2 times, most recently from c3f1b62 to 8cef624 Compare February 10, 2026 10:06
@mergify
Contributor

mergify Bot commented Feb 10, 2026

(Same pre-commit failure notice and instructions as above.)

@ganeshr10 ganeshr10 force-pushed the cpu_support_spec_decode branch from 8cef624 to fc1cad0 Compare February 10, 2026 10:32
Comment thread vllm/v1/sample/rejection_sampler.py Outdated
Comment thread vllm/v1/sample/rejection_sampler.py Outdated
is_greedy,
max_spec_len,
)
if HAS_TRITON and device.type == "cuda":
Collaborator

Suggested change
if HAS_TRITON and device.type == "cuda":
if HAS_TRITON:

Collaborator

Please remove device.type == "cuda":

Contributor Author

@ganeshr10 ganeshr10 Mar 3, 2026


Removed.

Comment thread vllm/v1/sample/rejection_sampler.py Outdated
vocab_size,
NO_DRAFT_PROBS=draft_probs is None,
)
if HAS_TRITON and device.type == "cuda":
Collaborator


Suggested change
if HAS_TRITON and device.type == "cuda":
if HAS_TRITON:

Contributor Author

@ganeshr10 ganeshr10 Mar 3, 2026


Removed.

@ganeshr10 ganeshr10 force-pushed the cpu_support_spec_decode branch from a3c77c3 to 5946f6f Compare April 1, 2026 05:21
@mergify mergify Bot added ci/build and removed needs-rebase labels Apr 1, 2026
@ganeshr10
Copy link
Copy Markdown
Contributor Author

ganeshr10 commented Apr 1, 2026

@bigPYJ1151 Refactored the code to follow the pattern mentioned in #37987

Member

@bigPYJ1151 bigPYJ1151 left a comment


Thanks @ganeshr10 ! It looks good now. Just have some nits, please check :)

Comment thread vllm/v1/spec_decode/eagle.py Outdated
Comment thread vllm/v1/spec_decode/eagle.py
Comment thread vllm/v1/worker/cpu_model_runner.py Outdated
Comment thread vllm/v1/worker/cpu_model_runner.py Outdated
Comment thread csrc/cpu/spec_decode_utils.cpp Outdated
Comment on lines +1 to +2
#include <torch/extension.h>
#include <omp.h>
Member


There are some compilation errors in the build image with these headers. Please use cpu_types.hpp, which already includes the required headers.

@bigPYJ1151 bigPYJ1151 added the verified (Run pre-commit for new contributors without triggering other tests) label Apr 8, 2026
@mergify
Contributor

mergify Bot commented Apr 8, 2026

Hi @ganeshr10, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@bigPYJ1151
Member

@ganeshr10 looks like there are some format issues, please check :)

@bigPYJ1151 bigPYJ1151 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Apr 8, 2026
@bigPYJ1151 bigPYJ1151 requested a review from benchislett April 8, 2026 13:52
…uniform parallel drafting

Signed-off-by: R <Ganesh.R@amd.com>
Change-Id: I12fd564ddb73a5a6008f21e9161e52f728d45353
- Use centralized HAS_TRITON from vllm.triton_utils.importing
- Remove redundant device.type == "cuda" checks
- Refactor PyTorch fallbacks to use tensor operations instead of for-loops

Signed-off-by: R <Ganesh.R@amd.com>
Change-Id: Ia767bb908b60bde35038c241867f077b08b1ae9a
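The loop-to-tensor-ops refactor mentioned in this commit can be illustrated with a common piece of spec-decode bookkeeping: building cumulative token offsets per request. Both function names below are hypothetical stand-ins, not vLLM's actual helpers.

```python
import torch

def cu_num_tokens_loop(num_tokens: torch.Tensor) -> torch.Tensor:
    # Loop form: one Python iteration per request (slow for large batches).
    out = torch.zeros(num_tokens.numel() + 1, dtype=torch.long)
    for i, n in enumerate(num_tokens.tolist()):
        out[i + 1] = out[i] + n
    return out

def cu_num_tokens_vectorized(num_tokens: torch.Tensor) -> torch.Tensor:
    # Tensor-op form: a single cumsum, no Python-level loop.
    out = torch.zeros(num_tokens.numel() + 1, dtype=torch.long)
    out[1:] = torch.cumsum(num_tokens, dim=0)
    return out
```

The two forms produce identical results; the vectorized one keeps the work inside PyTorch's C++ kernels, which matters for CPU fallbacks that cannot rely on Triton for speed.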
…etadata

- Add eagle_step_update_slot_mapping_and_metadata_pytorch fallback

Signed-off-by: R <Ganesh.R@amd.com>
Change-Id: I131801d1fda35b990eee1e9b9b228ca54bd56e17
- Move all PyTorch fallback implementations to dedicated file
- Update imports in eagle.py, utils.py, and rejection_sampler.py
- Addresses review comment to separate CPU fallback code

Signed-off-by: R <Ganesh.R@amd.com>

Change-Id: If7197381462b2b39958faab644f23cc42bfa9a5a
- Add C++ implementations with OpenMP for all 8 spec decode kernels
  in csrc/cpu/spec_decode_utils.cpp
- Monkey-patch kernels in CPUModelRunner._postprocess_triton()
- Follows pattern from PR vllm-project#37987 as suggested by @bigPYJ1151

Signed-off-by: R <Ganesh.R@amd.com>
Change-Id: Ia1794e9f04447f23d104a623906cda4ce098468b
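The monkey-patch pattern this commit describes, where the CPU model runner swaps Triton kernel entry points for C++/PyTorch equivalents at init time so shared spec-decode code stays unchanged, can be sketched as follows. Module and function names are stand-ins, not vLLM's actual layout.

```python
import types

# Stand-in for the module that normally exposes the Triton kernels
# (the real vLLM module layout differs; this is illustrative only).
sd_utils = types.SimpleNamespace()

def triton_prepare_inputs(x):
    # Would launch a Triton kernel; unavailable on CPU-only builds.
    raise RuntimeError("Triton not available on CPU")

def cpu_prepare_inputs(x):
    # Trivial CPU stand-in for the real C++/PyTorch implementation.
    return [v + 1 for v in x]

sd_utils.prepare_inputs = triton_prepare_inputs

def postprocess_triton():
    """Swap the Triton entry point for the CPU implementation in place,
    mirroring the role of CPUModelRunner._postprocess_triton() in this PR."""
    sd_utils.prepare_inputs = cpu_prepare_inputs

postprocess_triton()
```

Because callers resolve `sd_utils.prepare_inputs` at call time, the swap is transparent and the Triton code path needs no modification, which is what addresses the earlier maintenance concern.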
Signed-off-by: R <Ganesh.R@amd.com>
Signed-off-by: R <Ganesh.R@amd.com>

Change-Id: I965aac3d579660c5e8b6ee949201654c1fa8ac9c
Signed-off-by: R <Ganesh.R@amd.com>
Change-Id: Ib4918c756fd4d46d52cf24b905a453cea1e2eb63
@mergify
Contributor

mergify Bot commented Apr 9, 2026

Deprecation notice: This pull request comes from a fork and was rebased using bot_account impersonation. This capability will be removed on July 1, 2026. After this date, the rebase action will no longer be able to rebase fork pull requests with this configuration. Please switch to the update action/command to ensure compatibility going forward.

@ganeshr10 ganeshr10 force-pushed the cpu_support_spec_decode branch from bfc60e3 to 406c20f Compare April 9, 2026 06:25
@bigPYJ1151 bigPYJ1151 dismissed benchislett’s stale review April 10, 2026 03:45

After the refactor, this PR implements SD on CPU via a plugin pattern without heavy changes to the Triton implementation, and it will not increase the maintenance effort of the Triton SD path. The concern should be resolved.

@bigPYJ1151
Copy link
Copy Markdown
Member

Hi @benchislett I think the new implementation has resolved your concern. I would move the PR forward. Please let me know if you have further thoughts, thanks! :)

@bigPYJ1151 bigPYJ1151 merged commit 445a2a4 into vllm-project:main Apr 10, 2026
65 checks passed
wojciech-wais pushed a commit to wojciech-wais/vllm that referenced this pull request Apr 13, 2026
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

Labels

ci/build · cpu (Related to CPU backends) · ready (ONLY add when PR is ready to merge/full CI is needed) · speculative-decoding · v1 · verified (Run pre-commit for new contributors without triggering other tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants