[Reasoning] Support for speculative decoding with thinking budget by rishitdholakia13 · Pull Request #34668 · vllm-project/vllm

rishitdholakia13 · 2026-02-17T03:32:39Z

This PR adds support for using the thinking budget together with speculative decoding. The feature is based on #20859.

The PR changes ThinkingBudgetLogitsProcess, the Rejection Sampler, and the logitsprocessor interface in order to support speculative decoding.

This PR is compatible with the following:

Speculative decoding
Non speculative decoding
Structured output + Speculative decoding
Async
Non Async

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>

mergify · 2026-02-26T03:23:47Z

Hi @rishitdholakia13, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

hmellor · 2026-02-26T14:26:53Z

@rishitdholakia13 am I right in saying that this PR depends on #20859 rather than attempts to replace it?

rishitdholakia13 · 2026-02-26T14:37:15Z

> @rishitdholakia13 am I right in saying that this PR depends on #20859 rather than attempts to replace it?

@hmellor this PR depends on #20859. Once that is merged, this PR can be merged on top of it.

mergify · 2026-02-26T17:45:56Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rishitdholakia13.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

keilove · 2026-03-06T07:49:12Z

Hello, may I ask why this project has been discontinued? Are there any plans to continue its development in the future? thx！

rishitdholakia13 · 2026-03-06T13:08:51Z

> Hello, may I ask why this project has been discontinued? Are there any plans to continue its development in the future? thx！

Hello @keilove, this has not been discontinued. We are waiting for a dependent PR to be merged before review of this PR can start.

njhill · 2026-03-12T20:24:29Z

I think that rather than trying to make the logit processor interface work with spec decoding, it would be better to change the thinking token budget handling to not be a logits processor. For example like how penalties are currently handled in the v1 case (these are compatible with spec decoding).

rishitdholakia13 · 2026-03-16T11:24:50Z

Thank you for the suggestion, @njhill. I have the following pseudocode for making this work without the logitsprocessor, handled the same way penalties are. If that is the case, can we get #20859 merged? It contains the changes required for the thinking budget interface and part of the thinking budget logic, and I am developing on top of it. In the meantime, can I get some suggestions and a review of the refactored design without the logitsprocessor?

Thinking Budget Refactor — Design Flow

Move thinking budget out of the logits processor into a ThinkingBudgetStateHolder. Sampler/RejectionSampler call update_state_and_apply after combining outputs with spec tokens (same pattern as penalties).

1. ThinkingBudgetStateHolder

| Method | When | What |
| --- | --- | --- |
| `sync_batch(batch_update)` | `refresh_metadata` | Add/remove/move state only. No `_update_think_state`. |
| `update_state_and_apply(...)` | Sampler / RejectionSampler | Feed token history → `_update_think_state` → apply to logits. The only place `_update_think_state` runs. |

def sync_batch(self, batch_update):
    # Bookkeeping only: add, remove, or move per-request state.
    for i in batch_update.removed:
        self._state.pop(i, None)
    for i, params, prompt, out in batch_update.added:
        if params.thinking_token_budget is not None:
            self._state[i] = self._init_state_entry(prompt, params.thinking_token_budget)
    for i1, i2, direction in batch_update.moved:
        ...

def update_state_and_apply(self, logits, output_token_ids, spec_token_ids, ...):
    # The only place where _update_think_state runs.
    for req_idx, state in self._state.items():
        state["output_tok_ids"] = output_token_ids[req_idx]
        self._update_think_state(state)
    # then: build mask/force from state, logits[active, force_ids] = 1e9
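
To make the split between the two methods concrete, here is a minimal, self-contained sketch that uses plain Python lists instead of tensors. Everything in it beyond the method names is an assumption made for illustration: the `BatchUpdate` dataclass, the single-token `think_start_id` / `think_end_id`, the dict layout of the state, the one-way handling of moves, and the rule "force the end token once the budget is used up". It is not the PR's implementation.

```python
from dataclasses import dataclass, field

FORCE_LOGIT = 1e9  # same large value the pseudocode writes into the logits


@dataclass
class BatchUpdate:
    """Hypothetical stand-in for the batch update passed to sync_batch."""
    removed: list = field(default_factory=list)  # request indices
    added: list = field(default_factory=list)    # (index, budget) pairs
    moved: list = field(default_factory=list)    # (src, dst) pairs


class ThinkingBudgetStateHolder:
    def __init__(self, think_start_id, think_end_id):
        self.think_start_id = think_start_id
        self.think_end_id = think_end_id
        self._state = {}

    def sync_batch(self, batch_update):
        # Bookkeeping only; thinking state is not recomputed here.
        for i in batch_update.removed:
            self._state.pop(i, None)
        for i, budget in batch_update.added:
            if budget is not None:
                self._state[i] = {"budget": budget, "force_end": False}
        for src, dst in batch_update.moved:
            if src in self._state:
                self._state[dst] = self._state.pop(src)

    def _update_think_state(self, state, token_ids):
        # Count the tokens generated since the last think-start marker.
        state["force_end"] = False
        if self.think_start_id not in token_ids:
            return
        start = len(token_ids) - 1 - token_ids[::-1].index(self.think_start_id)
        thinking = token_ids[start + 1:]
        if self.think_end_id in thinking:
            return  # thinking section is already closed
        state["force_end"] = len(thinking) >= state["budget"]

    def update_state_and_apply(self, logits, output_token_ids):
        # The only place where _update_think_state runs.
        for req_idx, state in self._state.items():
            self._update_think_state(state, output_token_ids[req_idx])
            if state["force_end"]:
                logits[req_idx][self.think_end_id] = FORCE_LOGIT
        return logits


holder = ThinkingBudgetStateHolder(think_start_id=1, think_end_id=2)
# Request 0 has a budget of 3 thinking tokens; request 1 has no budget.
holder.sync_batch(BatchUpdate(added=[(0, 3), (1, None)]))
logits = [[0.0] * 5 for _ in range(2)]
out = holder.update_state_and_apply(logits, [[1, 7, 8, 9], [1, 7, 8, 9]])
```

In this sketch only request 0 is tracked, so only its row gets the end-of-thinking token forced; request 1's logits are left untouched.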

2. Sampler

Combine the output tokens with the spec tokens when a thinking budget (or penalties) is active; call the holder after penalties have been applied.

if predict_bonus_token and (not no_penalties or not no_thinking_budget or bad_words):
    output_token_ids = self._combine_outputs_with_spec_tokens(output_token_ids, spec_token_ids)
logits = self.apply_penalties(...)
if not no_thinking_budget and thinking_budget_state_holder:
    thinking_budget_state_holder.update_state_and_apply(logits, output_token_ids, ..., predict_bonus_token=True)
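
One reading of the snippet above is that the budget has to be counted over the output tokens plus the draft tokens, since counting output tokens alone would undercount how much thinking has been done by the time the bonus token is predicted. The sketch below illustrates that difference. Both helpers are hypothetical stand-ins written for this example (the real `_combine_outputs_with_spec_tokens` may differ), and the token IDs are made up.

```python
def combine_outputs_with_spec_tokens(output_token_ids, spec_token_ids):
    """Hypothetical helper: append each request's draft tokens to its outputs."""
    return [
        list(out) + list(spec)
        for out, spec in zip(output_token_ids, spec_token_ids)
    ]


def thinking_tokens_used(token_ids, think_start_id):
    """Count tokens after the first think-start marker (0 if it is absent)."""
    if think_start_id not in token_ids:
        return 0
    return len(token_ids) - token_ids.index(think_start_id) - 1


# Token ID 1 plays the role of the think-start marker in this example.
output_token_ids = [[1, 7, 8], [1, 7]]
spec_token_ids = [[9, 10], []]  # request 1 has no draft tokens this step

combined = combine_outputs_with_spec_tokens(output_token_ids, spec_token_ids)
used_without_spec = [thinking_tokens_used(t, 1) for t in output_token_ids]
used_with_spec = [thinking_tokens_used(t, 1) for t in combined]
```

Here request 0 has used two thinking tokens if only accepted outputs are counted, but four once its two draft tokens are included.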

3. RejectionSampler

Same pattern: combine outputs with spec tokens (expanded list), build repeat_indices, then call update_state_and_apply with repeat_indices and num_draft_tokens.

output_token_ids = self._combine_outputs_with_spec_tokens(...)  # expanded
repeat_indices = torch.arange(num_requests).repeat_interleave(num_draft_tokens)
# ... penalties, logitsprocs ...
if not no_thinking_budget and thinking_budget_state_holder:
    thinking_budget_state_holder.update_state_and_apply(..., repeat_indices=repeat_indices, num_draft_tokens=metadata.num_draft_tokens)
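
The `repeat_indices` line maps every draft-token row back to the request it belongs to. The sketch below shows the same mapping in pure Python so it runs without torch; `build_repeat_indices` and `expand_per_request` are hypothetical names used only for this illustration, and the budgets are made-up values.

```python
def build_repeat_indices(num_draft_tokens):
    """Pure-Python analogue of torch.arange(n).repeat_interleave(counts)."""
    return [
        req_idx
        for req_idx, count in enumerate(num_draft_tokens)
        for _ in range(count)
    ]


def expand_per_request(values, num_draft_tokens):
    """Repeat a per-request value once for every draft-token row."""
    return [values[i] for i in build_repeat_indices(num_draft_tokens)]


# Three requests with 2, 0 and 3 draft tokens respectively.
num_draft_tokens = [2, 0, 3]
repeat_indices = build_repeat_indices(num_draft_tokens)
budgets_per_row = expand_per_request([16, 32, 64], num_draft_tokens)
```

A request with zero draft tokens contributes no rows, so its index never appears in `repeat_indices`.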

4. GPUInputBatch

Create holder when thinking enabled; sync_batch in refresh_metadata; pass holder + static params in metadata.

# __init__: self.thinking_budget_state_holder = ThinkingBudgetStateHolder(...) if thinking_enabled else None
def refresh_metadata(self):
    ...
    if self.thinking_budget_state_holder:
        self.thinking_budget_state_holder.sync_batch(batch_update)
# _make_sampling_metadata: pass no_thinking_budget, thinking_token_budgets, think_start/end_token_ids, thinking_budget_state_holder

5. Cleanup

Remove ThinkingTokenBudgetLogitsProcessor from builtin and from BUILTIN_LOGITS_PROCESSORS / spec-decode branch.

Flow summary

refresh_metadata  →  state_holder.sync_batch(add/remove/move)
_make_sampling_metadata  →  pass holder + static params
Sampler / RejectionSampler  →  _combine_outputs_with_spec_tokens  →  apply_penalties  →  state_holder.update_state_and_apply

llsj14 added 30 commits October 3, 2025 02:09

feat: limit thinking tokens

8ce4561

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>

remove comment

b815e9c

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>

update states only in update_state method

2001c36

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>

make precommit and lint

c71cf86

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>

support think start/end as token sequences

7ae0725

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>

refactor and change logic faster

03d3495

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

rename parameter and logit processor

5442d0c

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

add reasoning effort param

283a07a

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

remove constraint of the reasoning model

3780d55

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

update logit processor

7a509fb

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

pass ruff

a44e956

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

pass precommit

0272a72

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

fix format

79c7061

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

fix: loads none error

44f2acb

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

fix return type

47da378

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

fix error

11ac0ef

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

update ReasoningConfig handling

7fe7fe4

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

fix config and EngineArgs

336efe6

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

simplify reasoning config checks and fix errors

4b64abf

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

reafctor ThinkingTokenBudgetLogitsProcessor

ace7c4f

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

fix import error from rebase

43dd440

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

fix: remove duplicate reasoning_effort field in ChatCompletionRequest

9ee7f2f

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

fix runtime error after rebase

117ca92

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

check reasoning is enabled

60a275f

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

add test and implement processor with incremental token processing op…

f4afba9

…timization Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

remove connection between reasoning_effort and thinking_token_budget

9371120

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

fix: support corner cases

4b9b87d

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

cleanup unused parameters

93afdf0

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

optimize speed up performance while apply logit processor

24334b2

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

utilize logits processor when it is needed, not every step for speed up

0efea75

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>

Add bonust token changes

7f90045

rishitdholakia13 marked this pull request as ready for review February 26, 2026 03:16

rishitdholakia13 requested review from 22quinn, DarkLight1337, NickLucche, ProExpertProg, WoosukKwon, aarnphm, chaunceyjiang, hmellor, houseroad, mgoin, njhill, robertgshaw2-redhat, russellb, tlrmchlsmth, yewentao256 and youkaichao as code owners February 26, 2026 03:16

rishitdholakia13 changed the title ~~[Reasoning] [Draft][WIP] Support for speculative decoding with thinking budget~~ [Reasoning] Support for speculative decoding with thinking budget Feb 26, 2026

Update the end logic

921a050

mergify bot added the needs-rebase label Feb 26, 2026

llsj14 mentioned this pull request Mar 8, 2026

[Feature] limit thinking tokens (hard limit) #20859

Open

4 tasks

rishitdholakia13 wants to merge 67 commits into vllm-project:main from rishitdholakia13:rishitdholakia/thinking-budget-spec

6 participants
