[Reasoning] Support for speculative decoding with thinking budget#34668

Open
rishitdholakia13 wants to merge 67 commits into vllm-project:main from rishitdholakia13:rishitdholakia/thinking-budget-spec

@rishitdholakia13
Contributor

@rishitdholakia13 rishitdholakia13 commented Feb 17, 2026

This PR provides support and compatibility for thinking budget with speculative decoding. This feature is based on #20859.

The PR changes ThinkingBudgetLogitsProcess, the Rejection Sampler, and the logitsprocessor interface in order to support speculative decoding.

This PR is compatible with the following:

  • Speculative decoding
  • Non-speculative decoding
  • Structured output + speculative decoding
  • Async
  • Non-async

llsj14 added 30 commits October 3, 2025 02:09
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>
@rishitdholakia13 rishitdholakia13 marked this pull request as ready for review February 26, 2026 03:16
@rishitdholakia13 rishitdholakia13 changed the title from "[Reasoning] [Draft][WIP] Support for speculative decoding with thinking budget" to "[Reasoning] Support for speculative decoding with thinking budget" Feb 26, 2026
@mergify

mergify bot commented Feb 26, 2026

Hi @rishitdholakia13, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@hmellor
Member

hmellor commented Feb 26, 2026

@rishitdholakia13 am I right in saying that this PR depends on #20859 rather than attempts to replace it?

@rishitdholakia13
Contributor Author

> @rishitdholakia13 am I right in saying that this PR depends on #20859 rather than attempts to replace it?

@hmellor this PR depends on #20859. Once that is merged, this PR can be merged on top of it.

@mergify

mergify bot commented Feb 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rishitdholakia13.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 26, 2026
@keilove

keilove commented Mar 6, 2026

Hello, may I ask why this project has been discontinued? Are there any plans to continue its development in the future? thx!

@rishitdholakia13
Contributor Author

rishitdholakia13 commented Mar 6, 2026

> Hello, may I ask why this project has been discontinued? Are there any plans to continue its development in the future? thx!

Hello @keilove, this has not been discontinued. We are waiting for a dependent PR to be merged before starting a review of this PR.

@njhill
Member

njhill commented Mar 12, 2026

I think that rather than trying to make the logit processor interface work with spec decoding, it would be better to change the thinking token budget handling to not be a logits processor. For example like how penalties are currently handled in the v1 case (these are compatible with spec decoding).

@rishitdholakia13
Contributor Author

rishitdholakia13 commented Mar 16, 2026

Thank you for the suggestion, @njhill. I have the following pseudocode for making this work without the logits processor, similar to the way penalties are handled. If that approach is acceptable, can we get #20859 merged? It contains the changes required for the thinking budget interface and some of the thinking budget logic, and I am developing on top of it. In the meantime, can I get suggestions and a review of the refactored design without the logits processor?

Thinking Budget Refactor — Design Flow

Move thinking budget out of the logits processor into a ThinkingBudgetStateHolder. Sampler/RejectionSampler call update_state_and_apply after combining outputs with spec tokens (same pattern as penalties).


1. ThinkingBudgetStateHolder

| Method | When | What |
| --- | --- | --- |
| sync_batch(batch_update) | refresh_metadata | Add/remove/move state only. No _update_think_state. |
| update_state_and_apply(...) | Sampler / RejectionSampler | Feed token history → _update_think_state → apply to logits. Only place _update_think_state runs. |

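def sync_batch(self, batch_update):
    # Drop state for requests that left the batch.
    for i in batch_update.removed:
        self._state.pop(i, None)
    # Create state only for requests that set a thinking budget.
    for i, params, prompt, out in batch_update.added:
        if params.thinking_token_budget is not None:
            self._state[i] = self._init_state_entry(prompt, params.thinking_token_budget)
    # Re-index state for requests that moved within the batch.
    for i1, i2, direction in batch_update.moved: ...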

def update_state_and_apply(self, logits, output_token_ids, spec_token_ids, ...):
    for req_idx, state in self._state.items():
        state["output_tok_ids"] = output_token_ids[req_idx]
        self._update_think_state(state)
    # then: build mask/force from state, logits[active, force_ids] = 1e9
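
The holder described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the PR's implementation: the token ids `THINK_START` and `THINK_END`, the `ThinkState` fields, the `add`/`remove` helpers, and the use of nested lists in place of a logits tensor are all hypothetical. The sketch recounts thinking tokens from the full token history on every call, so tokens rejected by speculative decoding cannot leave stale counts behind.

```python
from dataclasses import dataclass

# Hypothetical token ids for the think-start / think-end markers.
THINK_START, THINK_END = 1, 2


@dataclass
class ThinkState:
    budget: int              # maximum number of thinking tokens allowed
    in_think: bool = False   # currently inside a thinking section?
    think_count: int = 0     # thinking tokens seen in the current section


class ThinkingBudgetStateHolder:
    """Illustrative stand-in for the holder in the design above."""

    def __init__(self):
        self._state = {}  # request index -> ThinkState

    def add(self, req_idx, budget):
        self._state[req_idx] = ThinkState(budget)

    def remove(self, req_idx):
        self._state.pop(req_idx, None)

    def _update_think_state(self, state, token_ids):
        # Recount from scratch so the state always matches the history
        # that was actually passed in.
        state.in_think, state.think_count = False, 0
        for tok in token_ids:
            if tok == THINK_START:
                state.in_think, state.think_count = True, 0
            elif tok == THINK_END:
                state.in_think = False
            elif state.in_think:
                state.think_count += 1

    def update_state_and_apply(self, logits, output_token_ids):
        for req_idx, state in self._state.items():
            self._update_think_state(state, output_token_ids[req_idx])
            if state.in_think and state.think_count >= state.budget:
                # Budget exhausted: force the think-end token.
                logits[req_idx][THINK_END] = 1e9
        return logits


holder = ThinkingBudgetStateHolder()
holder.add(0, budget=2)
holder.add(1, budget=5)
logits = [[0.0] * 5, [0.0] * 5]
history = [[THINK_START, 7, 8], [THINK_START, 7]]
logits = holder.update_state_and_apply(logits, history)
```

In this example request 0 has used its two-token budget, so its think-end logit is raised, while request 1 is left untouched.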

2. Sampler

Combine the output tokens with the spec tokens when the thinking budget, penalties, or bad words are active; after applying penalties, call the holder.

if predict_bonus_token and (not no_penalties or not no_thinking_budget or bad_words):
    output_token_ids = self._combine_outputs_with_spec_tokens(output_token_ids, spec_token_ids)
logits = self.apply_penalties(...)
if not no_thinking_budget and thinking_budget_state_holder:
    thinking_budget_state_holder.update_state_and_apply(logits, output_token_ids ..., predict_bonus_token=True)
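
To illustrate why the outputs are combined with the spec tokens before the holder runs, here is a small sketch. It assumes the combine step simply appends each request's draft tokens to its output history; the helper names, token ids, and the `budget_exhausted` check are hypothetical and are not taken from vLLM.

```python
# Hypothetical token ids for the think-start / think-end markers.
THINK_START, THINK_END = 1, 2


def combine_outputs_with_spec_tokens(output_token_ids, spec_token_ids):
    """Append each request's draft tokens to its output history (assumed behavior)."""
    return [list(out) + list(spec)
            for out, spec in zip(output_token_ids, spec_token_ids)]


def budget_exhausted(token_ids, budget):
    """True if the request is still thinking and has used up its budget."""
    in_think, used = False, 0
    for tok in token_ids:
        if tok == THINK_START:
            in_think, used = True, 0
        elif tok == THINK_END:
            in_think = False
        elif in_think:
            used += 1
    return in_think and used >= budget


output_token_ids = [[THINK_START, 10, 11]]  # two thinking tokens so far
spec_token_ids = [[12]]                     # one draft token pending
budget = 3

before = budget_exhausted(output_token_ids[0], budget)
combined = combine_outputs_with_spec_tokens(output_token_ids, spec_token_ids)
after = budget_exhausted(combined[0], budget)
```

Without the draft token the request appears to have budget left; once the draft token is counted, the bonus-token position is the one where think-end would have to be forced.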

3. RejectionSampler

Same as the Sampler: combine into the expanded list, build repeat_indices, then call update_state_and_apply with repeat_indices / num_draft_tokens.

output_token_ids = self._combine_outputs_with_spec_tokens(...)  # expanded
repeat_indices = torch.arange(num_requests).repeat_interleave(num_draft_tokens)
# ... penalties, logitsprocs ...
if not no_thinking_budget and thinking_budget_state_holder:
    thinking_budget_state_holder.update_state_and_apply(..., repeat_indices=repeat_indices, num_draft_tokens=metadata.num_draft_tokens)
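
The `repeat_indices` mapping can be illustrated without tensors. The sketch below is a pure-Python equivalent of the `repeat_interleave` line above, assuming `num_draft_tokens` is a per-request list of draft counts; the helper names and the budget values are hypothetical. Note that `torch.repeat_interleave` takes an int or a tensor for the repeat counts, so a Python list would presumably need converting first.

```python
def build_repeat_indices(num_draft_tokens):
    """Map each row of the expanded (per-draft-token) batch to its request index."""
    return [req for req, n in enumerate(num_draft_tokens) for _ in range(n)]


def expand_per_request(values, repeat_indices):
    """Gather a per-request value once for every expanded row."""
    return [values[req] for req in repeat_indices]


num_draft_tokens = [2, 0, 3]        # request 1 has no draft tokens
thinking_budgets = [16, 32, 64]     # hypothetical per-request budgets

repeat_indices = build_repeat_indices(num_draft_tokens)
row_budgets = expand_per_request(thinking_budgets, repeat_indices)
```

Requests with zero draft tokens contribute no rows, which is why the holder needs the index mapping rather than assuming one row per request.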

4. GPUInputBatch

Create the holder when thinking is enabled; call sync_batch in refresh_metadata; pass the holder and the static params in the sampling metadata.

# __init__: self.thinking_budget_state_holder = ThinkingBudgetStateHolder(...) if thinking_enabled else None
def refresh_metadata(self):
    ...
    if self.thinking_budget_state_holder:
        self.thinking_budget_state_holder.sync_batch(batch_update)
# _make_sampling_metadata: pass no_thinking_budget, thinking_token_budgets, think_start/end_token_ids, thinking_budget_state_holder

5. Cleanup

Remove ThinkingTokenBudgetLogitsProcessor from builtin and from BUILTIN_LOGITS_PROCESSORS / spec-decode branch.


Flow summary

refresh_metadata  →  state_holder.sync_batch(add/remove/move)
_make_sampling_metadata  →  pass holder + static params
Sampler / RejectionSampler  →  _combine_outputs_with_spec_tokens  →  apply_penalties  →  state_holder.update_state_and_apply

6 participants