[Reasoning] Support for speculative decoding with thinking budget #34668
rishitdholakia13 wants to merge 67 commits into vllm-project:main
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>
…timization Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>
Hi @rishitdholakia13, the pre-commit checks have failed. Please run:
    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files
Then commit the changes and push to your branch.
@rishitdholakia13 am I right in saying that this PR depends on #20859 rather than attempting to replace it?
@hmellor this PR depends on #20859.
This pull request has merge conflicts that must be resolved before it can be merged.
Hello, may I ask why this project has been discontinued? Are there any plans to continue its development in the future? Thanks!
Hello @keilove, this has not been discontinued. We are waiting for a dependent PR to be merged before starting the review of this PR.
I think that rather than trying to make the logits processor interface work with spec decoding, it would be better to change the thinking token budget handling so that it is not a logits processor. For example, it could work like penalties are currently handled in the v1 case (these are compatible with spec decoding).
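To make the suggestion concrete, here is a rough sketch of what "handled like penalties" could look like: a plain function that the sampler calls with each row's token history, instead of a stateful logits processor. Everything in it is an assumption for illustration (the function names, the token ids, and the use of plain Python lists in place of tensors); it is not vLLM's API.

```python
NEG_INF = float("-inf")


def thinking_tokens_used(token_ids, think_start_id, think_end_id):
    """Count tokens inside a still-open thinking span; None if not thinking."""
    count = None
    for tok in token_ids:
        if tok == think_start_id:
            count = 0          # a new thinking span opens
        elif tok == think_end_id:
            count = None       # the span is closed
        elif count is not None:
            count += 1         # ordinary token inside the span
    return count


def apply_thinking_budget(logits, output_token_ids, budgets,
                          think_start_id, think_end_id):
    """Force think_end_id on rows whose budget is exhausted (edits logits in place)."""
    for row, history, budget in zip(logits, output_token_ids, budgets):
        if budget is None:
            continue  # this request has no thinking budget
        used = thinking_tokens_used(history, think_start_id, think_end_id)
        if used is not None and used >= budget:
            for tok in range(len(row)):
                row[tok] = NEG_INF
            row[think_end_id] = 0.0  # only the end-of-thinking token remains
    return logits


# Hypothetical ids: 1 opens thinking, 2 closes it; vocabulary of 4 tokens.
logits = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]]
histories = [[1, 3, 3], [1, 3]]
apply_thinking_budget(logits, histories, budgets=[2, 2],
                      think_start_id=1, think_end_id=2)
```

Because the decision is derived from the token history passed in, the same function could in principle be called with a history that already includes draft tokens, which is the property that makes the penalty-style path attractive for spec decoding.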
Thank you for the suggestion, @njhill. I have the following pseudocode that I can use to get this working without the logits processor, similar to the way penalties are handled. If that is the case, can we get #20859 merged? It contains the changes required for the thinking budget interface and some of the thinking budget logic, and I am developing on top of it. In the meantime, can I get suggestions and a review of the refactored design without the logits processor?

Thinking Budget Refactor: Design Flow

Move thinking budget out of the logits processor into a ThinkingBudgetStateHolder. Sampler/RejectionSampler call update_state_and_apply after combining outputs with spec tokens (same pattern as penalties).

1. ThinkingBudgetStateHolder

    def sync_batch(self, batch_update):
        for i in batch_update.removed:
            self._state.pop(i, None)
        for i, params, prompt, out in batch_update.added:
            if params.thinking_token_budget is not None:
                self._state[i] = self._init_state_entry(prompt, params.thinking_token_budget)
        for i1, i2, direction in batch_update.moved:
            ...

    def update_state_and_apply(self, logits, output_token_ids, spec_token_ids, ...):
        for req_idx, state in self._state.items():
            state["output_tok_ids"] = output_token_ids[req_idx]
            self._update_think_state(state)
        # then: build mask/force from state, logits[active, force_ids] = 1e9

2. Sampler

Combine with spec when thinking (or penalties); after penalties, call holder.

    if predict_bonus_token and (not no_penalties or not no_thinking_budget or bad_words):
        output_token_ids = self._combine_outputs_with_spec_tokens(output_token_ids, spec_token_ids)
    logits = self.apply_penalties(...)
    if not no_thinking_budget and thinking_budget_state_holder:
        thinking_budget_state_holder.update_state_and_apply(logits, output_token_ids ..., predict_bonus_token=True)

3. RejectionSampler

Same: combine (expanded list), build repeat_indices.

    output_token_ids = self._combine_outputs_with_spec_tokens(...)  # expanded
    repeat_indices = torch.arange(num_requests).repeat_interleave(num_draft_tokens)
    # ... penalties, logitsprocs ...
    if not no_thinking_budget and thinking_budget_state_holder:
        thinking_budget_state_holder.update_state_and_apply(..., repeat_indices=repeat_indices, num_draft_tokens=metadata.num_draft_tokens)

4. GPUInputBatch

Create holder when thinking enabled; sync_batch in refresh_metadata; pass holder + static params in metadata.

    # __init__: self.thinking_budget_state_holder = ThinkingBudgetStateHolder(...) if thinking_enabled else None
    def refresh_metadata(self):
        ...
        if self.thinking_budget_state_holder:
            self.thinking_budget_state_holder.sync_batch(batch_update)
    # _make_sampling_metadata: pass no_thinking_budget, thinking_token_budgets, think_start/end_token_ids, thinking_budget_state_holder

5. Cleanup

Remove

Flow summary
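The holder and the expanded-row call described above can be sketched in runnable form as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: it uses plain Python lists instead of tensors, takes removed/added/moved as separate arguments rather than a batch_update object, treats a move as one-way (source to destination), and recomputes the thinking state from the full history on each call instead of updating it incrementally. The helper expand_histories and the ThinkState record are hypothetical names.

```python
from dataclasses import dataclass

FORCE_LOGIT = 1e9  # large positive logit so the forced token dominates


@dataclass
class ThinkState:
    budget: int
    in_think: bool = False
    think_count: int = 0


class ThinkingBudgetStateHolder:
    def __init__(self, think_start_id, think_end_id):
        self.think_start_id = think_start_id
        self.think_end_id = think_end_id
        self._state = {}  # request index -> ThinkState

    def sync_batch(self, removed, added, moved):
        for i in removed:
            self._state.pop(i, None)
        for i, budget in added:
            if budget is not None:
                self._state[i] = ThinkState(budget)
        for src, dst in moved:  # assumption: one-way move, src -> dst
            entry = self._state.pop(src, None)
            if entry is None:
                self._state.pop(dst, None)
            else:
                self._state[dst] = entry

    def _scan(self, state, token_ids):
        # Recompute from scratch so rejected draft tokens never linger in state.
        in_think, count = False, 0
        for tok in token_ids:
            if tok == self.think_start_id:
                in_think, count = True, 0
            elif tok == self.think_end_id:
                in_think = False
            elif in_think:
                count += 1
        state.in_think, state.think_count = in_think, count

    def update_state_and_apply(self, logits, output_token_ids, repeat_indices=None):
        if repeat_indices is None:  # no spec decoding: row i belongs to request i
            repeat_indices = range(len(logits))
        for row, req_idx in enumerate(repeat_indices):
            state = self._state.get(req_idx)
            if state is None:
                continue
            self._scan(state, output_token_ids[row])
            if state.in_think and state.think_count >= state.budget:
                logits[row][self.think_end_id] = FORCE_LOGIT
        return logits


def expand_histories(outputs, drafts):
    """Hypothetical helper: row k of a request sees its first k draft tokens."""
    histories, repeat_indices = [], []
    for req_idx, (out, draft) in enumerate(zip(outputs, drafts)):
        for k in range(len(draft)):
            histories.append(list(out) + list(draft[:k]))
            repeat_indices.append(req_idx)
    return histories, repeat_indices


# Hypothetical ids: 1 opens thinking, 2 closes it; request 1 has no budget.
holder = ThinkingBudgetStateHolder(think_start_id=1, think_end_id=2)
holder.sync_batch(removed=[], added=[(0, 2), (1, None)], moved=[])
histories, repeat_indices = expand_histories([[1, 5], [1, 5, 5, 5]], [[6, 7], [8]])
logits = [[0.0] * 10 for _ in histories]
holder.update_state_and_apply(logits, histories, repeat_indices)
```

In this example only the second draft position of request 0 reaches its budget of two thinking tokens, so only that row has the end-of-thinking logit raised. Recomputing from history costs time proportional to the history length on every step; an incremental update, as the design flow implies, would avoid that at the price of having to roll back state for rejected drafts.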
This PR provides support for, and compatibility between, the thinking budget and speculative decoding. This feature is based on #20859.
The PR changes ThinkingBudgetLogitsProcess, the Rejection Sampler, and the logitsprocessor interface in order to support speculative decoding.
This PR provides the following compatibilities: