[Feature] Add reasoning_budget to cap thinking tokens via existing reasoning parser#37112
[Feature] Add reasoning_budget to cap thinking tokens via existing reasoning parser#37112abhinand5 wants to merge 5 commits intovllm-project:mainfrom
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
There was a problem hiding this comment.
Code Review
This pull request introduces a reasoning_budget to cap the number of tokens generated within <think>...</think> blocks. The implementation is well-integrated with the existing reasoning parser infrastructure, avoiding new configuration flags as requested in feedback on a previous PR. A new ReasoningBudgetLogitsProcessor is added to handle the token counting and injection of a stop message when the budget is exceeded. The overall approach is sound and the changes to the API surface in ChatCompletionRequest and SamplingParams are appropriate. My main concern is the use of a broad, silent exception handler in the ReasoningBudgetLogitsProcessor's constructor, which could hide configuration or initialization errors.
7eca019 to
d0169e4
Compare
…and sampling parameters Co-authored-by: Claude Signed-off-by: Abhinand B <abhinand5899@gmail.com>
Co-authored-by: Claude Signed-off-by: Abhinand B <abhinand5899@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Abhinand B <abhinand5899@gmail.com>
Signed-off-by: Abhinand B <abhinand5899@gmail.com>
Purpose
Adds
reasoning_budgetandreasoning_budget_messageparameters to cap reasoning tokens inside<think>...</think>markers for reasoning models (DeepSeek-R1, Qwen3, etc.).Related issues: #34827, #17887, #29791, #15418
Related PR: #20859
Why another PR?
PR #20859 has been open since July 2025 and the main blocker from maintainer @njhill is architectural: it introduces a separate
ReasoningConfigwith--reasoning-configCLI arg requiring users to manually specifythink_start_str/think_end_str. njhill's feedback (Mar 12 review):This PR takes that feedback as a design constraint. It reuses the existing
--reasoning-parserinfrastructure...no new CLI args, no new config classes. The logits processor obtains start/end token IDs directly fromReasoningParserManager, so it works automatically with any registered reasoning parser (deepseek_r1, qwen3, etc.).Design
New builtin
ReasoningBudgetLogitsProcessorfollowing theMinTokensLogitsProcessorpattern:--reasoning-parserviaReasoningParserManager. No parser configured -> processor is a no-op.<think>in prompt), counts reasoning tokens, and switches to injection mode when budget is exceeded."Reasoning budget exceeded, need to answer.") followed by the</think>end token by setting logits to-inffor all tokens except the forced one.</think>appears before the budget is hit, the request is removed from tracking...zero overhead.API surface
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.