[Feature] limit thinking tokens (hard limit) #20859
llsj14 wants to merge 95 commits into vllm-project:main from
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @llsj14, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a crucial feature to manage and limit the length of 'thinking' or 'reasoning' phases in large language models that employ explicit reasoning tokens. By allowing users to set a max_think_tokens budget, the system can prevent uncontrolled long reasoning loops, ensuring more predictable and efficient model behavior. The core of this feature is a new logits processor that monitors token generation within designated thinking sections and intervenes to terminate them if the specified limit is exceeded.
Highlights
- New `max_think_tokens` parameter: Introduced a `max_think_tokens` parameter in `SamplingParams` and exposed it via the OpenAI protocol's `ChatCompletionRequest`. This allows users to specify a maximum token limit for the 'thinking' phase of models that utilize explicit reasoning tokens.
- `ReasoningConfig` and dynamic token ID management: Added a new `ReasoningConfig` class to `vllm/config.py` to encapsulate `think_start_token_id` and `think_end_token_id`. These IDs are now dynamically populated in `GpuModelRunner` based on the configured reasoning backend (e.g., DeepSeek R1), ensuring the system correctly identifies and manages reasoning sections.
- `MaxThinkTokensLogitsProcessor` implementation: Implemented a new `MaxThinkTokensLogitsProcessor` in `vllm/v1/sample/logits_processor.py`. This processor actively monitors the number of tokens generated within a thinking section. If the `max_think_tokens` limit is reached, it modifies the logits to forcibly generate the `think_end_token_id`, effectively terminating the reasoning loop.
- Enhanced state tracking for logits processors: Modified the `AddedRequest` tuple in `vllm/v1/sample/logits_processor.py` and `vllm/v1/worker/gpu_input_batch.py` to include `prompt_tok_ids`. This provides logits processors, especially the new `MaxThinkTokensLogitsProcessor`, with more complete context for tracking token counts from the beginning of a request's generation.
- Integration across the stack: The new `max_think_tokens` parameter and the `ReasoningConfig` are integrated throughout the system, from API request parsing to `SamplingParams`, `GpuInputBatch`, and finally into the `LogitsProcessorManager`, to ensure the thinking token limit is enforced during token generation.
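The mechanism described above can be sketched roughly as follows. Class and method names here are illustrative assumptions, not the PR's actual interfaces: a processor tracks whether generation is inside a thinking section, counts tokens, and masks the logits to force the end token once the budget is spent.

```python
# Hypothetical sketch of a thinking-token-budget logits processor;
# names and the update hooks are assumptions, not vLLM's actual API.
import math


class MaxThinkTokensSketch:
    def __init__(self, think_start_id: int, think_end_id: int, budget: int):
        self.think_start_id = think_start_id
        self.think_end_id = think_end_id
        self.budget = budget
        self.in_think = False
        self.think_count = 0

    def observe(self, token_id: int) -> None:
        """Track whether generation is inside a thinking section."""
        if token_id == self.think_start_id:
            self.in_think = True
            self.think_count = 0
        elif token_id == self.think_end_id:
            self.in_think = False
        elif self.in_think:
            self.think_count += 1

    def apply(self, logits: list) -> list:
        """Once the budget is exhausted, mask everything but the end token."""
        if self.in_think and self.think_count >= self.budget:
            masked = [-math.inf] * len(logits)
            masked[self.think_end_id] = 0.0
            return masked
        return logits
```

In the real implementation this state lives per request in the batch, and `observe`/`apply` correspond to the batch-update and logits hooks of vLLM's logits-processor interface.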
Code Review
This pull request introduces a feature to limit the number of "thinking" tokens generated by a model, which is a great way to prevent uncontrolled reasoning loops and manage computational budgets. The implementation adds a max_think_tokens parameter and a corresponding MaxThinkTokensLogitsProcessor to enforce this limit. I've identified a couple of issues related to correctness, particularly in edge cases and state management, which I've detailed below. Addressing these will make the feature more robust.
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/v1/sample/logits_processor.py
It seems more appropriate to split this into separate files.
I think it’s good to separate the files, but I’m just concerned about the divergence of different kinds of logits processors at the moment, since some are declared in the ops directory (e.g., bad words, penalties, top-k, top-p), while the built-in logits processors are declared in this logits_processor.py file.
You can probably create a logits_processors dir, then put the different logits processors there.
The default ones can just live under logits_processors/__init__.py, and the others can have their own files.
Good, I’ll update it.
This has already been addressed in this PR, so I will update it as soon as the PR is merged.
Now that that PR is merged, I moved my implementation of ThinkingBudgetLogitsProcessor into v1/sample/logits_processor/builtin.py.
vllm/config.py
Let's not introduce another class for this here. I think we can couple this with the reasoning parser.
It was quite hard to pass the reasoning parser information to the logits processors. If I don't use ReasoningConfig, I would still need to pass the reasoning parser object to the logits processor anyway, so that the logits processor can get the think start/end token IDs.
aarnphm left a comment:
Quick drive-by comments on configuration.
vllm/entrypoints/openai/protocol.py
Can we introduce some heuristics with reasoning_effort? I'm thinking:
- low -> 1024
- medium -> 2048
- high -> 8192
Then we can also expose this as an additional extra_body field for users to override if they have a custom context length set on the vLLM server.
Sounds reasonable. So the user should only provide "reasoning_effort": [low, medium, high] as the sampling parameter? What I’m a bit concerned about is that it’s hard to control at the token level, and it’s only configurable when the server loads.
reasoning_effort is mostly for the OpenAI-compatible endpoint. If users want more control, we then respect thinking_token_budget (or some similar name) in the body instead of reasoning_effort.
Two scenarios:
- Users who already use `reasoning_effort` from the OpenAI frontend: nothing changes for them.
- If they want to increase the thinking budget, knowing that the model context length supports it:

```python
client.chat.completions.create(
    ...,
    reasoning_effort="medium",  # reasoning_effort is ignored here in favor of thinking_tokens_budget
    extra_body={"thinking_tokens_budget": 16384},
)
```
Also, this should be included in the max_tokens calculation.
Thank you for your feedback. I added this parameter as thinking_tokens_budget.
I had applied reasoning_effort, but it became a sampling parameter used as a soft limit on thinking tokens by the chat_template.jinja file.
So I broke the connection between reasoning_effort and thinking_budget_tokens.
vllm/config.py
Two things:
- You've put this between `additional_config` and the comment above explaining what it is.
- There's no need to make this config `Optional`; you can default-construct the actual config as follows:

```diff
- reasoning_config: Optional[ReasoningConfig] = None
+ reasoning_config: ReasoningConfig = field(default_factory=ReasoningConfig)
```
let's avoid changing this, I don't think this is related to this PR.
I changed this part because I needed the start/end token IDs from the reasoning parser for the logits processor, which needs the starting point and the end point of thinking mode.
I referenced this as reasoning_parser.think_start_token_id for both the Qwen and DeepSeek models.
> let's avoid changing this, I don't think this is related to this PR.

+1.
Also, as shown in hunyuan_a13b_reasoning_parser.py, think_start_ids can consist of three token IDs. Using reasoning_parser.think_start_token_id directly doesn't seem like a good approach; I suggest using a @property instead.
@chaunceyjiang Yes, I’ll update this for extensibility.
For now, I just wanted this PR to support only Qwen and DeepSeek models, which use a single token id to start and finish the thinking mode. I think we’ll need a different workflow for reasoning models that require multiple token ids, for example, they may need partial prefill after forcing multiple tokens at the end. In that case, I’m not sure if using only logits processors is the right approach. Maybe we’ll need partial prefill workflows or some help from guided decoding. What do you think about this?
I don't think structured outputs are relevant here.
I think frontend-related features should use a logits processor to avoid performance issues, but the new logits processor should be performant enough.
It is quite hard to handle multiple think end tokens using a logits processor. That’s why I’m also considering implementing this feature in the serving_chat part, the scheduler, or with guided decoding.
There are several ways to implement this, each with its own drawbacks:
- Logits processor: I would have to enforce multiple think end tokens across multiple decode steps, which means some performance degradation (though it may still be reasonable).
- serving_chat: I could make the `reasoning_parser` count think tokens and enforce think end tokens. This would be quite easy to implement, but with the current implementation it seems hard to make the `reasoning_parser` check the sampling parameters of every request, and it is challenging to implement for the non-streaming API.
- Scheduler: Similar to the verification stage of speculative decoding, we could enforce multiple tokens and make the forward step perform a partial prefill. However, it seems quite difficult and complex to make only part of the requests in a batch build a KV cache for multiple tokens. @rishitdholakia13's implementation appears to follow this approach, but if we need to handle multiple tokens, it would get more complex.
- Guided decoding: Guided decoding and structured outputs have similar needs, for example forcing certain tokens. But it also seems complex to manage, given the prior implementations and the use of external libraries.
I decided to apply multiple think end tokens using logits processors. The methods I described above (options 2–4) are difficult to implement at the moment. So, the logits processors will produce multiple think end tokens across multiple forward steps.
With this new commit, I made this feature work with start/end tokens defined as token sequences (multiple tokens).
Since the reasoning parsers do not have the same property, I needed a new config argument to get the think start/end strings (e.g., think_end_str="\n\nConsidering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>\n\n").
Thank you for the insights. I wanted to ask: what would be the advantage of using a logits processor? That is, what do we gain by forcing the change in the logits values, compared to just inserting the required token into the new_token_ids list in gpu_model_runner.py?
Yes, directly inserting tokens into new_token_ids is a valid approach. I used a LogitsProcessor to preserve the sampling flow while still guiding the model to pick the desired token. For multiple end tokens, directly forcing them requires careful KV cache handling, like partial prefill and num_lookahead_slots, which can be tricky. So I designed the LogitsProcessor to force think end tokens one by one. That said, it's also possible to optimize by forcing all at once with partial prefill.
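One way to see why forcing a token through the logits "preserves the sampling flow" is that masking all other logits turns the sampled distribution into a point mass on the forced token, so temperature, top-p, and the rest of the sampling pipeline still apply unchanged. A minimal illustration (helper names here are hypothetical, not vLLM code):

```python
# Illustration only: masking logits so that softmax places all probability
# on the forced token, leaving the normal sampling path intact.
import math


def force_token(logits: list, token_id: int) -> list:
    """Mask every logit except token_id, as a budget-exhausted processor would."""
    out = [-math.inf] * len(logits)
    out[token_id] = 0.0
    return out


def softmax(logits: list) -> list:
    """Numerically stable softmax; exp(-inf) evaluates to exactly 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]
```

Sampling from `softmax(force_token(logits, end_id))` always yields `end_id`, whereas inserting into `new_token_ids` bypasses sampling entirely and, for multi-token end sequences, raises the KV-cache questions discussed above.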
FYI #19912
vllm/sampling_params.py
Can we rename this to thinking_budget? It would help provide consistency in naming, since the max thinking here refers to the thinking budget provided by the user.
Yeah, that's possible. But I'm also thinking that a min_think_tokens option may be added in the future, which would force the model to generate at least min_think_tokens 'think' tokens.
Similar to your recommendation, I renamed it to thinking_token_budget.
/gemini review
Code Review
The pull request introduces a new feature to limit thinking tokens by sampling parameters, which aims to prevent uncontrolled long reasoning loops and support explicit thinking limits. The code changes include adding a ReasoningConfig class, modifying the SamplingParams class, and implementing a ThinkingTokenBudgetLogitsProcessor class. The code review identified issues related to error handling and redundant conditions, which should be addressed to ensure the code's correctness and maintainability.
```python
for i1, i2, direction in batch_update.moved:
    if direction == MoveDirectionality.SWAP:
        state1 = self._state.get(i1, {})
        state2 = self._state.get(i2, {})
```
I think I found a bug here: when we run multiple requests where some have no thinking budget and a swap happens, a request without a thinking budget gets added to the _state dictionary if it was part of the swap slot, causing a KeyError. I made a simple change of using -1 as the thinking budget (meaning unlimited thinking) in my spec + thinking budget PR, which avoids the issue.
@rishit13 Thank you for pointing this out!
Instead of using -1, I thought it would be better to default to None when popping states. This is consistent with how other logits processors are implemented, and avoids unnecessary overhead from tracking states for requests that don't require a thinking budget. Adding a state entry for every such request would introduce extra overhead in state management.
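The fix being discussed, popping with a default instead of pre-populating state for every request, can be sketched like this (a simplified stand-in for the batch-update handling, not the PR's actual code):

```python
# Sketch of swap handling where only budget-carrying requests have state.
# Popping with a default of None means a swap involving a request without
# a thinking budget never raises KeyError and never creates a state entry.
def swap_states(state: dict, i1: int, i2: int) -> None:
    s1 = state.pop(i1, None)
    s2 = state.pop(i2, None)
    if s2 is not None:
        state[i1] = s2
    if s1 is not None:
        state[i2] = s1
```

This mirrors how other built-in logits processors handle moved/swapped batch slots, and avoids the overhead of tracking state for requests that never set a budget.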
```python
"--max-model-len",
"2048",
"--enforce-eager",
"--no-async-scheduling",
```
I added an e2e test. (To run, python -m pytest tests/v1/entrypoints/openai/test_thinking_token_budget.py)
Limiting the thinking token budget works with async scheduling, but achieving exact budget enforcement is difficult, because with async scheduling, output token IDs are not updated in sync with each token generation step. I think this issue could also be addressed by the @rishitdholakia13 's following PR (#34668), which aims to enable this feature with speculative decoding. It is also a case where more than one token can be generated per step.
Yes, I have addressed the issue in the spec + thinking budget PR and added e2e tests as well, which ensure the exact thinking budget limit is enforced in both spec and non-spec mode, using both sync and async scheduling.
Hi @llsj14, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then commit the changes and push to your branch.
@njhill @chaunceyjiang
vllm/v1/worker/gpu_model_runner.py
```python
# ThinkingTokenBudgetLogitsProcessor also needs output token ids to
# correctly track think start/end token sequences in async scheduling.
logitsprocs_need_output_token_ids=bool(custom_logitsprocs)
or (self.vllm_config.reasoning_config is not None),
```

Suggested change:

```diff
- or (self.vllm_config.reasoning_config is not None),
+ or self.vllm_config.reasoning_config is not None,
```
vllm/config/reasoning.py
```python
self.think_start_token_ids = tokenizer.convert_tokens_to_ids(
    tokenizer.tokenize(self.think_start_str)
)
```
Could you explain the reason for replacing convert_tokens_to_ids with the encode method?
Performance: we make one call into Rust without materializing intermediate Python strings.
Can we use encode with add_special_tokens=False?
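The two call patterns under discussion can be illustrated with a toy tokenizer (a stand-in; a real HuggingFace tokenizer may also insert special tokens like BOS on `encode`, which is exactly why `add_special_tokens=False` matters for getting the same IDs as the tokenize-then-convert path):

```python
# Toy tokenizer illustrating encode(..., add_special_tokens=False) versus
# convert_tokens_to_ids(tokenize(...)). Vocabulary and behavior are
# assumptions for demonstration, not any real model's tokenizer.
class ToyTokenizer:
    vocab = {"<think>": 0, "</think>": 1, "hello": 2, "<s>": 3}

    def tokenize(self, text: str) -> list:
        return text.split()

    def convert_tokens_to_ids(self, tokens: list) -> list:
        return [self.vocab[t] for t in tokens]

    def encode(self, text: str, add_special_tokens: bool = True) -> list:
        ids = self.convert_tokens_to_ids(self.tokenize(text))
        # Real tokenizers may prepend BOS (here "<s>") when special tokens
        # are enabled, so the two paths only agree with the flag disabled.
        return ([self.vocab["<s>"]] + ids) if add_special_tokens else ids
```

With a fast (Rust-backed) tokenizer, the single `encode` call also avoids materializing the intermediate Python token strings, which is the performance point made above.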
vllm/config/reasoning.py
```python
think_start_str: str | None = None
"""String that indicates the start of reasoning."""
think_end_str: str | None = None
"""String that indicates the end of reasoning."""
```
However, after several improvements to the ReasoningParser, some similar interfaces have gradually been introduced internally, although they are not publicly exposed yet.
This is all internal usage, right? I don't understand the relevance of exposing it publicly. I'm still confused about why we could not have done this.
```python
    `initialize_token_ids`. Not intended to be configured directly."""

    def initialize_token_ids(self, model_config: ModelConfig) -> None:
        """Initialize reasoning token IDs from strings using the tokenizer."""
```
I think we need a check here that think_start_token_ids and think_end_token_ids are None.
And we should perhaps rename them to start with an underscore and have @property accessors, as I think we do with other "derived" values in the config classes.
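The underscore-plus-property pattern being suggested can be sketched as follows (field and class names are illustrative; the tokenize callable stands in for the real tokenizer wiring):

```python
# Sketch of the suggested pattern: derived token-ID values stored in an
# underscored field, guarded against re-initialization, and exposed via a
# read-only @property. Names are assumptions, not vLLM's actual config.
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ReasoningConfigSketch:
    think_start_str: Optional[str] = None
    _think_start_token_ids: Optional[list] = field(default=None, repr=False)

    def initialize_token_ids(self, tokenize: Callable) -> None:
        """Derive token IDs from the string once; refuse to run twice."""
        if self._think_start_token_ids is not None:
            raise ValueError("token IDs already initialized")
        if self.think_start_str is not None:
            self._think_start_token_ids = tokenize(self.think_start_str)

    @property
    def think_start_token_ids(self) -> Optional[list]:
        return self._think_start_token_ids
```

The property keeps callers reading `config.think_start_token_ids` as before, while making it clear the value is derived rather than user-configured.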
vllm/config/reasoning.py
```python
think_start_str: str | None = None
"""String that indicates the start of reasoning."""
think_end_str: str | None = None
"""String that indicates the end of reasoning."""
```
My main issue here is that we're exposing a new arg / config parameter externally that isn't really required, just because we don't want to go to the hassle of wiring up to the reasoning parsers.
Let's at least add a comment explaining that setting the parameter shouldn't be required and is a temporary state, and that the parameter will likely be removed in a subsequent version.
```python
_bad_words_token_ids: list[list[int]] | None = None

skip_reading_prefix_cache: bool | None = None
thinking_token_budget: int | None = None
```
I think we need to add a check in the appropriate place to fail the request if thinking_token_budget is set but reasoning config is None (no logit processor initialized).
Added a ValueError for that situation.
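The validation in question amounts to a small guard like the following (a minimal sketch; the function name and where the check lives are assumptions):

```python
# Minimal sketch of the request-time check: reject thinking_token_budget
# when no reasoning config exists, since without it no budget-enforcing
# logits processor was initialized. Names are hypothetical.
def validate_thinking_budget(thinking_token_budget, reasoning_config) -> None:
    if thinking_token_budget is not None and reasoning_config is None:
        raise ValueError(
            "thinking_token_budget requires a reasoning config; "
            "start the server with --reasoning-config."
        )
```

Failing fast here turns a silently ignored parameter into an explicit request error, which is the behavior reviewers asked for.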
@njhill I added a comment, and I will resolve this issue with a separate PR soon. I will ask for your and @chaunceyjiang's opinions again there. I think automation could be considered using not only ReasoningParsers but also model configs somehow.
@userbz I also changed the tokenizer part and replaced the tokenization call as suggested.
Essential Elements of an Effective PR Description Checklist

- Update `supported_models.md` and `examples` for a new model.

Related Issue: thinking_budget for Qwen-3 Thinking models in offline AsyncLLMEngine? #25714

Purpose

Controlling thinking effort (e.g., reasoning_level in gpt-oss) is only available in certain models. Without such models, controlling thinking tokens in the current vLLM implementation requires making two separate API calls (for example, with the Qwen model). This is a major pain point because, when making API calls twice, there is no guarantee they will be routed to the same server node unless the system has prefix-aware routing.

Implementation

When a request sets the thinking_token_budget sampling parameter, the logits processor will forcibly insert the thinking end token IDs to terminate the thinking section.

Test
Integration Test

pytest tests/v1/logits_processors/test_correctness.py

Online Serving Test

Checking overhead:
- Model size: 1.5B (DeepSeek-R1-Distill-Qwen-1.5B)
- Max tokens: 512, min_tokens: 512
- thinking_token_budgets: 20

The results show that the overhead is almost zero at the median. In practice, thinking_token_budgets would be larger than 20, since it serves as a hard limit on top of soft limiting (prompting), which makes the overhead even smaller.

(Optional) Documentation Update
Note

Introduces hard "thinking" token limiting and wires it through config, APIs, and sampling.

- `ReasoningConfig` with `think_start_str`/`think_end_str` and derived `think_start_token_ids`/`think_end_token_ids`; initializes the IDs in `VllmConfig.__post_init__` and exposes them via the `--reasoning-config` CLI flag
- Adds `thinking_token_budget` to `SamplingParams` and the OpenAI chat request; plumbs it through request construction
- `ThinkingTokenBudgetLogitsProcessor` tracks per-request state and, once the budget is reached after a `think_start`, forces the `think_end` token IDs (argmax variant) while masking others

Written by Cursor Bugbot for commit f1aefbb022ad5c04af4e88163d979e1517da178c.
ThinkingTokenBudgetLogitsProcessorcounts tokens afterthink_startand, once budget is reached, forcesthink_endtoken IDs; registered in logits pipeline and state updatesReasoningConfigwiththink_start_str/think_end_str→ token IDs; initialized inVllmConfig.__post_init__and exposed via--reasoning-configCLISamplingParamsand OpenAI chat request acceptthinking_token_budgetand plumb it through request constructionWritten by Cursor Bugbot for commit fbaaf12. This will update automatically on new commits. Configure here.