[Feature] limit thinking tokens (hard limit) #20859
Merged
+702
−12
97 commits
8ce4561
feat: limit thinking tokens
llsj14 b815e9c
remove comment
llsj14 2001c36
update states only in update_state method
llsj14 c71cf86
make precommit and lint
llsj14 7ae0725
support think start/end as token sequences
llsj14 03d3495
refactor and change logic faster
llsj14 5442d0c
rename parameter and logit processor
llsj14 283a07a
add reasoning effort param
llsj14 3780d55
remove constraint of the reasoning model
llsj14 7a509fb
update logit processor
llsj14 a44e956
pass ruff
llsj14 0272a72
pass precommit
llsj14 79c7061
fix format
llsj14 44f2acb
fix: loads none error
llsj14 47da378
fix return type
llsj14 11ac0ef
fix error
llsj14 7fe7fe4
update ReasoningConfig handling
llsj14 336efe6
fix config and EngineArgs
llsj14 4b64abf
simplify reasoning config checks and fix errors
llsj14 ace7c4f
reafctor ThinkingTokenBudgetLogitsProcessor
llsj14 43dd440
fix import error from rebase
llsj14 9ee7f2f
fix: remove duplicate reasoning_effort field in ChatCompletionRequest
llsj14 117ca92
fix runtime error after rebase
llsj14 60a275f
check reasoning is enabled
llsj14 f4afba9
add test and implement processor with incremental token processing op…
llsj14 9371120
remove connection between reasoning_effort and thinking_token_budget
llsj14 4b9b87d
fix: support corner cases
llsj14 93afdf0
cleanup unused parameters
llsj14 24334b2
optimize speed up performance while apply logit processor
llsj14 0efea75
utilize logits processor when it is needed, not every step for speed up
llsj14 81362dc
refactor processor
llsj14 8312aa8
add comment on state
llsj14 3b5df9b
fix tokenizer init bug
llsj14 88fa857
make precommit
llsj14 998b19a
fix change condition of using tokenizer
llsj14 3fadb67
make precommit
llsj14 9a91759
make precommit
llsj14 899e4a9
fix: support zero thinking token budget
llsj14 86526fb
refactor: move reasoning token initialization to config level
llsj14 918ac00
Merge commit '17edd8a' into feat/thinking-budget
llsj14 18a61b9
ruff
llsj14 b7ae2c6
Merge commit 'd6953be' into feat/thinking-budget
llsj14 6b070c0
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 93c310e
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 de53277
make is_thinking_enabled property
llsj14 c215575
fix readthedocs failed
llsj14 219ab7b
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 7af86e5
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 0816eb4
Merge branch 'main' into feat/thinking-budget
llsj14 faa6bfb
Merge branch 'main' into feat/thinking-budget
chaunceyjiang 2a5e6c0
Update vllm/config/reasoning.py
chaunceyjiang e8c020d
Update vllm/config/reasoning.py
chaunceyjiang b031c57
Merge branch 'main' into feat/thinking-budget
chaunceyjiang b600cd0
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 fbaaf12
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 284a398
Merge branch 'main' into feat/thinking-budget
llsj14 cf34815
Merge branch 'main' into feat/thinking-budget
llsj14 975a16a
Merge branch 'main' into feat/thinking-budget
llsj14 d3b06cb
Remove unused import from reasoning.py
hmellor 6563a48
Merge branch 'main' into feat/thinking-budget
hmellor 10f5685
Merge branch 'main' into feat/thinking-budget
llsj14 be1e8b6
make thinking budget logits processor working with async scheduling o…
llsj14 5cfa548
make precommit
llsj14 c035ea0
remove obsolte file
llsj14 a5d078c
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 651635c
add docs for thinking budget control
llsj14 7bd0db0
fix docs
llsj14 5628941
Merge branch 'main' into feat/thinking-budget
llsj14 12023dc
do not expose think start end token ids field
llsj14 f493792
Merge branch 'main' into feat/thinking-budget
llsj14 4b49c07
Merge branch 'main' into feat/thinking-budget
llsj14 520a3b8
Merge branch 'main' into feat/thinking-budget
llsj14 064bbed
Merge branch 'main' into feat/thinking-budget
llsj14 5db5920
Merge branch 'main' into feat/thinking-budget
llsj14 7149465
Merge branch 'main' into feat/thinking-budget
llsj14 e643d5b
make think_start/end_str are required and remove is_thinking_enabled …
llsj14 cceb341
fix swap part
llsj14 43ae6c4
fix: ensure reasoning token count exactly matches thinking_token_budget
llsj14 29bb069
add e2e test
llsj14 eee2045
Merge branch 'main' into feat/thinking-budget
llsj14 56ea934
remove gpu util option from e2e test
llsj14 21288ec
Merge branch 'main' into feat/thinking-budget
llsj14 00df8fe
make precommit
llsj14 0fde04f
Merge branch 'main' into feat/thinking-budget
llsj14 7d7c93a
Merge branch 'main' into feat/thinking-budget
llsj14 74e5448
Merge branch 'main' into feat/thinking-budget
llsj14 8d5d70e
use tokenizer encode instead of convert_token_to_ids
llsj14 45bed67
raise ValueError when thinking_token_budget is set but reasoning_conf…
llsj14 8252175
make sure that think start/end token ids are derived from string
llsj14 4624a77
add comment about automation related to ReasoningConfig
llsj14 02de2da
Merge branch 'main' into feat/thinking-budget
llsj14 c4f0816
Merge branch 'main' into feat/thinking-budget
llsj14 a8d512f
Merge branch 'main' into feat/thinking-budget
llsj14 4661874
Merge branch 'main' into feat/thinking-budget
llsj14 8131a4b
Merge branch 'main' into feat/thinking-budget
llsj14 66e7883
Merge branch 'main' into feat/thinking-budget
llsj14 e5cd2e5
Merge branch 'main' into feat/thinking-budget
llsj14
tests/v1/entrypoints/openai/test_thinking_token_budget.py (new file, +87 lines):

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""E2E tests for thinking_token_budget with reasoning models."""

import openai
import pytest
import pytest_asyncio

from tests.utils import RemoteOpenAIServer

MODEL_NAME = "Qwen/Qwen3-0.6B"
MESSAGES = [{"role": "user", "content": "What is 1+1? Be concise."}]
THINK_BUDGET = 5


@pytest.fixture(scope="module")
def server():
    args = [
        "--reasoning-parser",
        "qwen3",
        "--reasoning-config",
        '{"think_start_str": "<think>", "think_end_str": "</think>"}',
        "--max-model-len",
        "2048",
        "--enforce-eager",
        "--no-async-scheduling",
    ]
    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


@pytest_asyncio.fixture
async def client(server):
    async with server.get_async_client() as async_client:
        yield async_client


@pytest.mark.asyncio
async def test_thinking_token_budget_mixed_requests(client: openai.AsyncOpenAI):
    """Test that mixed requests (some with thinking_token_budget, some without)
    complete successfully without errors."""
    response_with_budget = await client.chat.completions.create(
        model=MODEL_NAME,
        messages=MESSAGES,
        max_tokens=100,
        extra_body={"thinking_token_budget": THINK_BUDGET},
    )
    response_without_budget = await client.chat.completions.create(
        model=MODEL_NAME,
        messages=MESSAGES,
        max_tokens=100,
    )

    msg_with = response_with_budget.choices[0].message
    msg_without = response_without_budget.choices[0].message

    assert msg_with.content or getattr(msg_with, "reasoning", None)
    assert msg_without.content or getattr(msg_without, "reasoning", None)


@pytest.mark.asyncio
async def test_thinking_token_budget_limits_reasoning(client: openai.AsyncOpenAI):
    """Test that thinking_token_budget limits the number of reasoning tokens.

    In streaming mode each reasoning delta corresponds to one token, so
    counting non-empty reasoning_content chunks gives the exact token count.
    """
    reasoning_token_count = 0
    stream = await client.chat.completions.create(
        model=MODEL_NAME,
        messages=MESSAGES,
        max_tokens=100,
        stream=True,
        extra_body={"thinking_token_budget": THINK_BUDGET},
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta
        if getattr(delta, "reasoning", None):
            reasoning_token_count += 1

    assert reasoning_token_count == THINK_BUDGET, (
        f"reasoning tokens ({reasoning_token_count}) != "
        f"thinking_token_budget ({THINK_BUDGET})"
    )
```
I added an e2e test. (To run: `python -m pytest tests/v1/entrypoints/openai/test_thinking_token_budget.py`.) Limiting the thinking token budget works with async scheduling, but achieving exact budget enforcement there is difficult, because with async scheduling the output token IDs are not updated in sync with each token-generation step. I think this issue could also be addressed by @rishitdholakia13's follow-up PR (#34668), which aims to enable this feature with speculative decoding; that is another case where more than one token can be generated per step.
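For readers skimming the PR, the core mechanism can be pictured as a stateful logits processor: once the number of tokens generated inside the thinking section reaches the budget, every logit except the think-end token is masked, forcing the model to close the section. The sketch below is illustrative only; the class and field names are hypothetical and the actual `ThinkingTokenBudgetLogitsProcessor` in this PR differs in detail (incremental token processing, batch handling, async scheduling):

```python
# Illustrative sketch of thinking-token-budget enforcement.
# All names here are hypothetical, not vLLM's real implementation.
import math


class ThinkingBudgetSketch:
    def __init__(self, think_start_id: int, think_end_id: int, budget: int):
        self.think_start_id = think_start_id
        self.think_end_id = think_end_id
        self.budget = budget
        self.in_thinking = False
        self.thinking_tokens = 0

    def update_state(self, last_token_id: int) -> None:
        # Track whether generation is inside the <think> ... </think> section
        # and how many tokens have been spent there.
        if last_token_id == self.think_start_id:
            self.in_thinking = True
            self.thinking_tokens = 0
        elif last_token_id == self.think_end_id:
            self.in_thinking = False
        elif self.in_thinking:
            self.thinking_tokens += 1

    def apply(self, logits: list[float]) -> list[float]:
        # Once the budget is exhausted, force the think-end token by
        # masking every other logit to -inf; otherwise pass logits through.
        if self.in_thinking and self.thinking_tokens >= self.budget:
            masked = [-math.inf] * len(logits)
            masked[self.think_end_id] = 0.0
            return masked
        return logits
```

With per-step synchronous sampling this enforces the budget exactly; as the comment above notes, async scheduling (and speculative decoding) can emit more than one token per step, which is what makes exact enforcement harder in those modes.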
Yes, I have addressed the issue in the spec + thinking-budget PR and added e2e tests as well, which ensure the exact thinking-budget limit is enforced in both spec and non-spec modes, using both sync and async scheduling.