Merged
97 commits
8ce4561
feat: limit thinking tokens
llsj14 Jul 12, 2025
b815e9c
remove comment
llsj14 Jul 12, 2025
2001c36
update states only in update_state method
llsj14 Jul 14, 2025
c71cf86
make precommit and lint
llsj14 Jul 14, 2025
7ae0725
support think start/end as token sequences
llsj14 Jul 16, 2025
03d3495
refactor and change logic faster
llsj14 Jul 17, 2025
5442d0c
rename parameter and logit processor
llsj14 Jul 18, 2025
283a07a
add reasoning effort param
llsj14 Jul 18, 2025
3780d55
remove constraint of the reasoning model
llsj14 Jul 18, 2025
7a509fb
update logit processor
llsj14 Jul 19, 2025
a44e956
pass ruff
llsj14 Jul 19, 2025
0272a72
pass precommit
llsj14 Jul 19, 2025
79c7061
fix format
llsj14 Jul 19, 2025
44f2acb
fix: loads none error
llsj14 Jul 21, 2025
47da378
fix return type
llsj14 Jul 21, 2025
11ac0ef
fix error
llsj14 Jul 21, 2025
7fe7fe4
update ReasoningConfig handling
llsj14 Jul 21, 2025
336efe6
fix config and EngineArgs
llsj14 Jul 21, 2025
4b64abf
simplify reasoning config checks and fix errors
llsj14 Jul 22, 2025
ace7c4f
reafctor ThinkingTokenBudgetLogitsProcessor
llsj14 Jul 27, 2025
43dd440
fix import error from rebase
llsj14 Jul 27, 2025
9ee7f2f
fix: remove duplicate reasoning_effort field in ChatCompletionRequest
llsj14 Aug 16, 2025
117ca92
fix runtime error after rebase
llsj14 Aug 17, 2025
60a275f
check reasoning is enabled
llsj14 Aug 18, 2025
f4afba9
add test and implement processor with incremental token processing op…
llsj14 Aug 19, 2025
9371120
remove connection between reasoning_effort and thinking_token_budget
llsj14 Aug 20, 2025
4b9b87d
fix: support corner cases
llsj14 Aug 23, 2025
93afdf0
cleanup unused parameters
llsj14 Aug 23, 2025
24334b2
optimize speed up performance while apply logit processor
llsj14 Aug 23, 2025
0efea75
utilize logits processor when it is needed, not every step for speed up
llsj14 Sep 4, 2025
81362dc
refactor processor
llsj14 Sep 5, 2025
8312aa8
add comment on state
llsj14 Sep 17, 2025
3b5df9b
fix tokenizer init bug
llsj14 Sep 17, 2025
88fa857
make precommit
llsj14 Sep 17, 2025
998b19a
fix change condition of using tokenizer
llsj14 Sep 18, 2025
3fadb67
make precommit
llsj14 Oct 3, 2025
9a91759
make precommit
llsj14 Oct 3, 2025
899e4a9
fix: support zero thinking token budget
llsj14 Oct 3, 2025
86526fb
refactor: move reasoning token initialization to config level
llsj14 Oct 3, 2025
918ac00
Merge commit '17edd8a' into feat/thinking-budget
llsj14 Oct 12, 2025
18a61b9
ruff
llsj14 Oct 12, 2025
b7ae2c6
Merge commit 'd6953be' into feat/thinking-budget
llsj14 Oct 12, 2025
6b070c0
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 Oct 12, 2025
93c310e
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 Nov 1, 2025
de53277
make is_thinking_enabled property
llsj14 Nov 1, 2025
c215575
fix readthedocs failed
llsj14 Nov 2, 2025
219ab7b
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 Dec 12, 2025
7af86e5
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 Dec 18, 2025
0816eb4
Merge branch 'main' into feat/thinking-budget
llsj14 Jan 1, 2026
faa6bfb
Merge branch 'main' into feat/thinking-budget
chaunceyjiang Jan 5, 2026
2a5e6c0
Update vllm/config/reasoning.py
chaunceyjiang Jan 5, 2026
e8c020d
Update vllm/config/reasoning.py
chaunceyjiang Jan 5, 2026
b031c57
Merge branch 'main' into feat/thinking-budget
chaunceyjiang Jan 5, 2026
b600cd0
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 Jan 9, 2026
fbaaf12
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 Jan 11, 2026
284a398
Merge branch 'main' into feat/thinking-budget
llsj14 Feb 5, 2026
cf34815
Merge branch 'main' into feat/thinking-budget
llsj14 Feb 6, 2026
975a16a
Merge branch 'main' into feat/thinking-budget
llsj14 Feb 8, 2026
d3b06cb
Remove unused import from reasoning.py
hmellor Feb 10, 2026
6563a48
Merge branch 'main' into feat/thinking-budget
hmellor Feb 26, 2026
10f5685
Merge branch 'main' into feat/thinking-budget
llsj14 Feb 27, 2026
be1e8b6
make thinking budget logits processor working with async scheduling o…
llsj14 Feb 27, 2026
5cfa548
make precommit
llsj14 Feb 27, 2026
c035ea0
remove obsolte file
llsj14 Feb 27, 2026
a5d078c
Merge remote-tracking branch 'upstream/main' into feat/thinking-budget
llsj14 Feb 27, 2026
651635c
add docs for thinking budget control
llsj14 Feb 27, 2026
7bd0db0
fix docs
llsj14 Mar 1, 2026
5628941
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 1, 2026
12023dc
do not expose think start end token ids field
llsj14 Mar 3, 2026
f493792
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 3, 2026
4b49c07
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 3, 2026
520a3b8
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 4, 2026
064bbed
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 5, 2026
5db5920
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 5, 2026
7149465
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 6, 2026
e643d5b
make think_start/end_str are required and remove is_thinking_enabled …
llsj14 Mar 8, 2026
cceb341
fix swap part
llsj14 Mar 8, 2026
43ae6c4
fix: ensure reasoning token count exactly matches thinking_token_budget
llsj14 Mar 8, 2026
29bb069
add e2e test
llsj14 Mar 8, 2026
eee2045
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 8, 2026
56ea934
remove gpu util option from e2e test
llsj14 Mar 8, 2026
21288ec
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 9, 2026
00df8fe
make precommit
llsj14 Mar 9, 2026
0fde04f
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 10, 2026
7d7c93a
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 11, 2026
74e5448
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 11, 2026
8d5d70e
use tokenizer encode instead of convert_token_to_ids
llsj14 Mar 13, 2026
45bed67
raise ValueError when thinking_token_budget is set but reasoning_conf…
llsj14 Mar 13, 2026
8252175
make sure that think start/end token ids are derived from string
llsj14 Mar 13, 2026
4624a77
add comment about automation related to ReasoningConfig
llsj14 Mar 13, 2026
02de2da
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 13, 2026
c4f0816
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 16, 2026
a8d512f
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 17, 2026
4661874
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 18, 2026
8131a4b
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 20, 2026
66e7883
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 22, 2026
e5cd2e5
Merge branch 'main' into feat/thinking-budget
llsj14 Mar 23, 2026
75 changes: 75 additions & 0 deletions docs/features/reasoning_outputs.md
@@ -240,6 +240,81 @@ response = client.chat.completions.create(
)
```

## Thinking Budget Control

Some models, such as [Qwen3](https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html#thinking-budget), [DeepSeek](https://www.alibabacloud.com/help/en/model-studio/deep-thinking), and [Nemotron3](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16), support a thinking budget that limits the maximum number of tokens used for reasoning.

Token counting starts from `think_start_str`. Once the reasoning token count reaches the configured `thinking_token_budget`, vLLM forces the model to produce `think_end_str`, effectively terminating the reasoning block.
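Conceptually, the enforcement resembles a logits processor: track when the think-start token appears and, once the per-request budget is spent, mask every logit except the think-end token. Below is a minimal sketch of that idea only, using hypothetical token ids and a plain Python list as the logit vector; vLLM's actual processor additionally handles multi-token start/end sequences and incremental per-request state:

```python
import math

# Hypothetical token ids for the boundary tokens; in practice the ids are
# derived from the model's tokenizer and may be multi-token sequences.
THINK_START_ID = 1000
THINK_END_ID = 1001


def enforce_thinking_budget(output_token_ids, logits, budget):
    """Once `budget` tokens have been generated after THINK_START_ID,
    mask the logits so that only THINK_END_ID can be sampled."""
    if THINK_START_ID not in output_token_ids:
        return logits  # reasoning has not started yet
    if THINK_END_ID in output_token_ids:
        return logits  # reasoning block already closed
    start = output_token_ids.index(THINK_START_ID)
    reasoning_len = len(output_token_ids) - start - 1
    if reasoning_len >= budget:
        # Budget exhausted: force the think-end token.
        masked = [-math.inf] * len(logits)
        masked[THINK_END_ID] = 0.0
        return masked
    return logits
```

In this sketch the function would be called once per decoding step with the tokens generated so far; returning a fully masked vector forces the sampler to emit the think-end token, terminating the reasoning block.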

To use this feature:

- `--reasoning-parser` enables reasoning extraction.
- `--reasoning-config` defines the reasoning boundary tokens (e.g., `think_start_str`, `think_end_str`).
- `thinking_token_budget` (a sampling parameter) sets the per-request reasoning token limit.

If `thinking_token_budget` is not specified, no explicit reasoning limit is applied beyond normal generation constraints such as `max_tokens`.

`--reasoning-config` accepts a JSON object corresponding to
[ReasoningConfig][vllm.config.ReasoningConfig] with the following fields:

| Field | Type | Description |
|-------------------|----------------|--------------------------------------------------|
| `think_start_str` | `str \| null` | String that marks the start of reasoning content |
| `think_end_str` | `str \| null` | String that marks the end of reasoning content |

!!! note
`think_end_str` can include a transition phrase before the think end token. For example, setting `think_end_str` to `"I have to give the solution based on the thinking directly now.</think>"` instructs the model to emit that phrase when the budget is exhausted, making the reasoning termination more natural.

### Online Serving

```bash
vllm serve Qwen/Qwen3-0.6B \
    --reasoning-parser qwen3 \
    --reasoning-config '{"think_start_str": "<think>", "think_end_str": "I have to give the solution based on the thinking directly now.</think>"}'
```

Then make a request with `thinking_token_budget` to limit the reasoning tokens. In the raw JSON body it is a top-level request field (the OpenAI Python client would pass it via `extra_body`):

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
            {"role": "user", "content": "9.11 and 9.8, which is greater?"}
        ],
        "thinking_token_budget": 10
    }'
```

### Offline Inference

```python
from vllm import LLM, SamplingParams
from vllm.config import ReasoningConfig

llm = LLM(
    model="Qwen/Qwen3-0.6B",
    reasoning_config=ReasoningConfig(
        think_start_str="<think>",
        think_end_str="I have to give the solution based on the thinking directly now.</think>",
    ),
)

sampling_params = SamplingParams(thinking_token_budget=10)

messages = [
    {"role": "user", "content": "9.11 and 9.8, which is greater?"},
]

outputs = llm.chat(messages, sampling_params=sampling_params)

for output in outputs:
    print("text:", output.outputs[0].text)
```

## Limitations

- The reasoning content is only available for online serving's chat completion endpoint (`/v1/chat/completions`).
87 changes: 87 additions & 0 deletions tests/v1/entrypoints/openai/test_thinking_token_budget.py
@@ -0,0 +1,87 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

"""E2E tests for thinking_token_budget with reasoning models."""

import openai
import pytest
import pytest_asyncio

from tests.utils import RemoteOpenAIServer

MODEL_NAME = "Qwen/Qwen3-0.6B"
MESSAGES = [{"role": "user", "content": "What is 1+1? Be concise."}]
THINK_BUDGET = 5


@pytest.fixture(scope="module")
def server():
    args = [
        "--reasoning-parser",
        "qwen3",
        "--reasoning-config",
        '{"think_start_str": "<think>", "think_end_str": "</think>"}',
        "--max-model-len",
        "2048",
        "--enforce-eager",
        "--no-async-scheduling",
llsj14 (Contributor, Author) commented:
I added an e2e test (run with `python -m pytest tests/v1/entrypoints/openai/test_thinking_token_budget.py`).

Limiting the thinking token budget works with async scheduling, but exact budget enforcement is difficult there because output token IDs are not updated in sync with each token generation step. I think this could also be addressed by @rishitdholakia13's PR (#34668), which aims to enable this feature with speculative decoding, another case where more than one token can be generated per step.

rishitdholakia13 (Contributor), Mar 8, 2026:
Yes, I have addressed the issue in the spec + thinking budget PR and added e2e tests that ensure the exact thinking budget limit is enforced in both spec and non-spec mode, using both sync and async scheduling.

    ]
    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


@pytest_asyncio.fixture
async def client(server):
    async with server.get_async_client() as async_client:
        yield async_client


@pytest.mark.asyncio
async def test_thinking_token_budget_mixed_requests(client: openai.AsyncOpenAI):
    """Test that mixed requests (some with thinking_token_budget, some without)
    complete successfully without errors."""

    response_with_budget = await client.chat.completions.create(
        model=MODEL_NAME,
        messages=MESSAGES,
        max_tokens=100,
        extra_body={"thinking_token_budget": THINK_BUDGET},
    )
    response_without_budget = await client.chat.completions.create(
        model=MODEL_NAME,
        messages=MESSAGES,
        max_tokens=100,
    )

    msg_with = response_with_budget.choices[0].message
    msg_without = response_without_budget.choices[0].message

    assert msg_with.content or getattr(msg_with, "reasoning", None)
    assert msg_without.content or getattr(msg_without, "reasoning", None)


@pytest.mark.asyncio
async def test_thinking_token_budget_limits_reasoning(client: openai.AsyncOpenAI):
    """Test that thinking_token_budget limits the number of reasoning tokens.

    In streaming mode each reasoning delta corresponds to one token, so
    counting non-empty `reasoning` deltas gives the exact token count.
    """

    reasoning_token_count = 0
    stream = await client.chat.completions.create(
        model=MODEL_NAME,
        messages=MESSAGES,
        max_tokens=100,
        stream=True,
        extra_body={"thinking_token_budget": THINK_BUDGET},
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta
        if getattr(delta, "reasoning", None):
            reasoning_token_count += 1

    assert reasoning_token_count == THINK_BUDGET, (
        f"reasoning tokens ({reasoning_token_count}) != "
        f"thinking_token_budget ({THINK_BUDGET})"
    )