[Feature] Add reasoning_budget to cap thinking tokens via existing reasoning parser by abhinand5 · Pull Request #37112 · vllm-project/vllm

abhinand5 · 2026-03-15T17:36:40Z

Purpose

Adds reasoning_budget and reasoning_budget_message parameters to cap reasoning tokens inside <think>...</think> markers for reasoning models (DeepSeek-R1, Qwen3, etc.).

Related issues: #34827, #17887, #29791, #15418
Related PR: #20859

Why another PR?

PR #20859 has been open since July 2025 and the main blocker from maintainer @njhill is architectural: it introduces a separate ReasoningConfig with --reasoning-config CLI arg requiring users to manually specify think_start_str/think_end_str. njhill's feedback (Mar 12 review):

"My main issue here is that we're exposing a new arg / config parameter externally that isn't really required, just because we don't want to go to the hassle of wiring up to the reasoning parsers."

"From UX pov, I feel it's much better/important to have these be determined automatically."

This PR takes that feedback as a design constraint. It reuses the existing --reasoning-parser infrastructure: no new CLI args, no new config classes. The logits processor obtains start/end token IDs directly from ReasoningParserManager, so it works automatically with any registered reasoning parser (deepseek_r1, qwen3, etc.).
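To illustrate the idea of deriving the marker token IDs from the parser name instead of asking the user for them, here is a minimal toy sketch. It does not use vLLM's ReasoningParserManager; the registry dict, the ThinkTokens type, the resolve_think_tokens helper, and the token IDs are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ThinkTokens:
    """Token IDs of the reasoning start/end markers (hypothetical type)."""
    start_id: int
    end_id: int


# Hypothetical registry: parser name -> (start marker, end marker).
# A stand-in for whatever the registered reasoning parsers expose.
_PARSER_MARKERS: Dict[str, Tuple[str, str]] = {
    "deepseek_r1": ("<think>", "</think>"),
    "qwen3": ("<think>", "</think>"),
}


def resolve_think_tokens(
    parser_name: Optional[str], vocab: Dict[str, int]
) -> Optional[ThinkTokens]:
    """Return marker token IDs for a parser, or None when unavailable.

    Returning None models the "no parser configured -> no-op" case.
    """
    if parser_name is None:
        return None
    markers = _PARSER_MARKERS.get(parser_name)
    if markers is None:
        return None
    start, end = markers
    if start not in vocab or end not in vocab:
        return None
    return ThinkTokens(vocab[start], vocab[end])
```

With this shape, the user only names a parser; the markers and their IDs follow from it, which is the UX property the review feedback asked for.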

Design

New builtin ReasoningBudgetLogitsProcessor following the MinTokensLogitsProcessor pattern:

Constructor: Gets start/end token IDs from the configured --reasoning-parser via ReasoningParserManager. No parser configured -> processor is a no-op.
Per-request state: Tracks nesting depth (handles <think> in prompt), counts reasoning tokens, and switches to injection mode when budget is exceeded.
Injection: Force-injects a configurable message (default: "Reasoning budget exceeded, need to answer.") followed by the </think> end token by setting logits to -inf for all tokens except the forced one.
Natural end: If </think> appears before the budget is hit, the request is removed from tracking...zero overhead.
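The per-request behavior described above can be sketched as a small state machine. This is an illustrative simplification under stated assumptions, not the PR's ReasoningBudgetLogitsProcessor: the BudgetState class and its fields are hypothetical, logits are a plain Python list rather than a tensor, and the injection message is assumed to be pre-tokenized into forced_ids.

```python
import math
from typing import List, Optional, Sequence


class BudgetState:
    """Per-request reasoning budget tracker (hypothetical sketch)."""

    def __init__(self, start_id: int, end_id: int, budget: int,
                 forced_ids: Sequence[int], prompt_ids: Sequence[int] = ()):
        self.start_id = start_id
        self.end_id = end_id
        self.budget = budget
        # Tokens forced once over budget: message tokens, then the end marker.
        self.forced: List[int] = list(forced_ids) + [end_id]
        self.depth = 0                          # nesting depth of markers
        self.used = 0                           # reasoning tokens counted
        self.inject_pos: Optional[int] = None   # None means not injecting
        self.done = False                       # True once tracking stops
        for tok in prompt_ids:                  # handles <think> in the prompt
            self._track_depth(tok)

    def _track_depth(self, tok: int) -> None:
        if tok == self.start_id:
            self.depth += 1
        elif tok == self.end_id and self.depth > 0:
            self.depth -= 1

    def observe(self, tok: int) -> None:
        """Update the state with a newly sampled token."""
        if self.done:
            return
        if self.inject_pos is not None:
            # The sampled token was the forced one; advance the injection.
            self.inject_pos += 1
            if self.inject_pos == len(self.forced):
                self.done = True
            return
        was_inside = self.depth > 0
        self._track_depth(tok)
        if was_inside and self.depth == 0:
            self.done = True                    # natural end before the budget
            return
        if self.depth > 0 and tok != self.start_id:
            self.used += 1
            if self.used >= self.budget:
                self.inject_pos = 0             # switch to injection mode

    def apply(self, logits: List[float]) -> List[float]:
        """Mask every token except the forced one while injecting."""
        if self.done or self.inject_pos is None:
            return logits
        forced = self.forced[self.inject_pos]
        return [v if i == forced else -math.inf for i, v in enumerate(logits)]
```

Outside injection mode apply returns the logits unchanged, so a request that ends its reasoning naturally, or never starts reasoning, is not affected by the mask.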

API surface

# OpenAI-compatible chat completion
{
    "reasoning_budget": 50,                    # max reasoning tokens (optional)
    "reasoning_budget_message": "\n\nLet me stop thinking and answer now."  # custom injection message (optional)
}

# SamplingParams
SamplingParams(reasoning_budget=50, reasoning_budget_message="Wrap it up.")

Test Plan

# Serve a reasoning model
vllm serve Qwen/Qwen3.5-4B --reasoning-parser qwen3

# With budget (reasoning capped at ~50 tokens)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-4B",
    "messages": [{"role": "user", "content": "Prove there are infinitely many primes."}],
    "max_tokens": 1024,
    "reasoning_budget": 50
  }'

# Without budget (baseline, unchanged behavior)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-4B",
    "messages": [{"role": "user", "content": "Prove there are infinitely many primes."}],
    "max_tokens": 1024
  }'
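The two request bodies above differ only in the reasoning_budget field. As a rough sketch, they can be built in Python as follows; build_chat_body is a hypothetical helper, and only the field names reasoning_budget and reasoning_budget_message come from this PR.

```python
import json
from typing import Any, Dict, Optional


def build_chat_body(model: str, prompt: str, max_tokens: int = 1024,
                    reasoning_budget: Optional[int] = None,
                    reasoning_budget_message: Optional[str] = None
                    ) -> Dict[str, Any]:
    """Build a chat completion request body (hypothetical helper).

    Optional fields are omitted when unset, so the baseline request is
    identical to one sent by a client that knows nothing about budgets.
    """
    body: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if reasoning_budget is not None:
        body["reasoning_budget"] = reasoning_budget
    if reasoning_budget_message is not None:
        body["reasoning_budget_message"] = reasoning_budget_message
    return body


prompt = "Prove there are infinitely many primes."
with_budget = build_chat_body("Qwen/Qwen3.5-4B", prompt, reasoning_budget=50)
baseline = build_chat_body("Qwen/Qwen3.5-4B", prompt)

# Serialize for use as the POST payload to /v1/chat/completions.
print(json.dumps(with_budget, indent=2))
```

Either body can then be sent with any HTTP client to the server started above.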

Test Result

Budget=512: Reasoning capped at ~50 tokens, injection message + </think> appears, model transitions to coherent answer
Budget=512 with custom message: Works with "reasoning_budget_message": "Stop thinking and answer now."
Natural end before budget: Model finishes reasoning in <512 tokens --> no injection, behaves normally
No budget set: Identical behavior to baseline (processor is no-op)
No reasoning parser configured: reasoning_budget silently ignored (processor has no start/end token IDs)

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

github-actions · 2026-03-15T17:36:47Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request introduces a reasoning_budget to cap the number of tokens generated within <think>...</think> blocks. The implementation is well-integrated with the existing reasoning parser infrastructure, avoiding new configuration flags as requested in feedback on a previous PR. A new ReasoningBudgetLogitsProcessor is added to handle the token counting and injection of a stop message when the budget is exceeded. The overall approach is sound and the changes to the API surface in ChatCompletionRequest and SamplingParams are appropriate. My main concern is the use of a broad, silent exception handler in the ReasoningBudgetLogitsProcessor's constructor, which could hide configuration or initialization errors.
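The concern about a broad, silent exception handler can be illustrated with a minimal sketch of the alternative: catch only the expected failure, log it, and let anything unexpected propagate. The load_think_token_ids helper, the lookup callable, and the choice of KeyError as the expected failure are assumptions made for illustration; they are not taken from the PR's code.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger(__name__)


def load_think_token_ids(
    lookup: Callable[[str], Tuple[int, int]],
    parser_name: Optional[str],
) -> Optional[Tuple[int, int]]:
    """Resolve (start_id, end_id) or return None to disable the processor.

    `lookup` is a hypothetical callable mapping a parser name to token IDs.
    Only the expected "unknown parser" failure (assumed here to be KeyError)
    is handled; other exceptions surface to the caller instead of being hidden.
    """
    if parser_name is None:
        return None  # no parser configured: processor is a no-op
    try:
        return lookup(parser_name)
    except KeyError:
        logger.warning(
            "Reasoning parser %r is not registered; "
            "reasoning_budget will be ignored.", parser_name)
        return None
```

Compared with a bare `except Exception: pass`, a misconfiguration leaves a trace in the logs and genuine bugs in initialization are not swallowed.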

vllm/v1/sample/logits_processor/builtin.py


mergify bot added frontend v1 labels Mar 15, 2026

gemini-code-assist bot reviewed Mar 15, 2026

vllm/v1/sample/logits_processor/builtin.py (outdated, resolved)

abhinand5 mentioned this pull request Mar 15, 2026, in [Feature] limit thinking tokens (hard limit) #20859 (open)

abhinand5 force-pushed the main branch 2 times, most recently from 7eca019 to d0169e4 Compare March 15, 2026 18:08

abhinand5 and others added 3 commits March 15, 2026 23:45

[Feature] Implement reasoning budget and message for chat completion …

2185ca8

…and sampling parameters Co-authored-by: Claude Signed-off-by: Abhinand B <abhinand5899@gmail.com>

[Fix] Add newline to reasoning budget exceeded message in SamplingParams

381b66a

Co-authored-by: Claude Signed-off-by: Abhinand B <abhinand5899@gmail.com>

[Fix] Update exception handler for ReasoningBudgetLogitsProcessor

1e97d78

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Abhinand B <abhinand5899@gmail.com>

abhinand5 force-pushed the main branch from d0169e4 to 1e97d78 Compare March 15, 2026 18:16

[Fix] pre-commit fixes for ReasoningBudgetLogitsProcessor

0ae2023

Signed-off-by: Abhinand B <abhinand5899@gmail.com>

abhinand5 marked this pull request as ready for review March 16, 2026 03:19

abhinand5 requested review from 22quinn, DarkLight1337, NickLucche, aarnphm, chaunceyjiang, houseroad, njhill and russellb as code owners March 16, 2026 03:19

Merge branch 'main' into main

15b8880

abhinand5 wants to merge 5 commits into vllm-project:main from abhinand5:main
