
Support reasoning_tokens with openai style in serving_chat #17764

Closed
lw9527 wants to merge 4 commits into sgl-project:main from lw9527:add_reasoning_tokens

Conversation

@lw9527 (Contributor) commented Jan 26, 2026

Motivation

Support reasoning_tokens with openai style in serving_chat

Modifications

  • Support reasoning_tokens with OpenAI style in serving_chat.
  • Add an --enable-reasoning-tokens server argument; reasoning_tokens are not returned by default.
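The flag is opt-in: launching the server as, for example, python -m sglang.launch_server --model-path <model> --enable-reasoning-tokens (model path is a placeholder) enables reasoning token reporting in the usage block.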

Accuracy Tests

non stream

import openai

client = openai.Client(
    api_key="",
    base_url="http://127.0.0.1:40823/v1",
)
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "who are you?"}],
    max_tokens=20000,
    temperature=1,
    stream=False,
    stream_options={"include_usage": True},
)
print(response)

[screenshot of the non-stream response]

stream

[screenshot of the streaming response]
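Since the streaming test was shared only as a screenshot, here is a minimal sketch of the corresponding call, mirroring the non-stream snippet above (an assumption, not the author's exact test code):

# Sketch of the streaming variant; reuses the client from the snippet above.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "who are you?"}],
    max_tokens=20000,
    temperature=1,
    stream=True,
    stream_options={"include_usage": True},
)
last_chunk = None
for chunk in stream:
    last_chunk = chunk
# With include_usage, the final chunk carries the usage block, which is
# where reasoning_tokens appears when --enable-reasoning-tokens is set.
print(last_chunk.usage)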

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @lw9527, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the capability to track and report 'reasoning tokens' within the OpenAI-compatible chat serving endpoint. This enhancement provides users with a more granular breakdown of token consumption, distinguishing between standard completion tokens and those generated during the model's internal reasoning steps. The feature is configurable via a new server-side flag, allowing for flexible adoption and detailed cost analysis for complex model interactions.

Highlights

  • Reasoning Token Support: Introduced support for reporting reasoning_tokens in the OpenAI-style serving chat, providing more detailed token usage information.
  • New Server Argument: Added a new server argument --enable-reasoning-tokens to explicitly enable or disable the calculation and reporting of reasoning tokens.
  • Enhanced UsageInfo Protocol: The UsageInfo protocol has been updated to include CompletionTokensDetails and PromptTokensDetails for more granular reporting of token types, aligning with OpenAI's API structure (a sketch follows this list). The existing reasoning_tokens field is now marked as deprecated for backward compatibility.
  • Reasoning Token Calculation: Implemented a new calculate_reasoning_tokens method within the ReasoningParser to accurately count tokens associated with the model's internal reasoning process, including start and end tokens.
  • Comprehensive Testing: Added extensive unit tests for both the serving_chat functionality and the ReasoningParser to ensure correct calculation and reporting of reasoning tokens in various scenarios, including streaming and non-streaming responses, and multiple choices.
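As a rough illustration of the protocol shape described above (a minimal sketch assuming pydantic models; field names beyond those in the highlights follow OpenAI's published usage schema, not necessarily the PR's exact definitions):

from typing import Optional
from pydantic import BaseModel

class CompletionTokensDetails(BaseModel):
    # Tokens the model spent inside its reasoning segment.
    reasoning_tokens: Optional[int] = None

class PromptTokensDetails(BaseModel):
    # Prompt tokens served from cache, per OpenAI's usage schema.
    cached_tokens: Optional[int] = None

class UsageInfo(BaseModel):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    completion_tokens_details: Optional[CompletionTokensDetails] = None
    prompt_tokens_details: Optional[PromptTokensDetails] = None
    # Deprecated top-level field, kept for backward compatibility.
    reasoning_tokens: Optional[int] = None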



@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces support for reasoning_tokens in the OpenAI-style serving, including new fields in the protocol, logic for calculating these tokens in chat and responses, and corresponding server arguments and tests. The changes are well-covered by tests, ensuring the new functionality works as expected for both streaming and non-streaming scenarios, and for multiple choices. Overall, the implementation is robust and adds valuable functionality.

Comment on lines +499 to +503

completion_tokens_details = None
if num_reasoning_tokens > 0:
    completion_tokens_details = CompletionTokensDetails(
        reasoning_tokens=num_reasoning_tokens,
    )

Severity: high

The completion_tokens_details is only created if num_reasoning_tokens > 0. However, num_reasoning_tokens is only populated within the if self.use_harmony: block (lines 451-458). If self.use_harmony is False, num_reasoning_tokens will always be 0, preventing completion_tokens_details from being set for non-Harmony contexts, even if reasoning tokens were calculated and available in meta_info from serving_chat.py.
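A minimal sketch of the kind of fallback the reviewer is pointing at, assuming the count computed in serving_chat.py is carried in meta_info under a "reasoning_tokens" key (the key name and function shape are assumptions for illustration, not the PR's actual code):

# Sketch: prefer the Harmony count when available, else fall back to the
# count passed through meta_info so non-Harmony contexts are covered too.
def resolve_completion_tokens_details(use_harmony, num_reasoning_tokens, meta_info):
    if not use_harmony:
        num_reasoning_tokens = meta_info.get("reasoning_tokens", 0)
    if num_reasoning_tokens > 0:
        return CompletionTokensDetails(reasoning_tokens=num_reasoning_tokens)
    return None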

Comment on lines +810 to +813

reasoning_parser = ReasoningParser(
    self.reasoning_parser,
    request.stream_reasoning,
    is_force_reasoning,

Severity: medium

The ReasoningParser is instantiated inside the loop, which means it will be created for each choice (request.n). If the model_type, stream_reasoning, and force_reasoning are constant for all choices within a single request, it would be more efficient to instantiate reasoning_parser once outside this loop and reuse it.

Suggested change

-reasoning_parser = ReasoningParser(
-    self.reasoning_parser,
-    request.stream_reasoning,
-    is_force_reasoning,
+reasoning_parser_instance = ReasoningParser(
+    self.reasoning_parser,
+    request.stream_reasoning,
+    is_force_reasoning,
+)
+for index, full_content in stream_buffers.items():
+    reasoning_tokens[index] = (
+        reasoning_parser_instance.calculate_reasoning_tokens(
+            full_content, self.tokenizer_manager.tokenizer
+        )
+    )

Comment thread on python/sglang/srt/parser/reasoning_parser.py (outdated)
@lw9527 force-pushed the add_reasoning_tokens branch from 3fa4303 to 8c5be32 on February 3, 2026 at 07:37.
lw9527 and others added 3 commits on February 3, 2026 at 15:42.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@shanemort1982 commented:
Confirmation: We independently tested a similar fix and it works! ✅

We encountered the same reasoning_tokens: 0 bug (issues #15446, #15508, #17873) and independently arrived at essentially the same solution as this PR. Our testing confirms this approach works correctly.

Our Test Results

Environment:

  • SGLang v0.5.8
  • Model: GLM-4.7 (reasoning model)
  • Testing: Both streaming and non-streaming modes

Non-streaming test:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"glm-4.7","messages":[{"role":"user","content":"What is 2+2?"}]}' \
  | jq '.usage'

Result:

{
  "prompt_tokens": 12,
  "total_tokens": 88,
  "completion_tokens": 76,
  "reasoning_tokens": 67
}

reasoning_tokens correctly reported (was 0 before fix)

Streaming test:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"glm-4.7","messages":[{"role":"user","content":"What is 2+2?"}],"stream":true,"stream_options":{"include_usage":true}}' \
  | grep '"usage"' | tail -1 | sed 's/^data: //' | jq '.usage'

Result:

{
  "prompt_tokens": 12,
  "total_tokens": 127,
  "completion_tokens": 115,
  "reasoning_tokens": 105
}

reasoning_tokens correctly reported in streaming mode

Why This Approach is Correct

We analyzed the codebase and confirmed:

  1. The reasoning text is correctly parsed by ReasoningParser.parse_non_stream() into reasoning_content
  2. But the token count is never calculated or set on UsageInfo
  3. The backend's context.py initializes num_reasoning_tokens = 0 but never updates it during generation
  4. The Responses API (serving_responses.py) handles this correctly, but Chat Completions was missed

Your solution of re-tokenizing the reasoning text at the API layer is pragmatic and works. While the "ideal" fix would be tracking tokens at the engine level, that would be a much larger change requiring modifications to the generation pipeline.
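For readers following along, here is a minimal sketch of what re-tokenizing at the API layer amounts to (the "<think>"/"</think>" markers and the helper's shape are assumptions for illustration; real parsers use model-specific markers, and this is not the PR's exact calculate_reasoning_tokens implementation):

# Sketch: count reasoning tokens by re-tokenizing the reasoning segment,
# including its start/end markers, as the PR does.
def count_reasoning_tokens(full_text: str, tokenizer) -> int:
    start_tag, end_tag = "<think>", "</think>"
    begin = full_text.find(start_tag)
    if begin == -1:
        return 0
    stop = full_text.find(end_tag, begin)
    # If the reasoning segment was cut off, count through the end of text.
    stop = len(full_text) if stop == -1 else stop + len(end_tag)
    return len(tokenizer.encode(full_text[begin:stop]))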

Observations

Your PR improves on our initial implementation by:

  • ✅ Adding --enable-reasoning-tokens flag for opt-in (addresses performance concerns)
  • ✅ Adding the calculate_reasoning_tokens() method to ReasoningParser (clean encapsulation)
  • ✅ Handling the streaming case properly by accumulating content before calculating

Performance Note

The re-tokenization overhead is minimal since:

  • It only runs once per response (non-streaming) or once at the end (streaming)
  • Reasoning chains are typically much shorter than the full completion
  • The opt-in flag allows users to disable it if needed

Recommendation

This PR should be merged. We've been running this fix in production with several reasoning models (GLM-4.7, DeepSeek-R1, QwQ) and it works reliably. The bug significantly impacts usage tracking and billing accuracy for reasoning models.

Tested models:

  • ✅ GLM-4.7
  • ✅ DeepSeek-R1
  • ✅ QwQ-32B

Happy to provide additional test results or feedback if helpful!

@hnyls2002 (Collaborator)

Closing as a duplicate of #15562, which has been merged.

@hnyls2002 closed this on Apr 4, 2026.