Support reasoning_tokens with openai style in serving_chat by lw9527 · Pull Request #17764 · sgl-project/sglang

lw9527 · 2026-01-26T12:51:25Z

Motivation

Support reasoning_tokens with openai style in serving_chat

Modifications

support reasoning_tokens with openai style in serving_chat
add --enable-reasoning-tokens server args, reasoning_tokens are not returned by default.

Accuracy Tests

non stream
import openai client = openai.Client( api_key="", base_url="http://127.0.0.1:40823/v1", ) response = client.chat.completions.create( model="default", messages=[{"role":"user","content":"who are you?"}], max_tokens=20000, temperature=1, stream=False, stream_options={"include_usage":True} ) print(response)

stream

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-26T12:51:47Z

Summary of Changes

Hello @lw9527, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the capability to track and report 'reasoning tokens' within the OpenAI-compatible chat serving endpoint. This enhancement provides users with a more granular breakdown of token consumption, distinguishing between standard completion tokens and those generated during the model's internal reasoning steps. The feature is configurable via a new server-side flag, allowing for flexible adoption and detailed cost analysis for complex model interactions.

Highlights

Reasoning Token Support: Introduced support for reporting reasoning_tokens in the OpenAI-style serving chat, providing more detailed token usage information.
New Server Argument: Added a new server argument --enable-reasoning-tokens to explicitly enable or disable the calculation and reporting of reasoning tokens.
Enhanced UsageInfo Protocol: The UsageInfo protocol has been updated to include CompletionTokensDetails and PromptTokensDetails for more granular reporting of token types, aligning with OpenAI's API structure. The existing reasoning_tokens field is now marked as deprecated for backward compatibility.
Reasoning Token Calculation: Implemented a new calculate_reasoning_tokens method within the ReasoningParser to accurately count tokens associated with the model's internal reasoning process, including start and end tokens.
Comprehensive Testing: Added extensive unit tests for both the serving_chat functionality and the ReasoningParser to ensure correct calculation and reporting of reasoning tokens in various scenarios, including streaming and non-streaming responses, and multiple choices.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces support for reasoning_tokens in the OpenAI-style serving, including new fields in the protocol, logic for calculating these tokens in chat and responses, and corresponding server arguments and tests. The changes are well-covered by tests, ensuring the new functionality works as expected for both streaming and non-streaming scenarios, and for multiple choices. Overall, the implementation is robust and adds valuable functionality.

gemini-code-assist · 2026-01-26T12:53:51Z

+        completion_tokens_details = None
+        if num_reasoning_tokens > 0:
+            completion_tokens_details = CompletionTokensDetails(
+                reasoning_tokens=num_reasoning_tokens,
+            )


The completion_tokens_details is only created if num_reasoning_tokens > 0. However, num_reasoning_tokens is only populated within the if self.use_harmony: block (lines 451-458). If self.use_harmony is False, num_reasoning_tokens will always be 0, preventing completion_tokens_details from being set for non-Harmony contexts, even if reasoning tokens were calculated and available in meta_info from serving_chat.py.

gemini-code-assist · 2026-01-26T12:53:52Z

+                reasoning_parser = ReasoningParser(
+                    self.reasoning_parser,
+                    request.stream_reasoning,
+                    is_force_reasoning,


The ReasoningParser is instantiated inside the loop, which means it will be created for each choice (request.n). If the model_type, stream_reasoning, and force_reasoning are constant for all choices within a single request, it would be more efficient to instantiate reasoning_parser once outside this loop and reuse it.

Suggested change

reasoning_parser = ReasoningParser(

self.reasoning_parser,

request.stream_reasoning,

is_force_reasoning,

reasoning_parser_instance = ReasoningParser(

self.reasoning_parser,

request.stream_reasoning,

is_force_reasoning,

)

for index, full_content in stream_buffers.items():

reasoning_tokens[index] = (

reasoning_parser_instance.calculate_reasoning_tokens(

full_content, self.tokenizer_manager.tokenizer

)

)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

shanemort1982 · 2026-02-05T14:34:55Z

Confirmation: We independently tested a similar fix and it works! ✅

We encountered the same reasoning_tokens: 0 bug (issues #15446, #15508, #17873) and independently arrived at essentially the same solution as this PR. Our testing confirms this approach works correctly.

Our Test Results

Environment:

SGLang v0.5.8
Model: GLM-4.7 (reasoning model)
Testing: Both streaming and non-streaming modes

Non-streaming test:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"glm-4.7","messages":[{"role":"user","content":"What is 2+2?"}]}' \
  | jq '.usage'

Result:

{
  "prompt_tokens": 12,
  "total_tokens": 88,
  "completion_tokens": 76,
  "reasoning_tokens": 67
}

✅ reasoning_tokens correctly reported (was 0 before fix)

Streaming test:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"glm-4.7","messages":[{"role":"user","content":"What is 2+2?"}],"stream":true,"stream_options":{"include_usage":true}}' \
  | grep '"usage"' | tail -1 | sed 's/^data: //' | jq '.usage'

Result:

{
  "prompt_tokens": 12,
  "total_tokens": 127,
  "completion_tokens": 115,
  "reasoning_tokens": 105
}

✅ reasoning_tokens correctly reported in streaming mode

Why This Approach is Correct

We analyzed the codebase and confirmed:

The reasoning text is correctly parsed by ReasoningParser.parse_non_stream() into reasoning_content
But the token count is never calculated or set on UsageInfo
The backend's context.py initializes num_reasoning_tokens = 0 but never updates it during generation
The Responses API (serving_responses.py) handles this correctly, but Chat Completions was missed

Your solution of re-tokenizing the reasoning text at the API layer is pragmatic and works. While the "ideal" fix would be tracking tokens at the engine level, that would be a much larger change requiring modifications to the generation pipeline.

Observations

Your PR improves on our initial implementation by:

✅ Adding --enable-reasoning-tokens flag for opt-in (addresses performance concerns)
✅ Adding the calculate_reasoning_tokens() method to ReasoningParser (clean encapsulation)
✅ Handling the streaming case properly by accumulating content before calculating

Performance Note

The re-tokenization overhead is minimal since:

It only runs once per response (non-streaming) or once at the end (streaming)
Reasoning chains are typically much shorter than the full completion
The opt-in flag allows users to disable it if needed

Recommendation

This PR should be merged. We've been running this fix in production with several reasoning models (GLM-4.7, DeepSeek-R1, QwQ) and it works reliably. The bug significantly impacts usage tracking and billing accuracy for reasoning models.

Tested models:

✅ GLM-4.7
✅ DeepSeek-R1
✅ QwQ-32B

Happy to provide additional test results or feedback if helpful!

hnyls2002 · 2026-04-04T09:19:44Z

Closing as duplicate of #15562 which has been merged.

lw9527 requested review from CatherineSue, JustinTong0323, ispobock, merrymercy and slin1237 as code owners January 26, 2026 12:51

gemini-code-assist Bot reviewed Jan 26, 2026

View reviewed changes

feat:add reasoning_tokens in openai style

8c5be32

lw9527 force-pushed the add_reasoning_tokens branch from 3fa4303 to 8c5be32 Compare February 3, 2026 07:37

lw9527 and others added 3 commits February 3, 2026 15:42

Apply suggestion from @gemini-code-assist[bot]

573a933

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Merge branch 'main' into add_reasoning_tokens

05a4cb1

update

d0280fd

hnyls2002 closed this Apr 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support reasoning_tokens with openai style in serving_chat#17764

Support reasoning_tokens with openai style in serving_chat#17764
lw9527 wants to merge 4 commits into
sgl-project:mainfrom
lw9527:add_reasoning_tokens

lw9527 commented Jan 26, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot commented Jan 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jan 26, 2026

Uh oh!

gemini-code-assist Bot Jan 26, 2026

Uh oh!

Uh oh!

shanemort1982 commented Feb 5, 2026

Uh oh!

hnyls2002 commented Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lw9527 commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

Uh oh!

gemini-code-assist Bot commented Jan 26, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

shanemort1982 commented Feb 5, 2026

Confirmation: We independently tested a similar fix and it works! ✅

Our Test Results

Why This Approach is Correct

Observations

Performance Note

Recommendation

Uh oh!

hnyls2002 commented Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lw9527 commented Jan 26, 2026 •

edited

Loading