[OpenAI] Fix reasoning_tokens accounting for MiniMax M2 chat completions by QwertyJack · Pull Request #37955 · vllm-project/vllm

QwertyJack · 2026-03-24T03:25:15Z

Summary

- add completion_tokens_details.reasoning_tokens to OpenAI chat-completions usage
- count MiniMax M2 reasoning tokens using the first </think> token as the reasoning boundary for both minimax_m2 and minimax_m2_append_think
- cover non-streaming and streaming usage reporting with targeted unit tests
- mark the new pure unit tests with skip_global_cleanup because they do not allocate accelerator state and do not need global cleanup
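
The boundary rule in the summary can be illustrated with a minimal sketch. This is not the parser code from the PR: the function name, the `end_think_id` parameter, and the token ids are hypothetical. It also assumes that the </think> token itself is not counted, and that an output with no </think> token is treated as entirely reasoning; the PR description does not confirm either detail.

```python
from collections.abc import Sequence


def count_reasoning_tokens(token_ids: Sequence[int], end_think_id: int) -> int:
    """Count the tokens that precede the first end-of-think token.

    Assumption (not confirmed by the PR text): if the end-of-think token
    never appears, every generated token is treated as reasoning.
    """
    for index, token_id in enumerate(token_ids):
        if token_id == end_think_id:
            # Only the first occurrence matters; later ones are ignored.
            return index
    return len(token_ids)


# Hypothetical ids: 99 stands in for the </think> token.
output_ids = [5, 6, 7, 99, 8, 99, 9]
reasoning = count_reasoning_tokens(output_ids, end_think_id=99)
```

Scanning for the first occurrence only means a later literal </think> in the answer text does not inflate the count.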

Testing

python -m pytest -q tests/entrypoints/openai/chat_completion/test_serving_chat.py -k 'test_chat_usage_includes_reasoning_tokens_for_minimax_parser or test_chat_stream_usage_includes_reasoning_tokens_for_minimax_parser'
python -m pytest -q tests/reasoning/test_minimax_m2_reasoning_parser.py tests/reasoning/test_minimax_m2_append_reasoning_parser.py -k count_reasoning_tokens

gemini-code-assist

Code Review

This pull request introduces the capability to track and report 'reasoning tokens' as part of the usage information for chat completions, specifically for Minimax M2 models. This involves defining a new CompletionTokenUsageInfo data structure, implementing the logic to count reasoning tokens within the MiniMaxM2ReasoningParser and MiniMaxM2AppendThinkReasoningParser classes, and integrating this counting into the OpenAIServingChat class for both full and streaming chat completions. New unit tests were added to verify the correct calculation and reporting of reasoning tokens for these parsers.
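
As a rough illustration of the usage payload described in the review, the sketch below models the nested structure with standard-library dataclasses. The class names CompletionTokenUsageInfo and UsageInfo come from the PR text; the use of dataclasses, the `build_usage` helper, and the numbers are assumptions for illustration and do not mirror the actual vLLM protocol definitions.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CompletionTokenUsageInfo:
    # Number of completion tokens attributed to reasoning.
    reasoning_tokens: Optional[int] = None


@dataclass
class UsageInfo:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    completion_tokens_details: Optional[CompletionTokenUsageInfo] = None


def build_usage(prompt_tokens: int, completion_tokens: int,
                reasoning_tokens: Optional[int]) -> UsageInfo:
    """Hypothetical helper: attach details only when a count is available."""
    details = None
    if reasoning_tokens is not None:
        details = CompletionTokenUsageInfo(reasoning_tokens=reasoning_tokens)
    return UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        completion_tokens_details=details,
    )


usage = asdict(build_usage(prompt_tokens=10, completion_tokens=7, reasoning_tokens=3))
```

In this sketch reasoning tokens are a subset of completion_tokens, so total_tokens is unchanged by the new field.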

## What this PR does / why we need it?

Backports the MiniMax-M2 reasoning usage accounting fix into the vLLM Ascend platform patch layer for the vLLM 0.19.1 runtime. The patch:

- adds completion_tokens_details.reasoning_tokens to UsageInfo
- fixes MiniMax-M2 reasoning token counting before the first </think> token
- wraps chat streaming/non-streaming generators to track raw output token ids and inject reasoning usage details
- avoids source-code extraction/replacement of OpenAIServingChat methods

References:

- vLLM upstream PR: vllm-project/vllm#37955
- vLLM Ascend PR: #7700
- vLLM Ascend PR: #7835

## Does this PR introduce _any_ user-facing change?

Yes. MiniMax-M2 OpenAI-compatible chat responses can now include accurate completion_tokens_details.reasoning_tokens usage accounting.

## How was this patch tested?

- PYTHONPATH=../vllm:. pytest -q --confcutdir=tests/ut/patch/platform tests/ut/patch/platform/test_patch_minimax_usage_accounting.py - 10 passed
- ruff check vllm_ascend/patch/platform/patch_minimax_usage_accounting.py tests/ut/patch/platform/test_patch_minimax_usage_accounting.py vllm_ascend/patch/platform/__init__.py - passed
- python -m py_compile vllm_ascend/patch/platform/patch_minimax_usage_accounting.py tests/ut/patch/platform/test_patch_minimax_usage_accounting.py - passed

Note: running the test without --confcutdir currently loads the repo-wide UT conftest, which imports worker patches and fails before test collection because ../vllm v0.19.1 does not contain vllm.model_executor.models.qwen3_dflash.

- vLLM version: v0.19.1
- vLLM main: vllm-project/vllm@d886c26

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
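
The generator-wrapping approach mentioned in the backport description can be sketched as below. This is a simplified illustration, not the vllm-ascend patch: the `Chunk` type, `track_reasoning_usage`, `replay`, and `collect` are hypothetical names, the real stream items are richer objects, and treating an output without </think> as entirely reasoning is an assumption.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for one streamed response chunk."""
    token_ids: list
    usage: Optional[dict] = None


async def track_reasoning_usage(
    chunks: AsyncIterator[Chunk], end_think_id: int
) -> AsyncIterator[Chunk]:
    """Wrap a chunk stream, accumulating raw token ids as they pass through."""
    seen: list = []
    async for chunk in chunks:
        seen.extend(chunk.token_ids)
        if chunk.usage is not None:
            # Count tokens before the first end-of-think token (assumption:
            # all tokens count as reasoning if the token never appears).
            reasoning = seen.index(end_think_id) if end_think_id in seen else len(seen)
            chunk.usage["completion_tokens_details"] = {"reasoning_tokens": reasoning}
        yield chunk


async def replay(items: list) -> AsyncIterator[Chunk]:
    """Hypothetical source that replays prepared chunks."""
    for item in items:
        yield item


async def collect(stream: AsyncIterator[Chunk]) -> list:
    return [chunk async for chunk in stream]


# Hypothetical ids: 99 stands in for the </think> token.
demo = [Chunk([5, 6]), Chunk([99, 7]), Chunk([8], usage={"completion_tokens": 5})]
result = asyncio.run(collect(track_reasoning_usage(replay(demo), end_think_id=99)))
```

Wrapping the generator keeps the counting logic outside the serving method, which matches the stated goal of avoiding source-code extraction or replacement of OpenAIServingChat methods.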


mergify · 2026-06-01T14:57:19Z

Hi @QwertyJack, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

mergify · 2026-06-02T05:01:35Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @QwertyJack.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

QwertyJack requested review from DarkLight1337, NickLucche, aarnphm, chaunceyjiang, robertgshaw2-redhat and russellb as code owners March 24, 2026 03:25

mergify Bot added the frontend label Mar 24, 2026

gemini-code-assist Bot reviewed Mar 24, 2026

View reviewed changes

QwertyJack changed the title ~~[OpenAI] Fix MiniMax M2 reasoning token usage accounting~~ [OpenAI] Fix reasoning_tokens accounting for MiniMax M2 chat completions Mar 24, 2026

QwertyJack force-pushed the fix/minimax-m2-usage-reasoning branch from 03be0f6 to 2598da4 Compare March 24, 2026 03:44

QwertyJack mentioned this pull request Mar 26, 2026

[v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting vllm-project/vllm-ascend#7700

Merged

QwertyJack mentioned this pull request Apr 30, 2026

[BugFix][Platform] Backport MiniMax reasoning usage accounting vllm-project/vllm-ascend#8831

Merged

QwertyJack force-pushed the fix/minimax-m2-usage-reasoning branch from 2598da4 to f232eaa Compare May 17, 2026 15:11

QwertyJack requested review from bbrowning and sfeng33 as code owners May 17, 2026 15:11

QwertyJack force-pushed the fix/minimax-m2-usage-reasoning branch from f232eaa to f976977 Compare June 1, 2026 14:49

QwertyJack requested a review from AndreasKaratzas as a code owner June 1, 2026 14:49

QwertyJack force-pushed the fix/minimax-m2-usage-reasoning branch from f976977 to f9d1af8 Compare June 1, 2026 15:03

mergify Bot added the needs-rebase label Jun 2, 2026

[OpenAI] Fix MiniMax M2 reasoning token usage accounting

3232393

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

QwertyJack force-pushed the fix/minimax-m2-usage-reasoning branch from f9d1af8 to 3232393 Compare June 5, 2026 10:21

mergify Bot removed the needs-rebase label Jun 5, 2026

QwertyJack wants to merge 1 commit into vllm-project:main from QwertyJack:fix/minimax-m2-usage-reasoning