
[OpenAI] Fix reasoning_tokens accounting for MiniMax M2 chat completions #37955

Open
QwertyJack wants to merge 1 commit into vllm-project:main from QwertyJack:fix/minimax-m2-usage-reasoning

Conversation

@QwertyJack QwertyJack commented Mar 24, 2026

Contributor

Summary

  • add completion_tokens_details.reasoning_tokens to OpenAI chat-completions usage
  • count MiniMax M2 reasoning tokens using the first </think> token as the reasoning boundary for both minimax_m2 and minimax_m2_append_think
  • cover non-streaming and streaming usage reporting with targeted unit tests
  • mark the new pure unit tests with skip_global_cleanup because they do not allocate accelerator state and do not need global cleanup
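
The boundary rule in the second bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `count_reasoning_tokens`, the token id `99`, and the two edge-case choices (the `</think>` token itself is not counted, and output with no `</think>` is treated as all reasoning) are assumptions made for the example.

```python
from typing import Sequence


def count_reasoning_tokens(token_ids: Sequence[int], end_token_id: int) -> int:
    """Count the tokens generated before the first end-of-thinking token.

    Assumptions of this sketch (not confirmed by the PR):
    - generation starts in reasoning mode, with no opening <think> token;
    - the </think> token itself is not counted as a reasoning token;
    - if </think> never appears, every token counts as reasoning.
    """
    for index, token_id in enumerate(token_ids):
        if token_id == end_token_id:
            # Only the first </think> marks the boundary; later ones are content.
            return index
    return len(token_ids)


# Hypothetical id standing in for the tokenizer's real </think> token id.
THINK_END = 99

reasoning = count_reasoning_tokens([11, 12, 13, THINK_END, 21, THINK_END, 22], THINK_END)
```

Here `reasoning` covers only the three tokens before the first `</think>`; the second `</think>` appears in the answer text and does not move the boundary.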

Testing

  • python -m pytest -q tests/entrypoints/openai/chat_completion/test_serving_chat.py -k 'test_chat_usage_includes_reasoning_tokens_for_minimax_parser or test_chat_stream_usage_includes_reasoning_tokens_for_minimax_parser'
  • python -m pytest -q tests/reasoning/test_minimax_m2_reasoning_parser.py tests/reasoning/test_minimax_m2_append_reasoning_parser.py -k count_reasoning_tokens

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds tracking and reporting of reasoning tokens in the usage information for chat completions, specifically for MiniMax M2 models. It defines a new CompletionTokenUsageInfo data structure, implements reasoning-token counting in the MiniMaxM2ReasoningParser and MiniMaxM2AppendThinkReasoningParser classes, and integrates that counting into the OpenAIServingChat class for both non-streaming and streaming chat completions. New unit tests verify that reasoning tokens are calculated and reported correctly for these parsers.
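
For orientation, the resulting usage shape might look like the sketch below. It uses plain dataclasses as stand-ins and is not the PR's definition: the class names `CompletionTokenUsageInfo` and `UsageInfo` come from this thread, but the field layout simply follows the OpenAI usage schema, and the helper `build_usage` is hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CompletionTokenUsageInfo:
    # Tokens the model spent on reasoning before the final answer.
    reasoning_tokens: int = 0


@dataclass
class UsageInfo:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    # Left as None when no reasoning count is available for the request.
    completion_tokens_details: Optional[CompletionTokenUsageInfo] = None


def build_usage(prompt_tokens: int, completion_tokens: int,
                reasoning_tokens: Optional[int] = None) -> UsageInfo:
    """Hypothetical helper: assemble usage, attaching details only if counted."""
    details = None
    if reasoning_tokens is not None:
        details = CompletionTokenUsageInfo(reasoning_tokens=reasoning_tokens)
    return UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        completion_tokens_details=details,
    )


usage = asdict(build_usage(prompt_tokens=12, completion_tokens=40, reasoning_tokens=25))
```

In this sketch, reasoning tokens are a subset of `completion_tokens`, so `total_tokens` is unchanged by the new field.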

@QwertyJack QwertyJack changed the title from "[OpenAI] Fix MiniMax M2 reasoning token usage accounting" to "[OpenAI] Fix reasoning_tokens accounting for MiniMax M2 chat completions" Mar 24, 2026
@QwertyJack QwertyJack force-pushed the fix/minimax-m2-usage-reasoning branch from 03be0f6 to 2598da4 Compare March 24, 2026 03:44
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 30, 2026
## What this PR does / why we need it?

Backports the MiniMax-M2 reasoning usage accounting fix into the vLLM
Ascend platform patch layer for the vLLM 0.19.1 runtime.

The patch:
- adds completion_tokens_details.reasoning_tokens to UsageInfo
- fixes MiniMax-M2 reasoning token counting before the first </think>
token
- wraps chat streaming/non-streaming generators to track raw output
token ids and inject reasoning usage details
- avoids source-code extraction/replacement of OpenAIServingChat methods
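
The generator-wrapping approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: the chunk shape (dicts with a `token_ids` list and an optional `usage` dict), the wrapper name `track_reasoning_usage`, and the `count_fn` callback are all hypothetical.

```python
from typing import AsyncIterator, Callable, Dict, List


async def track_reasoning_usage(
    chunks: AsyncIterator[Dict],
    count_fn: Callable[[List[int]], int],
) -> AsyncIterator[Dict]:
    """Wrap a chunk stream, remember raw output token ids, and add
    reasoning usage details to any chunk that carries usage."""
    seen_token_ids: List[int] = []
    async for chunk in chunks:
        # Accumulate raw token ids so the count reflects the whole output so far.
        seen_token_ids.extend(chunk.get("token_ids", []))
        usage = chunk.get("usage")
        if usage is not None:
            # Copy rather than mutate, so the wrapped generator's data is untouched.
            chunk = {
                **chunk,
                "usage": {
                    **usage,
                    "completion_tokens_details": {
                        "reasoning_tokens": count_fn(seen_token_ids),
                    },
                },
            }
        yield chunk
```

Wrapping the generator, rather than rewriting the serving method's source, keeps the patch independent of the exact upstream implementation.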

References:
- vLLM upstream PR: vllm-project/vllm#37955
- vLLM Ascend PR: #7700
- vLLM Ascend PR: #7835

## Does this PR introduce _any_ user-facing change?

Yes. MiniMax-M2 OpenAI-compatible chat responses can now include
accurate completion_tokens_details.reasoning_tokens usage accounting.

## How was this patch tested?

- PYTHONPATH=../vllm:. pytest -q --confcutdir=tests/ut/patch/platform
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py
  - 10 passed
- ruff check
vllm_ascend/patch/platform/patch_minimax_usage_accounting.py
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py
vllm_ascend/patch/platform/__init__.py
  - passed
- python -m py_compile
vllm_ascend/patch/platform/patch_minimax_usage_accounting.py
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py
  - passed

Note: running the test without --confcutdir currently loads the
repo-wide UT conftest, which imports worker patches and fails before
test collection because ../vllm v0.19.1 does not contain
vllm.model_executor.models.qwen3_dflash.
- vLLM version: v0.19.1
- vLLM main:
vllm-project/vllm@d886c26

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
…project#8831)

PiratePai pushed a commit to PiratePai/vllm-ascend that referenced this pull request May 7, 2026
…project#8831)

Signed-off-by: PiratePai <416932041@qq.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 10, 2026
…project#8831)

Signed-off-by: yangzhe-2026 <yangzhe@isrc.iscas.ac.cn>
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 12, 2026
…project#8831)

Signed-off-by: ZhuQi-seu <zhuqi12@huawei.com>
nanxingMy pushed a commit to nanxingMy/vllm-ascend that referenced this pull request May 15, 2026
…project#8831)

Signed-off-by: nanxing <1014662416@qq.com>
@QwertyJack QwertyJack force-pushed the fix/minimax-m2-usage-reasoning branch from 2598da4 to f232eaa Compare May 17, 2026 15:11
@QwertyJack QwertyJack force-pushed the fix/minimax-m2-usage-reasoning branch from f232eaa to f976977 Compare June 1, 2026 14:49
mergify Bot commented Jun 1, 2026

Contributor

Hi @QwertyJack, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@QwertyJack QwertyJack force-pushed the fix/minimax-m2-usage-reasoning branch from f976977 to f9d1af8 Compare June 1, 2026 15:03
mergify Bot commented Jun 2, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @QwertyJack.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 2, 2026
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
@QwertyJack QwertyJack force-pushed the fix/minimax-m2-usage-reasoning branch from f9d1af8 to 3232393 Compare June 5, 2026 10:21
@mergify mergify Bot removed the needs-rebase label Jun 5, 2026