[v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting #7700

Merged
yiz-liu merged 4 commits into vllm-project:releases/v0.18.0 from QwertyJack:pr/minimax-usage-accounting on Mar 27, 2026

@QwertyJack

What this PR does / why we need it?

This backports the MiniMax M2 reasoning-token usage accounting fix onto releases/v0.18.0 for vllm-ascend.

The release branch does not include the other local GLM patch commit, so this PR keeps the MiniMax change self-contained by:

  • registering patch_minimax_usage_accounting on the release branch
  • backporting completion_tokens_details.reasoning_tokens into chat usage generation
  • fixing MiniMax reasoning token counting for </think>-delimited outputs without depending on the GLM suffix patch

Does this PR introduce any user-facing change?

Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses now reports corrected reasoning token counts on the release branch.
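
To make the user-facing change concrete, here is a minimal sketch of a usage payload carrying the nested reasoning-token field. The dataclasses are hypothetical stand-ins, not vLLM's actual `UsageInfo` model; only the names `completion_tokens_details` and `reasoning_tokens` come from this PR, and the remaining fields are assumed to follow the usual OpenAI-style usage layout.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CompletionTokensDetails:
    # Hypothetical stand-in; only the field name is taken from the PR text.
    reasoning_tokens: int = 0


@dataclass
class UsageSketch:
    # Hypothetical stand-in for a usage model, not vLLM's UsageInfo.
    prompt_tokens: int
    completion_tokens: int
    completion_tokens_details: Optional[CompletionTokensDetails] = None

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def build_usage(prompt_tokens: int, completion_tokens: int, reasoning_tokens: int) -> dict:
    """Build a usage dict; reasoning tokens are a subset of completion tokens."""
    if not 0 <= reasoning_tokens <= completion_tokens:
        raise ValueError("reasoning_tokens must be within [0, completion_tokens]")
    usage = UsageSketch(
        prompt_tokens,
        completion_tokens,
        CompletionTokensDetails(reasoning_tokens),
    )
    payload = asdict(usage)  # nested dataclasses become nested dicts
    payload["total_tokens"] = usage.total_tokens
    return payload
```

In this sketch, reasoning tokens are reported as a breakdown of `completion_tokens` rather than as an additional charge, so `total_tokens` is unaffected by the split.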

How was this patch tested?

  • python -m compileall vllm_ascend/patch/platform/patch_minimax_usage_accounting.py
  • python - <<'PY' import check for vllm_ascend.patch.platform.patch_minimax_usage_accounting on top of releases/v0.18.0

No targeted automated regression test exists for this release-branch backport yet, so I validated syntax and module import compatibility on the release branch.
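
The body of the import check is not shown in this PR, so the following is only a guess at what such a check might look like. The helper name `import_check` is hypothetical; `importlib.import_module` is the standard-library call it relies on.

```python
import importlib


def import_check(module_name: str) -> bool:
    """Return True if the module imports cleanly, False on ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Module path taken from the PR description; this only succeeds in an
    # environment where vllm-ascend and its dependencies are installed.
    name = "vllm_ascend.patch.platform.patch_minimax_usage_accounting"
    print(name, "importable:", import_check(name))
```

A check like this confirms that the module and its imports resolve on the release branch, but it says nothing about whether the counting logic is correct.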

Cherry-pick the MiniMax reasoning-token usage accounting backport onto releases/v0.18.0 and keep the patch self-contained for a standalone release PR.

- register the MiniMax usage-accounting patch on the release branch
- backport reasoning token details into chat usage generation
- avoid the unrelated GLM tool-call suffix dependency from the other local commit

(cherry picked from commit f87bf5db7264df84e1baa8fc9f419459242771e9)

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
@QwertyJack QwertyJack requested a review from wangxiyuan as a code owner March 26, 2026 14:44
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request backports a critical bugfix to the releases/v0.18.0 branch of vllm-ascend, addressing inaccurate reasoning token accounting for MiniMax M2 responses in OpenAI-compatible chat usage. The changes ensure that the completion_tokens_details.reasoning_tokens field is correctly populated, providing accurate usage metrics for models that use </think> as a boundary token.

Highlights

  • MiniMax M2 Reasoning Token Accounting Fix: Corrected the accounting of reasoning tokens for MiniMax M2 responses in OpenAI-compatible chat usage.
  • Backport to Release Branch: Backported the fix to the releases/v0.18.0 branch for vllm-ascend to ensure stability and accuracy.
  • Self-Contained Patch: Implemented the fix as a self-contained patch, avoiding dependencies on other local GLM patch commits.
  • UsageInfo Extension: Extended UsageInfo with completion_tokens_details.reasoning_tokens to properly report reasoning token counts.
  • Chat Usage Generation Update: Updated chat streaming and non-streaming usage generation to propagate the corrected reasoning token counts, specifically for </think>-delimited outputs.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new patch to vllm-ascend to implement MiniMax M2 reasoning token accounting for OpenAI chat usage. It extends usage information models and modifies chat completion generators to correctly track reasoning tokens. A critical issue was identified in the _count_minimax_reasoning_tokens function, which currently miscounts all completion tokens as reasoning tokens if the end_token_id is not found, leading to inaccurate usage accounting. This function should be updated to return 0 in such scenarios.

Comment thread vllm_ascend/patch/platform/patch_minimax_usage_accounting.py
QwertyJack and others added 3 commits March 26, 2026 15:04
Add a focused regression test for the MiniMax usage-accounting backport.

The test locks in the intended MiniMax parser semantics for release v0.18.0:
- tokens before the first </think> count as reasoning
- if </think> has not appeared yet, all generated tokens are reasoning
- if </think> is the first token, reasoning token count is zero

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
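
The three semantics listed in the regression-test commit message above can be sketched as a small counting function. This is an illustration under stated assumptions, not the patch's actual `_count_minimax_reasoning_tokens` implementation: the function name, signature, and the token id `99` used for `</think>` in the usage note are hypothetical.

```python
from typing import Sequence


def count_reasoning_tokens(output_token_ids: Sequence[int], end_token_id: int) -> int:
    """Count generated tokens that precede the first end-of-reasoning token."""
    for index, token_id in enumerate(output_token_ids):
        if token_id == end_token_id:
            # Tokens before the first </think> are reasoning; if </think>
            # is the very first token, index is 0 and the count is zero.
            return index
    # </think> has not appeared yet, so all generated tokens are reasoning.
    return len(output_token_ids)
```

With `99` standing in for the `</think>` id, `count_reasoning_tokens([1, 2, 99, 3], 99)` gives 2, `count_reasoning_tokens([1, 2, 3], 99)` gives 3, and `count_reasoning_tokens([99, 1], 99)` gives 0. The middle case is where the test's semantics differ from the review suggestion to return 0 when the end token is missing.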
Apply the repository formatter output required by the pre-commit CI job for the MiniMax usage-accounting backport and its regression test.

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
@yiz-liu yiz-liu changed the title from "[Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting" to "[v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting" Mar 27, 2026
@yiz-liu yiz-liu merged commit 53cc225 into vllm-project:releases/v0.18.0 Mar 27, 2026
17 checks passed
@yiz-liu yiz-liu added this to the v0.18.0rc1 milestone Mar 27, 2026
keyi-zz pushed a commit to keyi-zz/vllm-ascend that referenced this pull request Apr 20, 2026
…unting (vllm-project#7700)

wangxiyuan pushed a commit that referenced this pull request Apr 30, 2026
## What this PR does / why we need it?

Backports the MiniMax-M2 reasoning usage accounting fix into the vLLM
Ascend platform patch layer for the vLLM 0.19.1 runtime.

The patch:
- adds completion_tokens_details.reasoning_tokens to UsageInfo
- fixes MiniMax-M2 reasoning token counting before the first </think>
token
- wraps chat streaming/non-streaming generators to track raw output
token ids and inject reasoning usage details
- avoids source-code extraction/replacement of OpenAIServingChat methods
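
The generator-wrapping approach in the list above can be sketched as follows. This is a simplified illustration, not the patch itself: chunks are modeled as plain dicts with hypothetical `token_ids` and `usage` keys, `END_TOKEN_ID = 99` is a made-up id for `</think>`, and the real `OpenAIServingChat` generators yield different objects.

```python
import asyncio
from typing import AsyncIterator, Sequence

END_TOKEN_ID = 99  # hypothetical token id standing in for </think>


def count_reasoning(token_ids: Sequence[int], end_token_id: int) -> int:
    """Tokens before the first end token; all tokens if it never appears."""
    ids = list(token_ids)
    try:
        return ids.index(end_token_id)
    except ValueError:
        return len(ids)


async def with_reasoning_usage(
    chunks: AsyncIterator[dict], end_token_id: int
) -> AsyncIterator[dict]:
    """Wrap a chunk stream, tracking raw token ids and enriching usage."""
    seen = []
    async for chunk in chunks:
        seen.extend(chunk.get("token_ids", []))
        usage = chunk.get("usage")
        if usage is not None:
            details = {"reasoning_tokens": count_reasoning(seen, end_token_id)}
            # Copy rather than mutate, so the inner generator's data is untouched.
            chunk = {**chunk, "usage": {**usage, "completion_tokens_details": details}}
        yield chunk


async def fake_stream() -> AsyncIterator[dict]:
    # Stand-in for the wrapped generator; the final chunk carries usage.
    yield {"token_ids": [1, 2]}
    yield {"token_ids": [END_TOKEN_ID, 3]}
    yield {"token_ids": [], "usage": {"prompt_tokens": 5, "completion_tokens": 4}}


async def collect(stream: AsyncIterator[dict]) -> list:
    return [chunk async for chunk in stream]


if __name__ == "__main__":
    for item in asyncio.run(collect(with_reasoning_usage(fake_stream(), END_TOKEN_ID))):
        print(item)
```

Wrapping the generator leaves the inner method untouched and only observes and augments what it yields, which fits the stated goal of avoiding source-code extraction or replacement.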

References:
- vLLM upstream PR: vllm-project/vllm#37955
- vLLM Ascend PR: #7700
- vLLM Ascend PR: #7835

## Does this PR introduce _any_ user-facing change?

Yes. MiniMax-M2 OpenAI-compatible chat responses can now include
accurate completion_tokens_details.reasoning_tokens usage accounting.

## How was this patch tested?

- PYTHONPATH=../vllm:. pytest -q --confcutdir=tests/ut/patch/platform
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py
  - 10 passed
- ruff check
vllm_ascend/patch/platform/patch_minimax_usage_accounting.py
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py
vllm_ascend/patch/platform/__init__.py
  - passed
- python -m py_compile
vllm_ascend/patch/platform/patch_minimax_usage_accounting.py
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py
  - passed

Note: running the test without --confcutdir currently loads the
repo-wide UT conftest, which imports worker patches and fails before
test collection because ../vllm v0.19.1 does not contain
vllm.model_executor.models.qwen3_dflash.
- vLLM version: v0.19.1
- vLLM main:
vllm-project/vllm@d886c26

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
…project#8831)

PiratePai pushed a commit to PiratePai/vllm-ascend that referenced this pull request May 7, 2026
…project#8831)

Signed-off-by: PiratePai <416932041@qq.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 10, 2026
…project#8831)

Signed-off-by: yangzhe-2026 <yangzhe@isrc.iscas.ac.cn>
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 12, 2026
…project#8831)

Signed-off-by: ZhuQi-seu <zhuqi12@huawei.com>
nanxingMy pushed a commit to nanxingMy/vllm-ascend that referenced this pull request May 15, 2026
…project#8831)

Signed-off-by: nanxing <1014662416@qq.com>