[v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting by QwertyJack · Pull Request #7700 · vllm-project/vllm-ascend

QwertyJack · 2026-03-26T14:44:12Z

What this PR does / why we need it?

This backports the MiniMax M2 reasoning-token usage accounting fix onto releases/v0.18.0 for vllm-ascend.

The release branch does not include the other local GLM patch commit, so this PR keeps the MiniMax change self-contained by:

- registering `patch_minimax_usage_accounting` on the release branch
- backporting `completion_tokens_details.reasoning_tokens` into chat usage generation
- fixing MiniMax reasoning token counting for `</think>`-delimited outputs without depending on the GLM suffix patch

Does this PR introduce any user-facing change?

Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses now reports corrected reasoning token counts on the release branch.
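To make the user-facing change concrete, here is a minimal sketch of the usage shape a client would see. The dataclasses below are hypothetical stand-ins written for illustration, not the actual vLLM or vllm-ascend models; only the field names `completion_tokens_details` and `reasoning_tokens` come from this PR, and the remaining fields follow the usual OpenAI-compatible usage layout.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CompletionTokensDetails:
    # Hypothetical stand-in: number of completion tokens spent on reasoning.
    reasoning_tokens: int = 0


@dataclass
class UsageInfo:
    # Hypothetical stand-in for the usage object returned with a chat response.
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    completion_tokens_details: Optional[CompletionTokensDetails] = None


def build_usage(prompt_tokens: int, completion_tokens: int, reasoning_tokens: int) -> UsageInfo:
    """Assemble a usage record that carries the reasoning-token breakdown."""
    return UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        completion_tokens_details=CompletionTokensDetails(reasoning_tokens=reasoning_tokens),
    )


usage = build_usage(prompt_tokens=12, completion_tokens=30, reasoning_tokens=18)
usage_dict = asdict(usage)
print(usage_dict)
```

In this sketch the reasoning tokens are a subset of `completion_tokens`, so they do not change `total_tokens`.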

How was this patch tested?

- `python -m compileall vllm_ascend/patch/platform/patch_minimax_usage_accounting.py`
- `python - <<'PY'` import check for `vllm_ascend.patch.platform.patch_minimax_usage_accounting` on top of `releases/v0.18.0`

No targeted automated regression test exists for this release-branch backport yet, so I validated syntax and module import compatibility on the release branch.

Cherry-pick the MiniMax reasoning-token usage accounting backport onto releases/v0.18.0 and keep the patch self-contained for a standalone release PR.

- register the MiniMax usage-accounting patch on the release branch
- backport reasoning token details into chat usage generation
- avoid the unrelated GLM tool-call suffix dependency from the other local commit

(cherry picked from commit f87bf5db7264df84e1baa8fc9f419459242771e9)

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

gemini-code-assist · 2026-03-26T14:44:32Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request backports a critical bugfix to the releases/v0.18.0 branch of vllm-ascend, addressing inaccurate reasoning token accounting for MiniMax M2 responses in OpenAI-compatible chat usage. The changes ensure that the completion_tokens_details.reasoning_tokens field is correctly populated, providing accurate usage metrics for models that use </think> as a boundary token.

Highlights

- MiniMax M2 Reasoning Token Accounting Fix: Corrected the accounting of reasoning tokens for MiniMax M2 responses in OpenAI-compatible chat usage.
- Backport to Release Branch: Backported the fix to the releases/v0.18.0 branch for vllm-ascend to ensure stability and accuracy.
- Self-Contained Patch: Implemented the fix as a self-contained patch, avoiding dependencies on other local GLM patch commits.
- UsageInfo Extension: Extended UsageInfo with completion_tokens_details.reasoning_tokens to properly report reasoning token counts.
- Chat Usage Generation Update: Updated chat streaming and non-streaming usage generation to propagate the corrected reasoning token counts, specifically for </think>-delimited outputs.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a new patch to vllm-ascend to implement MiniMax M2 reasoning token accounting for OpenAI chat usage. It extends usage information models and modifies chat completion generators to correctly track reasoning tokens. A critical issue was identified in the _count_minimax_reasoning_tokens function, which currently miscounts all completion tokens as reasoning tokens if the end_token_id is not found, leading to inaccurate usage accounting. This function should be updated to return 0 in such scenarios.

Add a focused regression test for the MiniMax usage-accounting backport.

The test locks in the intended MiniMax parser semantics for release v0.18.0:
- tokens before the first </think> count as reasoning
- if </think> has not appeared yet, all generated tokens are reasoning
- if </think> is the first token, reasoning token count is zero

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
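The three rules in this commit message can be sketched as a small counting function. This is an illustrative sketch only, not the code of `_count_minimax_reasoning_tokens`: the function name `count_reasoning_tokens` and the token id `99` standing in for `</think>` are assumptions made for the example.

```python
from collections.abc import Sequence

# Hypothetical token id standing in for the "</think>" boundary token.
THINK_END_TOKEN_ID = 99


def count_reasoning_tokens(token_ids: Sequence[int], end_token_id: int) -> int:
    """Count the tokens generated before the first end-of-reasoning token.

    If the boundary token has not appeared, every generated token is treated
    as reasoning, matching the semantics described in the commit message.
    """
    for index, token_id in enumerate(token_ids):
        if token_id == end_token_id:
            # Tokens before the first boundary count as reasoning;
            # the boundary token itself is not counted.
            return index
    return len(token_ids)


# Boundary in the middle: only the tokens before it are reasoning.
middle = count_reasoning_tokens([1, 2, THINK_END_TOKEN_ID, 3], THINK_END_TOKEN_ID)
# Boundary not generated yet: all tokens are reasoning.
missing = count_reasoning_tokens([1, 2, 3], THINK_END_TOKEN_ID)
# Boundary is the first token: no reasoning tokens.
first = count_reasoning_tokens([THINK_END_TOKEN_ID, 1], THINK_END_TOKEN_ID)
```

Note that the second case differs from the automated review's suggestion to return 0 when the end token is not found; the regression test described here keeps the "all tokens are reasoning" behavior.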

Apply the repository formatter output required by the pre-commit CI job for the MiniMax usage-accounting backport and its regression test.

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

## What this PR does / why we need it?

Backports the MiniMax-M2 reasoning usage accounting fix into the vLLM Ascend platform patch layer for the vLLM 0.19.1 runtime.

The patch:
- adds completion_tokens_details.reasoning_tokens to UsageInfo
- fixes MiniMax-M2 reasoning token counting before the first </think> token
- wraps chat streaming/non-streaming generators to track raw output token ids and inject reasoning usage details
- avoids source-code extraction/replacement of OpenAIServingChat methods

References:
- vLLM upstream PR: vllm-project/vllm#37955
- vLLM Ascend PR: #7700
- vLLM Ascend PR: #7835

## Does this PR introduce _any_ user-facing change?

Yes. MiniMax-M2 OpenAI-compatible chat responses can now include accurate completion_tokens_details.reasoning_tokens usage accounting.

## How was this patch tested?

- PYTHONPATH=../vllm:. pytest -q --confcutdir=tests/ut/patch/platform tests/ut/patch/platform/test_patch_minimax_usage_accounting.py - 10 passed
- ruff check vllm_ascend/patch/platform/patch_minimax_usage_accounting.py tests/ut/patch/platform/test_patch_minimax_usage_accounting.py vllm_ascend/patch/platform/__init__.py - passed
- python -m py_compile vllm_ascend/patch/platform/patch_minimax_usage_accounting.py tests/ut/patch/platform/test_patch_minimax_usage_accounting.py - passed

Note: running the test without --confcutdir currently loads the repo-wide UT conftest, which imports worker patches and fails before test collection because ../vllm v0.19.1 does not contain vllm.model_executor.models.qwen3_dflash.

- vLLM version: v0.19.1
- vLLM main: vllm-project/vllm@d886c26

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
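The generator-wrapping approach mentioned in this commit message (tracking raw output token ids and injecting reasoning usage details, rather than rewriting method source) could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Chunk` type, the `with_reasoning_usage` wrapper, the fake stream, and the token id `99` for `</think>` are hypothetical and do not reflect the actual `OpenAIServingChat` interfaces.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    # Hypothetical stream chunk: raw token ids plus an optional usage dict
    # that is only set on the final chunk.
    token_ids: list[int]
    usage: Optional[dict] = None


async def with_reasoning_usage(
    chunks: AsyncIterator[Chunk], end_token_id: int
) -> AsyncIterator[Chunk]:
    """Wrap a chunk stream, tracking token ids and annotating the usage."""
    seen: list[int] = []
    async for chunk in chunks:
        seen.extend(chunk.token_ids)
        if chunk.usage is not None:
            # Tokens before the first boundary token count as reasoning;
            # if the boundary never appeared, all tokens do.
            reasoning = seen.index(end_token_id) if end_token_id in seen else len(seen)
            chunk.usage = {
                **chunk.usage,
                "completion_tokens_details": {"reasoning_tokens": reasoning},
            }
        yield chunk


async def fake_stream() -> AsyncIterator[Chunk]:
    # Hypothetical upstream generator; 99 stands in for "</think>".
    yield Chunk([1, 2])
    yield Chunk([99, 3])
    yield Chunk([], usage={"prompt_tokens": 5, "completion_tokens": 4, "total_tokens": 9})


async def collect() -> list[Chunk]:
    return [chunk async for chunk in with_reasoning_usage(fake_stream(), end_token_id=99)]


final_usage = asyncio.run(collect())[-1].usage
print(final_usage)
```

Wrapping the generator leaves the original method untouched, which is one plausible reason to prefer it over source-level replacement when the upstream code changes between versions.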

QwertyJack requested a review from wangxiyuan as a code owner March 26, 2026 14:44

gemini-code-assist Bot reviewed Mar 26, 2026

Comment thread vllm_ascend/patch/platform/patch_minimax_usage_accounting.py

QwertyJack and others added 3 commits March 26, 2026 15:04

Merge branch 'releases/v0.18.0' into pr/minimax-usage-accounting

12a95fd

QwertyJack mentioned this pull request Mar 26, 2026

[Bugfix][Platform] Fix GLM47 tool-call finish backfill QwertyJack/vllm-ascend#2

Closed

yiz-liu changed the title ~~[Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting~~ [v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting Mar 27, 2026

yiz-liu merged commit 53cc225 into vllm-project:releases/v0.18.0 Mar 27, 2026
17 checks passed

yiz-liu added this to the v0.18.0rc1 milestone Mar 27, 2026

QwertyJack mentioned this pull request Apr 30, 2026

[BugFix][Platform] Backport MiniMax reasoning usage accounting #8831

Merged

yiz-liu merged 4 commits into vllm-project:releases/v0.18.0 from QwertyJack:pr/minimax-usage-accounting
