[v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting #7700
Cherry-pick the MiniMax reasoning-token usage accounting backport onto releases/v0.18.0 and keep the patch self-contained for a standalone release PR.

- register the MiniMax usage-accounting patch on the release branch
- backport reasoning token details into chat usage generation
- avoid the unrelated GLM tool-call suffix dependency from the other local commit

(cherry picked from commit f87bf5db7264df84e1baa8fc9f419459242771e9)

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Code Review
This pull request introduces a new patch to vllm-ascend to implement MiniMax M2 reasoning token accounting for OpenAI chat usage. It extends usage information models and modifies chat completion generators to correctly track reasoning tokens. A critical issue was identified in the _count_minimax_reasoning_tokens function, which currently miscounts all completion tokens as reasoning tokens if the end_token_id is not found, leading to inaccurate usage accounting. This function should be updated to return 0 in such scenarios.
Add a focused regression test for the MiniMax usage-accounting backport.

The test locks in the intended MiniMax parser semantics for release v0.18.0:

- tokens before the first </think> count as reasoning
- if </think> has not appeared yet, all generated tokens are reasoning
- if </think> is the first token, reasoning token count is zero

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
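The three cases listed in this commit message can be illustrated with a minimal sketch. This is not the code from the patch: the function name `count_reasoning_tokens` and the token id `THINK_END` are hypothetical stand-ins, and the real `_count_minimax_reasoning_tokens` may differ in signature and detail.

```python
from collections.abc import Sequence


def count_reasoning_tokens(token_ids: Sequence[int], end_token_id: int) -> int:
    """Count the tokens generated before the first end-of-reasoning token."""
    for index, token_id in enumerate(token_ids):
        if token_id == end_token_id:
            # Everything before the first </think> is reasoning; the
            # delimiter itself is not counted.
            return index
    # </think> has not appeared yet, so the whole output is still reasoning.
    return len(token_ids)


THINK_END = 99  # hypothetical token id standing in for </think>

# Tokens before the first </think> count as reasoning.
print(count_reasoning_tokens([5, 6, 7, THINK_END, 8], THINK_END))
# </think> not seen yet: all generated tokens are reasoning.
print(count_reasoning_tokens([5, 6, 7], THINK_END))
# </think> is the first token: zero reasoning tokens.
print(count_reasoning_tokens([THINK_END, 8], THINK_END))
```

Note that the second case is where this sketch and the review suggestion diverge: the review proposed returning 0 when the end token is missing, while the regression test described here treats a missing `</think>` as "still reasoning".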
Apply the repository formatter output required by the pre-commit CI job for the MiniMax usage-accounting backport and its regression test. Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
…unting (vllm-project#7700)

### What this PR does / why we need it?

This backports the MiniMax M2 reasoning-token usage accounting fix onto `releases/v0.18.0` for vllm-ascend.

The release branch does not include the other local GLM patch commit, so this PR keeps the MiniMax change self-contained by:

- registering `patch_minimax_usage_accounting` on the release branch
- backporting `completion_tokens_details.reasoning_tokens` into chat usage generation
- fixing MiniMax reasoning token counting for `</think>`-delimited outputs without depending on the GLM suffix patch

### Does this PR introduce _any_ user-facing change?

Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses now reports corrected reasoning token counts on the release branch.

### How was this patch tested?

- `python -m compileall vllm_ascend/patch/platform/patch_minimax_usage_accounting.py`
- `python - <<'PY'` import check for `vllm_ascend.patch.platform.patch_minimax_usage_accounting` on top of `releases/v0.18.0`

No targeted automated regression test exists for this release-branch backport yet, so I validated syntax and module import compatibility on the release branch.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
## What this PR does / why we need it?

Backports the MiniMax-M2 reasoning usage accounting fix into the vLLM Ascend platform patch layer for the vLLM 0.19.1 runtime.

The patch:

- adds `completion_tokens_details.reasoning_tokens` to `UsageInfo`
- fixes MiniMax-M2 reasoning token counting before the first `</think>` token
- wraps chat streaming/non-streaming generators to track raw output token ids and inject reasoning usage details
- avoids source-code extraction/replacement of `OpenAIServingChat` methods

References:

- vLLM upstream PR: vllm-project/vllm#37955
- vLLM Ascend PR: #7700
- vLLM Ascend PR: #7835

## Does this PR introduce _any_ user-facing change?

Yes. MiniMax-M2 OpenAI-compatible chat responses can now include accurate `completion_tokens_details.reasoning_tokens` usage accounting.

## How was this patch tested?

- `PYTHONPATH=../vllm:. pytest -q --confcutdir=tests/ut/patch/platform tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`: 10 passed
- `ruff check vllm_ascend/patch/platform/patch_minimax_usage_accounting.py tests/ut/patch/platform/test_patch_minimax_usage_accounting.py vllm_ascend/patch/platform/__init__.py`: passed
- `python -m py_compile vllm_ascend/patch/platform/patch_minimax_usage_accounting.py tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`: passed

Note: running the test without `--confcutdir` currently loads the repo-wide UT conftest, which imports worker patches and fails before test collection because `../vllm` v0.19.1 does not contain `vllm.model_executor.models.qwen3_dflash`.

- vLLM version: v0.19.1
- vLLM main: vllm-project/vllm@d886c26

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
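The generator-wrapping approach described above (track raw output token ids, then inject reasoning usage details) might look roughly like the following sketch. It deliberately avoids importing vLLM: the `Chunk` class, the `with_reasoning_usage` wrapper, and the plain-dict usage payload are all hypothetical simplifications, not the types or method names used by `OpenAIServingChat` or the actual patch.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for one streamed output chunk."""

    token_ids: list
    usage: Optional[dict] = None  # assumed to be set only on the final chunk


def count_reasoning_tokens(token_ids: list, end_token_id: int) -> int:
    """Tokens before the first end token; all tokens if it never appears."""
    try:
        return token_ids.index(end_token_id)
    except ValueError:
        return len(token_ids)


async def with_reasoning_usage(
    chunks: AsyncIterator[Chunk], end_token_id: int
) -> AsyncIterator[Chunk]:
    """Wrap a chunk stream, accumulating token ids and annotating usage."""
    seen: list = []
    async for chunk in chunks:
        seen.extend(chunk.token_ids)
        if chunk.usage is not None:
            # Inject the reasoning count next to the existing usage fields.
            chunk.usage["completion_tokens_details"] = {
                "reasoning_tokens": count_reasoning_tokens(seen, end_token_id)
            }
        yield chunk
```

Wrapping the generator leaves the wrapped method untouched, which is one way to avoid extracting and rewriting the original source, at the cost of depending on the shape of the chunks the generator yields.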
### What this PR does / why we need it?

This backports the MiniMax M2 reasoning-token usage accounting fix onto `releases/v0.18.0` for vllm-ascend.

The release branch does not include the other local GLM patch commit, so this PR keeps the MiniMax change self-contained by:

- registering `patch_minimax_usage_accounting` on the release branch
- backporting `completion_tokens_details.reasoning_tokens` into chat usage generation
- fixing MiniMax reasoning token counting for `</think>`-delimited outputs without depending on the GLM suffix patch

### Does this PR introduce _any_ user-facing change?

Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses now reports corrected reasoning token counts on the release branch.

### How was this patch tested?

- `python -m compileall vllm_ascend/patch/platform/patch_minimax_usage_accounting.py`
- `python - <<'PY'` import check for `vllm_ascend.patch.platform.patch_minimax_usage_accounting` on top of `releases/v0.18.0`

No targeted automated regression test exists for this release-branch backport yet, so I validated syntax and module import compatibility on the release branch.
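The import check mentioned above is only named, not shown. A minimal sketch of what such a check could look like follows; the helper name `can_import` is hypothetical, and the actual heredoc script used for this PR is not part of the description.

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if the module imports cleanly, printing the error if not."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # patch modules may fail with more than ImportError
        print(f"import failed for {module_name}: {exc!r}")
        return False
    return True
```

On a checkout of `releases/v0.18.0`, this could be called as `can_import("vllm_ascend.patch.platform.patch_minimax_usage_accounting")`. It confirms only that the module loads against the installed vLLM, not that the accounting is correct.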