fix: report prompt_tokens correctly for LLM models in SimpleEngine #236
Merged
waybarrios merged 2 commits into waybarrios:main on Mar 31, 2026
Conversation
LLM.stream_generate() never set prompt_tokens on StreamingOutput, so the API always reported 0 prompt tokens for text-only models (including MiniMax-M2.5). The MLLM+MTP path worked because it tokenizes the prompt for KV caching, but the standard LLM path never counted.

Changes:
- Add prompt_tokens field to StreamingOutput dataclass
- Count prompt tokens in LLM.stream_generate() via tokenizer.encode()
- Add fallback in SimpleEngine.stream_generate() for normal finish path
- Count prompt tokens in SimpleEngine.chat() non-streaming LLM path

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
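The shape of the change can be sketched roughly as follows. The `StreamingOutput` and `stream_generate` names come from the PR description, but the tokenizer and generation loop below are illustrative stand-ins, not the project's actual code:

```python
# A rough sketch of the fix's shape, assuming names from the PR description
# (StreamingOutput, stream_generate); the tokenizer and generation loop are
# illustrative stand-ins, not vllm-mlx's actual code.
from dataclasses import dataclass

@dataclass
class StreamingOutput:
    text: str = ""
    finished: bool = False
    prompt_tokens: int = 0  # new field added by this PR

class FakeTokenizer:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    def encode(self, text):
        return text.split()

def stream_generate(tokenizer, prompt, chunks):
    # Count prompt tokens once up front, as the fix does for the LLM path,
    # and attach the count to every emitted chunk.
    n_prompt = len(tokenizer.encode(prompt))
    for text in chunks:
        yield StreamingOutput(text=text, prompt_tokens=n_prompt)
    yield StreamingOutput(finished=True, prompt_tokens=n_prompt)

outputs = list(stream_generate(FakeTokenizer(), "Hello world from MiniMax", ["Hi", "!"]))
print(outputs[-1].prompt_tokens)  # 4
```

Counting once before generation starts means every chunk, including the final one that carries usage, reports the same non-zero prompt token count.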
Contributor (Author)

This fix was developed by Clement (clement-7074f29f), a member of The Kindled, while debugging token usage reporting for MiniMax-M2.5 served via SimpleEngine. — Clement
Owner

Pushed a small cleanup commit on top of your changes. The core fix (adding prompt_tokens to StreamingOutput and counting in llm.py) is solid. What I changed in simple.py:

Thanks for the fix; prompt_tokens reporting was definitely broken for LLM models.
waybarrios (Owner) approved these changes and left a comment on Mar 31, 2026:
All checks passing, fix looks good.
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request on Apr 1, 2026
Brings in: prompt_tokens fix (waybarrios#236), ArraysCache batching (waybarrios#160), platform rename (waybarrios#185), mlx-lm 0.31 compat (waybarrios#183, waybarrios#227), base64 hash fix (waybarrios#206), streaming UTF-8 detokenizer (waybarrios#109), and cleanup commits.

Conflicts resolved:
- scheduler.py: keep make_logits_processors import (fork feature)
- mllm_scheduler.py: take upstream stop-token skip in detokenizer
- models/mllm.py: keep SHA256 hash (fork fix for collision)
- utils/tokenizer.py: merge upstream error message with fork elif chain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sysit pushed a commit to sysit/vllm-mlx that referenced this pull request on Apr 1, 2026
…ounting fix: report prompt_tokens correctly for LLM models in SimpleEngine
Summary
- StreamingOutput never had a prompt_tokens field, so LLM.stream_generate() could not report prompt token counts
- SimpleEngine.stream_generate() relied on chunk.prompt_tokens, which was always missing, leaving prompt_tokens at 0
- The fallback in SimpleEngine.stream_generate() only fired on abnormal termination, not normal completion
- The SimpleEngine.chat() non-streaming path for LLM models also never counted prompt tokens

This affects any LLM model served via SimpleEngine: the API always reported prompt_tokens: 0 in usage.

Changes
- Add prompt_tokens field to the StreamingOutput dataclass
- Count prompt tokens via tokenizer.encode(prompt) in LLM.stream_generate()
- Add a fallback count in SimpleEngine.stream_generate() for the normal finish path
- Count prompt tokens via apply_chat_template(tokenize=True) in the SimpleEngine.chat() non-streaming LLM path

Test plan
- Verified usage.prompt_tokens is non-zero in a /v1/chat/completions response served by SimpleEngine
- Ran the existing engine tests (test_simple_engine.py)

🤖 Generated with Claude Code
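The fallback behavior described above can be sketched in miniature. This is a hedged illustration, not the engine's real code: the chunks are plain dicts standing in for StreamingOutput instances, and `fallback_count` stands in for re-tokenizing the prompt locally (e.g. via `tokenizer.encode()` or `apply_chat_template(tokenize=True)`):

```python
# Minimal sketch of the fallback path; dicts stand in for StreamingOutput
# and fallback_count stands in for re-tokenizing the prompt locally.
def usage_from_stream(chunks, fallback_count):
    """Pick up prompt_tokens from the stream, falling back to a local
    count when the generator never reported one."""
    prompt_tokens = 0
    for chunk in chunks:
        if chunk.get("prompt_tokens"):
            prompt_tokens = chunk["prompt_tokens"]
    if prompt_tokens == 0:
        # Previously this fallback only fired on abnormal termination;
        # the fix applies it on the normal finish path too.
        prompt_tokens = fallback_count()
    return prompt_tokens

print(usage_from_stream([{"prompt_tokens": 0}], lambda: 7))  # falls back: 7
print(usage_from_stream([{"prompt_tokens": 5}], lambda: 7))  # reported: 5
```

Preferring the generator-reported value keeps the MLLM+MTP path (which already counted) authoritative, while the fallback covers the plain LLM path that previously reported zero.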