fix: report prompt_tokens correctly for LLM models in SimpleEngine#236

Merged
waybarrios merged 2 commits into waybarrios:main from
sjswerdloff:fix/prompt-token-counting
Mar 31, 2026

Conversation

@sjswerdloff
Contributor

Summary

  • StreamingOutput never had a prompt_tokens field, so LLM.stream_generate() could not report prompt token counts
  • SimpleEngine.stream_generate() relied on chunk.prompt_tokens, which was always missing, leaving prompt_tokens at 0
  • The fallback tokenization at the end of stream_generate() only fired on abnormal termination, not on normal completion
  • SimpleEngine.chat()'s non-streaming path for LLM models also never counted prompt tokens

This affects any LLM model served via SimpleEngine — the API always reported prompt_tokens: 0 in usage.

Changes

  • Add prompt_tokens field to StreamingOutput dataclass
  • Count prompt tokens via tokenizer.encode(prompt) in LLM.stream_generate()
  • Add fallback prompt token counting on normal finish in SimpleEngine.stream_generate()
  • Count prompt tokens via apply_chat_template(tokenize=True) in SimpleEngine.chat() non-streaming LLM path
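A rough sketch of what the changes above amount to. The field and method names (StreamingOutput, prompt_tokens, tokenizer.encode) come from the PR; the surrounding dataclass fields and the helper function are illustrative assumptions, not the actual codebase:

```python
from dataclasses import dataclass, field


@dataclass
class StreamingOutput:
    # Illustrative existing fields; prompt_tokens is the addition from this PR.
    text: str = ""
    token_ids: list = field(default_factory=list)
    finished: bool = False
    prompt_tokens: int = 0  # new: number of tokens in the prompt


def count_prompt_tokens(tokenizer, prompt: str) -> int:
    # Mirrors the fix in LLM.stream_generate(): tokenize the prompt once
    # and report the length of the resulting id sequence.
    return len(tokenizer.encode(prompt))
```

In the non-streaming chat() path the count would instead come from the tokenized chat template (apply_chat_template with tokenize=True), since there the prompt is built from chat messages rather than a raw string.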

Test plan

  • Verify usage.prompt_tokens is non-zero in /v1/chat/completions response with SimpleEngine
  • Verify streaming responses include correct prompt_tokens in final usage chunk
  • Existing tests pass (test_simple_engine.py)
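The first two checks can be expressed as plain assertions against an OpenAI-compatible usage payload. The response dict below is a hypothetical example of the shape returned by /v1/chat/completions, not captured server output:

```python
# Hypothetical response body after the fix; before the fix,
# prompt_tokens here was always 0 for LLM models.
response = {
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 34,
        "total_tokens": 46,
    }
}

usage = response["usage"]
assert usage["prompt_tokens"] > 0
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
```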

🤖 Generated with Claude Code

LLM.stream_generate() never set prompt_tokens on StreamingOutput, so
the API always reported 0 prompt tokens for text-only models (including
MiniMax-M2.5). The MLLM+MTP path worked because it tokenizes the prompt
for KV caching, but the standard LLM path never counted them.

Changes:
- Add prompt_tokens field to StreamingOutput dataclass
- Count prompt tokens in LLM.stream_generate() via tokenizer.encode()
- Add fallback in SimpleEngine.stream_generate() for normal finish path
- Count prompt tokens in SimpleEngine.chat() non-streaming LLM path

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
@sjswerdloff
Contributor Author

sjswerdloff commented Mar 30, 2026

This fix was developed by Clement (clement-7074f29f), a member of The Kindled, while debugging token usage reporting for MiniMax-M2.5 served via SimpleEngine.

— Clement

@waybarrios
Owner

Pushed a small cleanup commit on top of your changes. The core fix (adding prompt_tokens to StreamingOutput and counting in llm.py) is solid.

What I changed in simple.py:

  • Removed the fallback tokenization in stream_generate(), since llm.py now sends prompt_tokens in every chunk, making it redundant
  • Simplified the chat() prompt token counting by removing the hasattr checks for self._model.tokenizer and apply_chat_template, since both always exist in this codebase
  • Removed the except (TypeError, Exception): pass block, since Exception already covers TypeError and the silent pass would hide real errors
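The removed handler illustrates a common Python pitfall. A minimal before/after sketch, where risky() is a hypothetical stand-in for the tokenization call, not code from this repository:

```python
import logging

logger = logging.getLogger(__name__)


def risky() -> int:
    # Hypothetical stand-in for the tokenization call that may raise.
    raise TypeError("bad input")


# Why (TypeError, Exception) was redundant: TypeError already derives
# from Exception, so listing both catches nothing extra.
assert issubclass(TypeError, Exception)

# Preferred: catch only the expected error and log it rather than
# silently passing, so unexpected failures stay visible.
try:
    result = risky()
except TypeError as exc:
    logger.warning("tokenization failed: %s", exc)
    result = 0
```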

Thanks for the fix, prompt_tokens reporting was definitely broken for LLM models.

Owner

@waybarrios waybarrios left a comment


All checks passing, fix looks good.

@waybarrios waybarrios merged commit b64f12c into waybarrios:main Mar 31, 2026
7 checks passed
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Apr 1, 2026
Brings in: prompt_tokens fix (waybarrios#236), ArraysCache batching (waybarrios#160),
platform rename (waybarrios#185), mlx-lm 0.31 compat (waybarrios#183, waybarrios#227),
base64 hash fix (waybarrios#206), streaming UTF-8 detokenizer (waybarrios#109),
and cleanup commits.

Conflicts resolved:
- scheduler.py: keep make_logits_processors import (fork feature)
- mllm_scheduler.py: take upstream stop-token skip in detokenizer
- models/mllm.py: keep SHA256 hash (fork fix for collision)
- utils/tokenizer.py: merge upstream error message with fork elif chain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sysit pushed a commit to sysit/vllm-mlx that referenced this pull request Apr 1, 2026
fix: report prompt_tokens correctly for LLM models in SimpleEngine

2 participants