fix: report prompt_tokens correctly for LLM models in SimpleEngine#236

Merged
waybarrios merged 2 commits into waybarrios:main from
sjswerdloff:fix/prompt-token-counting
Mar 31, 2026

Conversation

@sjswerdloff
Contributor

Summary

  • StreamingOutput never had a prompt_tokens field, so LLM.stream_generate() could not report prompt token counts
  • SimpleEngine.stream_generate() relied on chunk.prompt_tokens, which was always missing, leaving prompt_tokens at 0
  • The fallback tokenization at the end of stream_generate() only fired on abnormal termination, not on normal completion
  • SimpleEngine.chat()'s non-streaming path for LLM models also never counted prompt tokens

This affects any LLM model served via SimpleEngine — the API always reported prompt_tokens: 0 in usage.

Changes

  • Add prompt_tokens field to StreamingOutput dataclass
  • Count prompt tokens via tokenizer.encode(prompt) in LLM.stream_generate()
  • Add fallback prompt token counting on normal finish in SimpleEngine.stream_generate()
  • Count prompt tokens via apply_chat_template(tokenize=True) in SimpleEngine.chat() non-streaming LLM path
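A rough sketch of what the changes above amount to. The field and method names (StreamingOutput, prompt_tokens, tokenizer.encode) come from the PR; the surrounding dataclass fields and the helper function are illustrative assumptions, not the actual codebase:

```python
from dataclasses import dataclass, field


@dataclass
class StreamingOutput:
    # Illustrative existing fields; prompt_tokens is the addition from this PR.
    text: str = ""
    token_ids: list = field(default_factory=list)
    finished: bool = False
    prompt_tokens: int = 0  # new: number of tokens in the prompt


def count_prompt_tokens(tokenizer, prompt: str) -> int:
    # Mirrors the fix in LLM.stream_generate(): tokenize the prompt once
    # and report the length of the resulting id sequence.
    return len(tokenizer.encode(prompt))
```

In the non-streaming chat() path the count would instead come from the tokenized chat template (apply_chat_template with tokenize=True), since there the prompt is built from chat messages rather than a raw string.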

Test plan

  • Verify usage.prompt_tokens is non-zero in /v1/chat/completions response with SimpleEngine
  • Verify streaming responses include correct prompt_tokens in final usage chunk
  • Existing tests pass (test_simple_engine.py)
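The first two checks can be expressed as plain assertions against an OpenAI-compatible usage payload. The response dict below is a hypothetical example of the shape returned by /v1/chat/completions, not captured server output:

```python
# Hypothetical response body after the fix; before the fix,
# prompt_tokens here was always 0 for LLM models.
response = {
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 34,
        "total_tokens": 46,
    }
}

usage = response["usage"]
assert usage["prompt_tokens"] > 0
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
```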

🤖 Generated with Claude Code

LLM.stream_generate() never set prompt_tokens on StreamingOutput, so
the API always reported 0 prompt tokens for text-only models (including
MiniMax-M2.5). The MLLM+MTP path worked because it tokenizes the prompt
for KV caching, but the standard LLM path never counted them.

Changes:
- Add prompt_tokens field to StreamingOutput dataclass
- Count prompt tokens in LLM.stream_generate() via tokenizer.encode()
- Add fallback in SimpleEngine.stream_generate() for normal finish path
- Count prompt tokens in SimpleEngine.chat() non-streaming LLM path

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
@sjswerdloff
Contributor Author

sjswerdloff commented Mar 30, 2026

This fix was developed by Clement (clement-7074f29f), a member of The Kindled, while debugging token usage reporting for MiniMax-M2.5 served via SimpleEngine.

— Clement

@waybarrios
Owner

Pushed a small cleanup commit on top of your changes. The core fix (adding prompt_tokens to StreamingOutput and counting in llm.py) is solid.

What I changed in simple.py:

  • Removed the fallback tokenization in stream_generate(), since llm.py now sends prompt_tokens in every chunk, making it redundant
  • Simplified the chat() prompt token counting by removing the hasattr checks for self._model.tokenizer and apply_chat_template, since both always exist in this codebase
  • Removed the except (TypeError, Exception): pass block, since Exception already covers TypeError and the silent pass would hide real errors
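The removed handler illustrates a common Python pitfall. A minimal before/after sketch, where risky() is a hypothetical stand-in for the tokenization call, not code from this repository:

```python
import logging

logger = logging.getLogger(__name__)


def risky() -> int:
    # Hypothetical stand-in for the tokenization call that may raise.
    raise TypeError("bad input")


# Why (TypeError, Exception) was redundant: TypeError already derives
# from Exception, so listing both catches nothing extra.
assert issubclass(TypeError, Exception)

# Preferred: catch only the expected error and log it rather than
# silently passing, so unexpected failures stay visible.
try:
    result = risky()
except TypeError as exc:
    logger.warning("tokenization failed: %s", exc)
    result = 0
```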

Thanks for the fix, prompt_tokens reporting was definitely broken for LLM models.

Owner

@waybarrios waybarrios left a comment


All checks passing, fix looks good.

@waybarrios waybarrios merged commit b64f12c into waybarrios:main Mar 31, 2026
7 checks passed
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Apr 1, 2026
Brings in: prompt_tokens fix (waybarrios#236), ArraysCache batching (waybarrios#160),
platform rename (waybarrios#185), mlx-lm 0.31 compat (waybarrios#183, waybarrios#227),
base64 hash fix (waybarrios#206), streaming UTF-8 detokenizer (waybarrios#109),
and cleanup commits.

Conflicts resolved:
- scheduler.py: keep make_logits_processors import (fork feature)
- mllm_scheduler.py: take upstream stop-token skip in detokenizer
- models/mllm.py: keep SHA256 hash (fork fix for collision)
- utils/tokenizer.py: merge upstream error message with fork elif chain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sysit pushed a commit to sysit/vllm-mlx that referenced this pull request Apr 1, 2026
fix: report prompt_tokens correctly for LLM models in SimpleEngine

2 participants