[Bugfix] Set num_cached_tokens = num_computed_tokens if unset by gcanlin · Pull Request #1471 · vllm-project/vllm-omni

gcanlin · 2026-02-25T09:19:49Z

Purpose

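Request.num_cached_tokens is initialized to -1 (self.num_cached_tokens = -1 in Request.py). The base vLLM scheduler sets num_cached_tokens = num_computed_tokens when it schedules a request, but OmniGenerationScheduler overrides schedule() with a fast path that never updates num_cached_tokens. The value therefore stays at -1, and the metrics logger fails with "ValueError: Counters can only be incremented by non-negative amounts." (see the traceback under Test Result). This PR sets num_cached_tokens = num_computed_tokens in the fast path whenever the field is still unset.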
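The guard described above can be sketched as follows. This is a minimal illustration, not the actual diff: the `Request` dataclass is a hypothetical stand-in carrying only the two fields discussed here, and `record_cached_tokens` is a made-up helper name for the logic that the fast path would need to run.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for vLLM's Request (only the fields discussed here)."""
    num_computed_tokens: int = 0
    num_cached_tokens: int = -1  # -1 marks the field as "not set yet"


def record_cached_tokens(request: Request) -> None:
    """Fill in num_cached_tokens only if it still holds the -1 sentinel."""
    if request.num_cached_tokens < 0:
        request.num_cached_tokens = request.num_computed_tokens


# A request that reaches the fast path with nothing computed yet.
req = Request(num_computed_tokens=0)
record_cached_tokens(req)
print(req.num_cached_tokens)
```

After the call the field is non-negative, so any later consumer that treats it as a count no longer sees the sentinel.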

Test Plan

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

curl http://localhost:8091/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": "Describe vLLM in brief."}],
    "modalities": ["audio"]
  }'

Test Result

Before(crash):

[Stage-0] INFO 02-25 08:50:26 [loggers.py:259] Engine 000: Avg prompt throughput: 1.8 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(Worker pid=1412101) [Stage-1] INFO 02-25 08:50:27 [mrope.py:345] Multimodal token idx changed!
(Worker pid=1413446) [Stage-2] INFO 02-25 08:50:30 [mrope.py:345] Multimodal token idx changed!
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710] AsyncLLM output_handler failed.
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710] Traceback (most recent call last):
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]   File "/home/guocanlin/vllm-omni-workspace/vllm/vllm/v1/engine/async_llm.py", line 703, in output_handler
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]     logger_manager.record(
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]   File "/home/guocanlin/vllm-omni-workspace/vllm/vllm/v1/metrics/loggers.py", line 1309, in record
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]     logger.record(
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]   File "/home/guocanlin/vllm-omni-workspace/vllm/vllm/v1/metrics/loggers.py", line 1113, in record
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]     self.counter_prompt_tokens_by_source[source][engine_idx].inc(
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]   File "/home/guocanlin/vllm-omni-workspace/.venv/lib/python3.12/site-packages/prometheus_client/metrics.py", line 290, in inc
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]     raise ValueError('Counters can only be incremented by non-negative amounts.')
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710] ValueError: Counters can only be incremented by non-negative amounts.
(APIServer pid=1406484) WARNING 02-25 08:50:30 [serving_chat.py:1304] final output type: text is not needed by the request
(APIServer pid=1406484) INFO:     127.0.0.1:50888 - "POST /v1/chat/completions HTTP/1.1" 200 OK
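The ValueError in the traceback above comes from a counter that refuses negative increments. The sketch below reproduces that behavior with a hypothetical stand-in class (`NonNegativeCounter`) so that it runs without prometheus_client; it assumes the unset value -1 reaches the increment unchanged, which the thread does not spell out.

```python
class NonNegativeCounter:
    """Hypothetical stand-in that mimics the check seen in the traceback."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1) -> None:
        # Same error message as the one logged by Stage-2.
        if amount < 0:
            raise ValueError("Counters can only be incremented by non-negative amounts.")
        self.value += amount


def try_inc(counter: NonNegativeCounter, amount: float) -> bool:
    """Return True if the increment succeeded, False if it was rejected."""
    try:
        counter.inc(amount)
        return True
    except ValueError:
        return False


counter = NonNegativeCounter()
unset_ok = try_inc(counter, -1)  # the -1 sentinel is rejected
fixed_ok = try_inc(counter, 0)   # a value of 0 is accepted
print(unset_ok, fixed_ok)
```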

After:

ng: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-0] INFO 02-25 09:12:03 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-2] INFO 02-25 09:12:04 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=1478603) WARNING 02-25 09:18:48 [protocol.py:51] The following fields were present in the request but ignored: {'modalities'}
(APIServer pid=1478603) INFO 02-25 09:18:48 [async_omni.py:327] [AsyncOrchestrator] Entering scheduling loop: stages=3, final_stage=2
[Stage-0] INFO 02-25 09:18:53 [loggers.py:259] Engine 000: Avg prompt throughput: 1.5 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(Worker pid=1484185) [Stage-1] INFO 02-25 09:18:54 [mrope.py:345] Multimodal token idx changed!
[Stage-1] INFO 02-25 09:18:58 [loggers.py:259] Engine 000: Avg prompt throughput: 2.1 tokens/s, Avg generation throughput: 16.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-0] INFO 02-25 09:19:03 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(Worker pid=1485435) [Stage-2] INFO 02-25 09:19:04 [mrope.py:345] Multimodal token idx changed!
(APIServer pid=1478603) WARNING 02-25 09:19:05 [serving_chat.py:1304] final output type: text is not needed by the request
(APIServer pid=1478603) INFO:     127.0.0.1:50892 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[Stage-1] INFO 02-25 09:19:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 25.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-0] INFO 02-25 09:19:13 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-2] INFO 02-25 09:19:14 [loggers.py:259] Engine 000: Avg prompt throughput: 660.8 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-1] INFO 02-25 09:19:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-2] INFO 02-25 09:19:24 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan. Please provide the test scripts and test commands. Please state the reason if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
The test results. Please paste a comparison of the results before and after, or e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
(Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

linyueqian

LGTM.

One minor question: should the num_cached_tokens assignment be done unconditionally (i.e. drop the if < 0 guard) to match the base scheduler's behavior exactly? The base scheduler always sets num_cached_tokens = num_computed_tokens without a guard. The defensive check is fine for now, just wondering if there's a reason to prefer it over the unconditional approach.

linyueqian

Correction on my earlier comment: I checked the base vLLM scheduler (vllm/v1/core/sched/scheduler.py:798) and it also uses the same if request.num_cached_tokens < 0 guard. So the approach here is perfectly consistent with upstream. No concerns at all!
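To make the difference between the two variants discussed in this review concrete, here is a small sketch. The `Req` class and both helper functions are hypothetical names used only for illustration, and whether num_computed_tokens actually changes between scheduling passes in the OmniGenerationScheduler fast path is an assumption, not something stated in this thread.

```python
from dataclasses import dataclass


@dataclass
class Req:
    """Hypothetical stand-in with only the two fields under discussion."""
    num_computed_tokens: int = 0
    num_cached_tokens: int = -1


def set_guarded(req: Req) -> None:
    # Variant used in this PR: write only while the field is unset.
    if req.num_cached_tokens < 0:
        req.num_cached_tokens = req.num_computed_tokens


def set_unconditional(req: Req) -> None:
    # Alternative raised in review: overwrite on every scheduling pass.
    req.num_cached_tokens = req.num_computed_tokens


guarded, unconditional = Req(), Req()
for computed in (0, 8):  # two scheduling passes; the count grows in between
    guarded.num_computed_tokens = computed
    unconditional.num_computed_tokens = computed
    set_guarded(guarded)
    set_unconditional(unconditional)

print(guarded.num_cached_tokens, unconditional.num_cached_tokens)
```

Under that assumption, the guarded variant keeps the value recorded on the first pass, while the unconditional variant tracks the latest num_computed_tokens.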

[Bugfix] set num_cached_tokens = num_computed_tokens if unset

5a74a1e

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

gcanlin requested a review from hsliuustc0106 as a code owner February 25, 2026 09:19

gcanlin changed the title ~~[Bugfix] set num_cached_tokens = num_computed_tokens if unset~~ [Bugfix] Set num_cached_tokens = num_computed_tokens if unset Feb 25, 2026

hsliuustc0106 mentioned this pull request Feb 25, 2026

[Bugfix]: initialize num_cached_tokens in generation scheduler to prevent metrics crash #1478

Closed

2 tasks

linyueqian approved these changes Feb 25, 2026

linyueqian reviewed Feb 25, 2026

gcanlin closed this Feb 26, 2026

hsliuustc0106 mentioned this pull request Mar 3, 2026

[Bugfix] Fix transformers 5.x compat issues in online TTS serving #1536

Merged

gcanlin wants to merge 1 commit into vllm-project:main from gcanlin:cached_tokens

2 participants
