[Bugfix] Set num_cached_tokens = num_computed_tokens if unset #1471

Closed
gcanlin wants to merge 1 commit into vllm-project:main from gcanlin:cached_tokens

Collaborator

@gcanlin gcanlin commented Feb 25, 2026

Purpose

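Request.num_cached_tokens is initialized to -1 (self.num_cached_tokens = -1 in Request.py). The base vLLM scheduler sets num_cached_tokens = num_computed_tokens when it schedules a request. OmniGenerationScheduler, however, overrides schedule() with a fast path that never updates num_cached_tokens, so the value stays at -1. That negative value then appears to reach a Prometheus counter in the metrics logger, which raises ValueError (see the traceback under Test Result). This PR sets num_cached_tokens = num_computed_tokens in the fast path when it is still unset.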
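
For illustration, the "set if unset" guard can be sketched as below. This is a minimal sketch, not the actual diff: the Request class here is a hypothetical stand-in that carries only the two fields discussed above, and record_cached_tokens is a made-up helper name.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for vLLM's Request (only the fields discussed here)."""

    num_computed_tokens: int = 0
    num_cached_tokens: int = -1  # -1 means "not set yet"


def record_cached_tokens(request: Request) -> None:
    """Set num_cached_tokens once, the first time the request is scheduled."""
    if request.num_cached_tokens < 0:
        request.num_cached_tokens = request.num_computed_tokens


req = Request(num_computed_tokens=12)
record_cached_tokens(req)
first_value = req.num_cached_tokens

# A later scheduling step must not overwrite the value recorded earlier.
req.num_computed_tokens = 40
record_cached_tokens(req)
second_value = req.num_cached_tokens

print(first_value, second_value)
```

The guard makes the assignment idempotent: the value is captured on the first scheduling step and left alone afterwards, even as num_computed_tokens keeps growing.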

Test Plan

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
curl http://localhost:8091/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": "Describe vLLM in brief."}],
    "modalities": ["audio"]
  }'

Test Result

Before (crash):

[Stage-0] INFO 02-25 08:50:26 [loggers.py:259] Engine 000: Avg prompt throughput: 1.8 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(Worker pid=1412101) [Stage-1] INFO 02-25 08:50:27 [mrope.py:345] Multimodal token idx changed!
(Worker pid=1413446) [Stage-2] INFO 02-25 08:50:30 [mrope.py:345] Multimodal token idx changed!
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710] AsyncLLM output_handler failed.
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710] Traceback (most recent call last):
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]   File "/home/guocanlin/vllm-omni-workspace/vllm/vllm/v1/engine/async_llm.py", line 703, in output_handler
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]     logger_manager.record(
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]   File "/home/guocanlin/vllm-omni-workspace/vllm/vllm/v1/metrics/loggers.py", line 1309, in record
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]     logger.record(
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]   File "/home/guocanlin/vllm-omni-workspace/vllm/vllm/v1/metrics/loggers.py", line 1113, in record
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]     self.counter_prompt_tokens_by_source[source][engine_idx].inc(
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]   File "/home/guocanlin/vllm-omni-workspace/.venv/lib/python3.12/site-packages/prometheus_client/metrics.py", line 290, in inc
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710]     raise ValueError('Counters can only be incremented by non-negative amounts.')
[Stage-2] ERROR 02-25 08:50:30 [async_llm.py:710] ValueError: Counters can only be incremented by non-negative amounts.
(APIServer pid=1406484) WARNING 02-25 08:50:30 [serving_chat.py:1304] final output type: text is not needed by the request
(APIServer pid=1406484) INFO:     127.0.0.1:50888 - "POST /v1/chat/completions HTTP/1.1" 200 OK
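
The ValueError in the traceback above comes from incrementing a counter by a negative amount. A minimal sketch of that failure mode follows, assuming the unset -1 is the amount that reaches inc(). NonNegativeCounter is a hypothetical stand-in that mimics the check shown in the traceback; it is not the prometheus_client class itself.

```python
class NonNegativeCounter:
    """Hypothetical stand-in mimicking the non-negative check in the traceback."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1) -> None:
        # Counters are monotonic, so a negative increment is rejected.
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


counter = NonNegativeCounter()
error_message = None

try:
    counter.inc(-1)  # an unset num_cached_tokens would be passed as -1
except ValueError as exc:
    error_message = str(exc)

counter.inc(12)  # a properly set value is accepted

print(error_message)
print(counter.value)
```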

After:

ng: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-0] INFO 02-25 09:12:03 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-2] INFO 02-25 09:12:04 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=1478603) WARNING 02-25 09:18:48 [protocol.py:51] The following fields were present in the request but ignored: {'modalities'}
(APIServer pid=1478603) INFO 02-25 09:18:48 [async_omni.py:327] [AsyncOrchestrator] Entering scheduling loop: stages=3, final_stage=2
[Stage-0] INFO 02-25 09:18:53 [loggers.py:259] Engine 000: Avg prompt throughput: 1.5 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(Worker pid=1484185) [Stage-1] INFO 02-25 09:18:54 [mrope.py:345] Multimodal token idx changed!
[Stage-1] INFO 02-25 09:18:58 [loggers.py:259] Engine 000: Avg prompt throughput: 2.1 tokens/s, Avg generation throughput: 16.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-0] INFO 02-25 09:19:03 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(Worker pid=1485435) [Stage-2] INFO 02-25 09:19:04 [mrope.py:345] Multimodal token idx changed!
(APIServer pid=1478603) WARNING 02-25 09:19:05 [serving_chat.py:1304] final output type: text is not needed by the request
(APIServer pid=1478603) INFO:     127.0.0.1:50892 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[Stage-1] INFO 02-25 09:19:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 25.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-0] INFO 02-25 09:19:13 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-2] INFO 02-25 09:19:14 [loggers.py:259] Engine 000: Avg prompt throughput: 660.8 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-1] INFO 02-25 09:19:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
[Stage-2] INFO 02-25 09:19:24 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts and test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin changed the title from "[Bugfix] set num_cached_tokens = num_computed_tokens if unset" to "[Bugfix] Set num_cached_tokens = num_computed_tokens if unset" on Feb 25, 2026
Collaborator

@linyueqian linyueqian left a comment

LGTM.

One minor question: should the num_cached_tokens assignment be done unconditionally (i.e. drop the if < 0 guard) to match the base scheduler's behavior exactly? The base scheduler always sets num_cached_tokens = num_computed_tokens without a guard. The defensive check is fine for now, just wondering if there's a reason to prefer it over the unconditional approach.

Collaborator

@linyueqian linyueqian left a comment

Correction on my earlier comment: I checked the base vLLM scheduler (vllm/v1/core/sched/scheduler.py:798) and it also uses the same if request.num_cached_tokens < 0 guard. So the approach here is perfectly consistent with upstream. No concerns at all!

2 participants