[Model] Add qwen3-guard model support with streaming input and output.#25463
sighingnow wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces support for the qwen3-guard model and a new resumable request feature to enable streaming input. The changes are extensive, touching many parts of the v1 engine, scheduler, and worker. The implementation is mostly solid and well-integrated. However, I've identified a couple of potential issues regarding robustness and correctness that should be addressed.
```python
if finish_forever:
    request.resumable = False
    if not prompt_token_ids:
        prompt_token_ids = [0]
```
Using a dummy token `[0]` to finalize a resumable request is a bit of a hack and could lead to incorrect behavior. Token 0 may be a meaningful token for some models (e.g., `<s>` or `<unk>`), and processing it could alter the model's state unexpectedly.
A more robust solution would be to handle the finalization of resumable requests without modifying the input tokens. For example, you could introduce a new state or flag in the Request object to signal that it's the final step, and the scheduler can handle it accordingly without needing an extra token to process.
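A minimal sketch of the flag-based alternative described above. The names `Request`, `finalize_requested`, and `schedule_finalize` are hypothetical stand-ins, not the actual vLLM API:

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical stand-in for vLLM's Request object."""
    prompt_token_ids: list[int] = field(default_factory=list)
    resumable: bool = True
    # New flag: signals the final step without injecting a dummy token.
    finalize_requested: bool = False


def schedule_finalize(request: Request) -> list[int]:
    """Finalize a resumable request without mutating its input tokens."""
    request.resumable = False
    request.finalize_requested = True
    # The scheduler checks the flag instead of requiring an extra token
    # to process, so prompt_token_ids stays untouched (no dummy [0]).
    return request.prompt_token_ids


req = Request(prompt_token_ids=[])
tokens = schedule_finalize(req)
assert tokens == []            # no dummy token injected
assert req.finalize_requested  # scheduler sees the final-step signal
```

This keeps the token stream exactly as the user sent it; only the request's state machine records that no further input will arrive.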
```diff
     prompt_lens: torch.Tensor, device: torch.device):
-    assert len(prompt_lens) == len(num_scheduled_tokens)
     n_seq = len(num_scheduled_tokens)
```
The assertion `assert len(prompt_lens) == len(num_scheduled_tokens)` was removed, but it seems important for correctness. `prompt_lens` and `num_scheduled_tokens` should correspond to the same set of requests in the batch, so their lengths must be equal. If they are not, it could lead to broadcasting errors or incorrect behavior downstream, for example in `MeanPool`. It would be safer to re-add this assertion to catch potential bugs early.
```suggestion
assert len(prompt_lens) == len(num_scheduled_tokens)
n_seq = len(num_scheduled_tokens)
```
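To illustrate why the lengths must match, here is a toy mean-pool over variable-length sequences (a pure-Python stand-in, not the actual `MeanPool` implementation): if the two lists describe different request sets, the per-request slicing silently goes wrong, so asserting early fails fast.

```python
def mean_pool(hidden: list[float], prompt_lens: list[int],
              num_scheduled_tokens: list[int]) -> list[float]:
    # Re-adding the removed assertion catches mismatched batches early.
    assert len(prompt_lens) == len(num_scheduled_tokens), (
        "prompt_lens and num_scheduled_tokens must describe the same batch")
    out, offset = [], 0
    for n in num_scheduled_tokens:
        chunk = hidden[offset:offset + n]
        out.append(sum(chunk) / n)
        offset += n
    return out


# Two requests with 2 and 3 scheduled tokens respectively.
pooled = mean_pool([1.0, 3.0, 2.0, 4.0, 6.0], [2, 3], [2, 3])
assert pooled == [2.0, 4.0]

# A mismatched batch now fails fast instead of mispooling downstream.
try:
    mean_pool([1.0, 3.0], [2], [2, 3])
except AssertionError:
    pass
```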
Interesting. Can you elaborate on the use case for streaming the input? And for streaming the output? If we had prefix caching and chunked prefill for ALL pooling, would that meet the requirements for your use case? Also, if requests are resumable, shouldn't there be a timeout to evict them? Otherwise it would be fairly easy to cause a denial of service in vLLM.
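One way to address the eviction concern is a per-request deadline checked on each scheduler step. The sketch below is hypothetical (the `ResumableTracker` class and the timeout knob are not part of this PR):

```python
import time

RESUMABLE_TIMEOUT_S = 60.0  # hypothetical config knob


class ResumableTracker:
    """Tracks last-activity time per resumable request and evicts idle ones."""

    def __init__(self, timeout_s: float = RESUMABLE_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def touch(self, request_id: str) -> None:
        """Record activity (a new input chunk arrived) for a request."""
        self.last_seen[request_id] = time.monotonic()

    def evict_expired(self) -> list[str]:
        """Return and drop requests idle longer than the timeout."""
        now = time.monotonic()
        expired = [rid for rid, t in self.last_seen.items()
                   if now - t > self.timeout_s]
        for rid in expired:
            del self.last_seen[rid]
        return expired


tracker = ResumableTracker(timeout_s=0.0)
tracker.touch("req-1")
time.sleep(0.01)
assert tracker.evict_expired() == ["req-1"]
assert tracker.evict_expired() == []
```

Calling `evict_expired()` once per scheduler step bounds how long an idle resumable request can hold KV-cache blocks.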
```python
if self.model_config.architecture == "Qwen3ForGuardModel":
    logger.info(
        "Enable qwen3_guard logits computation, disable prefix caching."
    )
    self.scheduler_config.long_prefill_token_threshold = 0
    if self.cache_config is not None:
        self.cache_config.enable_prefix_caching = False
```
nit: I think we should do this in vllm/model_executor/models/config.py by defining `verify_and_update_config` for the model.
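For reference, a standalone mimic of that per-model config-hook pattern (the real hook lives in vllm/model_executor/models/config.py; the class and field names below are simplified stand-ins, not the actual vLLM types):

```python
class DummyVllmConfig:
    """Simplified stand-in for vLLM's VllmConfig."""

    def __init__(self):
        self.long_prefill_token_threshold = 1024
        self.enable_prefix_caching = True


class Qwen3GuardModelConfig:
    """Per-model config hook, looked up by architecture name."""

    @staticmethod
    def verify_and_update_config(config: DummyVllmConfig) -> None:
        # qwen3_guard logits computation is incompatible with prefix
        # caching and long-prefill chunking, so disable both here
        # instead of special-casing the architecture in engine code.
        config.long_prefill_token_threshold = 0
        config.enable_prefix_caching = False


# Lookup table keyed by architecture name (stand-in for the real map).
MODELS_CONFIG_MAP = {"Qwen3ForGuardModel": Qwen3GuardModelConfig}

cfg = DummyVllmConfig()
MODELS_CONFIG_MAP["Qwen3ForGuardModel"].verify_and_update_config(cfg)
assert cfg.long_prefill_token_threshold == 0
assert cfg.enable_prefix_caching is False
```

This keeps model-specific overrides next to the other per-model config hooks rather than scattered through the scheduler setup.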
Documentation preview: https://vllm--25463.org.readthedocs.build/en/25463/
If I understand correctly, this PR does not add support for online serving. Will it be extended, or could you provide an example of how to achieve that? When I run it, I get a model_loader error.
```python
if self.model_config.architecture == "Qwen3ForGuardModel":
    logger.info(
        "Enable qwen3_guard logits computation, disable prefix caching."
    )
    self.scheduler_config.long_prefill_token_threshold = 0
    if self.cache_config is not None:
        self.cache_config.enable_prefix_caching = False
```
This should happen in vllm/model_executor/models/qwen3_guard.py instead.
FIX #31975