
[Model] Add qwen3-guard model support with streaming input and output. #25463

Open
sighingnow wants to merge 1 commit into vllm-project:main from sighingnow:dev/stream-guard-v1

Conversation

@sighingnow (Collaborator) commented Sep 23, 2025

FIX #31975

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
@mergify mergify bot added documentation Improvements or additions to documentation new-model Requests to new models qwen Related to Qwen models labels Sep 23, 2025
mergify bot commented Sep 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sighingnow.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gemini-code-assist (Contributor) bot left a comment

Code Review

This pull request introduces support for the qwen3-guard model and a new resumable request feature to enable streaming input. The changes are extensive, touching many parts of the v1 engine, scheduler, and worker. The implementation is mostly solid and well-integrated. However, I've identified a couple of potential issues regarding robustness and correctness that should be addressed.

Comment on lines +1107 to +1110

```python
if finish_forever:
    request.resumable = False
    if not prompt_token_ids:
        prompt_token_ids = [0]
```
high

Using a dummy token [0] to finalize a resumable request is a bit of a hack and could lead to incorrect behavior. The token 0 might be a meaningful token for some models (e.g., <s> or <unk>), and processing it could alter the model's state unexpectedly.

A more robust solution would be to handle the finalization of resumable requests without modifying the input tokens. For example, you could introduce a new state or flag in the Request object to signal that it's the final step, and the scheduler can handle it accordingly without needing an extra token to process.
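A minimal sketch of that flag-based alternative might look as follows. The names `Request`, `finalizing`, and `finalize` are hypothetical, chosen for illustration, and are not vLLM APIs:

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_token_ids: list[int] = field(default_factory=list)
    resumable: bool = True
    finalizing: bool = False  # signals the final step without a dummy token


def finalize(request: Request, finish_forever: bool) -> None:
    if finish_forever:
        request.resumable = False
        # Set a flag instead of appending a dummy token such as [0],
        # which may be a meaningful token for some tokenizers.
        request.finalizing = True


req = Request()
finalize(req, finish_forever=True)
assert not req.resumable and req.finalizing
assert req.prompt_token_ids == []  # input tokens left untouched
```

The scheduler could then check `finalizing` and run the final step without feeding any extra token through the model.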

```python
                     prompt_lens: torch.Tensor, device: torch.device):
    assert len(prompt_lens) == len(num_scheduled_tokens)

    n_seq = len(num_scheduled_tokens)
```

high

The assertion assert len(prompt_lens) == len(num_scheduled_tokens) was removed, but it seems important for correctness. prompt_lens and num_scheduled_tokens should correspond to the same set of requests in the batch, so their lengths should be equal. If they are not, it could lead to broadcasting errors or incorrect behavior downstream, for example in MeanPool. It would be safer to re-add this assertion to catch potential bugs early.

Suggested change

```diff
-n_seq = len(num_scheduled_tokens)
+assert len(prompt_lens) == len(num_scheduled_tokens)
+n_seq = len(num_scheduled_tokens)
```

@maxdebayser (Contributor)

Interesting. Can you elaborate on the use case for streaming the input? And for streaming the output? Let's say that we had prefix caching and chunked prefill for ALL pooling, would that meet the requirements for your use case?

Also, if requests are resumable, shouldn't there be a timeout to evict stale requests? Otherwise it would be fairly easy to cause a denial of service in vLLM.

Comment on lines +3706 to +3712
```python
if self.model_config.architecture == "Qwen3ForGuardModel":
    logger.info(
        "Enable qwen3_guard logits computation, disable prefix caching."
    )
    self.scheduler_config.long_prefill_token_threshold = 0
    if self.cache_config is not None:
        self.cache_config.enable_prefix_caching = False
```
Member

nit: I think we should do this in vllm/model_executor/models/config.py by defining verify_and_update_config for the model
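A hedged sketch of what that per-model hook could look like; the stub config classes below stand in for vLLM's real config objects, and `Qwen3GuardModelConfig` is a hypothetical name, with only the override logic mirroring the PR:

```python
class CacheConfig:
    enable_prefix_caching = True


class SchedulerConfig:
    long_prefill_token_threshold = 8192


class VllmConfig:
    def __init__(self):
        self.cache_config = CacheConfig()
        self.scheduler_config = SchedulerConfig()


class Qwen3GuardModelConfig:
    @staticmethod
    def verify_and_update_config(vllm_config: "VllmConfig") -> None:
        # Same effect as the inline branch: force per-chunk scheduling and
        # disable prefix caching for Qwen3ForGuardModel.
        vllm_config.scheduler_config.long_prefill_token_threshold = 0
        if vllm_config.cache_config is not None:
            vllm_config.cache_config.enable_prefix_caching = False


cfg = VllmConfig()
Qwen3GuardModelConfig.verify_and_update_config(cfg)
assert cfg.scheduler_config.long_prefill_token_threshold == 0
assert cfg.cache_config.enable_prefix_caching is False
```

Registering the hook per model keeps architecture-specific overrides out of the generic config path.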

mergify bot commented Oct 8, 2025

Documentation preview: https://vllm--25463.org.readthedocs.build/en/25463/

@sfbemerk (Contributor) commented Oct 8, 2025

If I understand correctly, this PR does not add support for online serving. Will it be extended, or could you provide an example of how to achieve that?

When I run

```shell
pip install git+https://github.com/sighingnow/vllm@dev/stream-guard-v1
vllm serve "Qwen/Qwen3Guard-Stream-8B" --max-model-len 8192
```

I get the following model loader error:

```
ValueError: Following weights were not initialized from checkpoint: {'lm_head.weight'}
```

Comment on lines +3706 to +3713
```python
if self.model_config.architecture == "Qwen3ForGuardModel":
    logger.info(
        "Enable qwen3_guard logits computation, disable prefix caching."
    )
    self.scheduler_config.long_prefill_token_threshold = 0
    if self.cache_config is not None:
        self.cache_config.enable_prefix_caching = False
```

Member

This should happen in vllm/model_executor/models/qwen3_guard.py


Labels

documentation Improvements or additions to documentation needs-rebase new-model Requests to new models qwen Related to Qwen models v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[New Model]: Qwen3Guard Stream

6 participants