[Model] Add qwen3-guard model support with streaming input and output.#25463
sighingnow wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces support for the qwen3-guard model and a new resumable request feature to enable streaming input. The changes are extensive, touching many parts of the v1 engine, scheduler, and worker. The implementation is mostly solid and well-integrated. However, I've identified a couple of potential issues regarding robustness and correctness that should be addressed.
```python
if finish_forever:
    request.resumable = False
    if not prompt_token_ids:
        prompt_token_ids = [0]
```
Using a dummy token `[0]` to finalize a resumable request is a bit of a hack and could lead to incorrect behavior. Token 0 may be a meaningful token for some models (e.g., `<s>` or `<unk>`), and processing it could alter the model's state unexpectedly.
A more robust solution would be to handle the finalization of resumable requests without modifying the input tokens. For example, you could introduce a new state or flag in the Request object to signal that it's the final step, and the scheduler can handle it accordingly without needing an extra token to process.
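A minimal sketch of the flag-based alternative described above. The names `Request`, `finalize_requested`, and `schedule_finalize` are hypothetical stand-ins, not the actual vLLM API:

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical stand-in for vLLM's Request object."""
    prompt_token_ids: list[int] = field(default_factory=list)
    resumable: bool = True
    # New flag: signals the final step without injecting a dummy token.
    finalize_requested: bool = False


def schedule_finalize(request: Request) -> list[int]:
    """Finalize a resumable request without mutating its input tokens."""
    request.resumable = False
    request.finalize_requested = True
    # The scheduler checks the flag instead of requiring an extra token
    # to process, so prompt_token_ids stays untouched (no dummy [0]).
    return request.prompt_token_ids


req = Request(prompt_token_ids=[])
tokens = schedule_finalize(req)
assert tokens == []            # no dummy token injected
assert req.finalize_requested  # scheduler sees the final-step signal
```

This keeps the token stream exactly as the user sent it; only the request's state machine records that no further input will arrive.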
```diff
     prompt_lens: torch.Tensor, device: torch.device):
-    assert len(prompt_lens) == len(num_scheduled_tokens)
     n_seq = len(num_scheduled_tokens)
```
The assertion `assert len(prompt_lens) == len(num_scheduled_tokens)` was removed, but it seems important for correctness. `prompt_lens` and `num_scheduled_tokens` should correspond to the same set of requests in the batch, so their lengths must be equal. If they are not, it could lead to broadcasting errors or incorrect behavior downstream, for example in `MeanPool`. It would be safer to re-add this assertion to catch potential bugs early.
```suggestion
assert len(prompt_lens) == len(num_scheduled_tokens)
n_seq = len(num_scheduled_tokens)
```
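To illustrate why the lengths must match, here is a toy mean-pool over variable-length sequences (a pure-Python stand-in, not the actual `MeanPool` implementation): if the two lists describe different request sets, the per-request slicing silently goes wrong, so asserting early fails fast.

```python
def mean_pool(hidden: list[float], prompt_lens: list[int],
              num_scheduled_tokens: list[int]) -> list[float]:
    # Re-adding the removed assertion catches mismatched batches early.
    assert len(prompt_lens) == len(num_scheduled_tokens), (
        "prompt_lens and num_scheduled_tokens must describe the same batch")
    out, offset = [], 0
    for n in num_scheduled_tokens:
        chunk = hidden[offset:offset + n]
        out.append(sum(chunk) / n)
        offset += n
    return out


# Two requests with 2 and 3 scheduled tokens respectively.
pooled = mean_pool([1.0, 3.0, 2.0, 4.0, 6.0], [2, 3], [2, 3])
assert pooled == [2.0, 4.0]

# A mismatched batch now fails fast instead of mispooling downstream.
try:
    mean_pool([1.0, 3.0], [2], [2, 3])
except AssertionError:
    pass
```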
Interesting. Can you elaborate on the use case for streaming the input? And for streaming the output? If we had prefix caching and chunked prefill for ALL pooling, would that meet the requirements for your use case? Also, if requests are resumable, shouldn't there be a timeout to evict them? Otherwise it would be fairly easy to cause a denial of service in vLLM.
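One way to address the eviction concern is a per-request deadline checked on each scheduler step. The sketch below is hypothetical (the `ResumableTracker` class and the timeout knob are not part of this PR):

```python
import time

RESUMABLE_TIMEOUT_S = 60.0  # hypothetical config knob


class ResumableTracker:
    """Tracks last-activity time per resumable request and evicts idle ones."""

    def __init__(self, timeout_s: float = RESUMABLE_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def touch(self, request_id: str) -> None:
        """Record activity (a new input chunk arrived) for a request."""
        self.last_seen[request_id] = time.monotonic()

    def evict_expired(self) -> list[str]:
        """Return and drop requests idle longer than the timeout."""
        now = time.monotonic()
        expired = [rid for rid, t in self.last_seen.items()
                   if now - t > self.timeout_s]
        for rid in expired:
            del self.last_seen[rid]
        return expired


tracker = ResumableTracker(timeout_s=0.0)
tracker.touch("req-1")
time.sleep(0.01)
assert tracker.evict_expired() == ["req-1"]
assert tracker.evict_expired() == []
```

Calling `evict_expired()` once per scheduler step bounds how long an idle resumable request can hold KV-cache blocks.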
```python
if self.model_config.architecture == "Qwen3ForGuardModel":
    logger.info(
        "Enable qwen3_guard logits computation, disable prefix caching."
    )
    self.scheduler_config.long_prefill_token_threshold = 0
    if self.cache_config is not None:
        self.cache_config.enable_prefix_caching = False
```
nit: I think we should do this in vllm/model_executor/models/config.py by defining `verify_and_update_config` for the model.
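For reference, a standalone mimic of that per-model config-hook pattern (the real hook lives in vllm/model_executor/models/config.py; the class and field names below are simplified stand-ins, not the actual vLLM types):

```python
class DummyVllmConfig:
    """Simplified stand-in for vLLM's VllmConfig."""

    def __init__(self):
        self.long_prefill_token_threshold = 1024
        self.enable_prefix_caching = True


class Qwen3GuardModelConfig:
    """Per-model config hook, looked up by architecture name."""

    @staticmethod
    def verify_and_update_config(config: DummyVllmConfig) -> None:
        # qwen3_guard logits computation is incompatible with prefix
        # caching and long-prefill chunking, so disable both here
        # instead of special-casing the architecture in engine code.
        config.long_prefill_token_threshold = 0
        config.enable_prefix_caching = False


# Lookup table keyed by architecture name (stand-in for the real map).
MODELS_CONFIG_MAP = {"Qwen3ForGuardModel": Qwen3GuardModelConfig}

cfg = DummyVllmConfig()
MODELS_CONFIG_MAP["Qwen3ForGuardModel"].verify_and_update_config(cfg)
assert cfg.long_prefill_token_threshold == 0
assert cfg.enable_prefix_caching is False
```

This keeps model-specific overrides next to the other per-model config hooks rather than scattered through the scheduler setup.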
Documentation preview: https://vllm--25463.org.readthedocs.build/en/25463/
If I understand correctly, this PR does not add support for online serving. Will it be extended, or could you provide an example of how to achieve that? When I run it, I get a model_loader error.
```python
if self.model_config.architecture == "Qwen3ForGuardModel":
    logger.info(
        "Enable qwen3_guard logits computation, disable prefix caching."
    )
    self.scheduler_config.long_prefill_token_threshold = 0
    if self.cache_config is not None:
        self.cache_config.enable_prefix_caching = False
```
This should happen in vllm/model_executor/models/qwen3_guard.py instead.
FIX #31975