fix: reject non-text content in system/developer messages#33981
fix: reject non-text content in system/developer messages#33981veeceey wants to merge 3 commits intovllm-project:mainfrom
Conversation
6de61df to
cebb571
Compare
|
Hi @veeceey, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
There was a problem hiding this comment.
Code Review
This pull request correctly identifies and addresses a deviation from the OpenAI API specification by restricting system and developer messages to text-only content. The implementation adds a new validation function and comprehensive tests to ensure compliance. While the approach is sound, I've found a critical flaw in the validation logic. It doesn't account for simplified multimodal content formats (e.g., without an explicit type key), which allows the validation to be bypassed. My review includes a suggested fix to make the validation more robust by leveraging existing parsing logic.
vllm/entrypoints/chat_utils.py
Outdated
| for part in content: | ||
| if isinstance(part, str): | ||
| continue | ||
| part_type = part.get("type") | ||
| if part_type is not None and part_type not in _TEXT_CONTENT_TYPES: | ||
| raise ValueError( | ||
| f"Content part type '{part_type}' is not supported " | ||
| f"in '{role}' messages. Only text content is accepted " | ||
| f"for '{role}' role messages." | ||
| ) |
There was a problem hiding this comment.
The validation logic here is incomplete as it only checks for an explicit type key in a content part. This allows non-text content to be accepted if specified in a simplified format (e.g., {"image_url": "..."}) or when a uuid is present, which alters type inference. This bypasses the intended validation, potentially leading to unexpected behavior.
To ensure the validation is robust, it should infer the content type using the same logic as _parse_chat_message_content_mm_part. Reusing this parsing logic for type inference will make the validation more accurate and prevent this bypass.
for part in content:
if isinstance(part, str):
continue
# We must use the same part type inference logic from
# `_parse_chat_message_content_mm_part` to correctly validate
# all possible input formats. This includes simplified formats where
# the 'type' key is omitted, or where a 'uuid' can override the type.
try:
part_type, _ = _parse_chat_message_content_mm_part(part)
except ValueError:
# If the part is malformed, let the main parsing logic handle it.
# For this validation, we can assume it's not a non-text part.
continue
if part_type not in _TEXT_CONTENT_TYPES:
raise ValueError(
f"Content part type '{part_type}' is not supported "
f"in '{role}' messages. Only text content is accepted "
f"for '{role}' role messages."
)There was a problem hiding this comment.
Good catch! Fixed. The validation now also checks for the presence of known multimodal dict keys (image_url, audio_url, video_url, input_audio, image_pil, image_embeds, audio_embeds) in addition to the explicit type field. This means content like {"image_url": "..."} without a type field will now be correctly rejected for system/developer roles.
I added a _MULTIMODAL_CONTENT_KEYS frozenset and the inline check uses set(part.keys()) & _MULTIMODAL_CONTENT_KEYS to detect these cases. New tests have been added to cover all the no-type-key scenarios.
vllm/entrypoints/chat_utils.py
Outdated
|
|
||
| See: https://platform.openai.com/docs/api-reference/chat/create | ||
| """ | ||
| for part in content: |
There was a problem hiding this comment.
Could we move this into another existing for-loop to avoid introducing an extra loop?
There was a problem hiding this comment.
Done! I've removed the separate _validate_text_only_content() function and its dedicated loop entirely. The validation now happens inline inside _parse_chat_message_content_part(), which is already called for each part in the existing for-loop in _parse_chat_message_content_parts(). This avoids introducing an extra iteration over the content parts.
The role parameter is now passed through so the per-part function can check text-only constraints before proceeding with multimodal parsing.
Address reviewer feedback on PR vllm-project#33981: 1. Merge the separate `_validate_text_only_content()` pre-scan loop into the existing per-part loop inside `_parse_chat_message_content_part()`, eliminating the extra iteration over content parts. 2. Detect multimodal content even when the `type` key is absent by checking for known multimodal dict keys (image_url, audio_url, video_url, input_audio, image_pil, image_embeds, audio_embeds). This closes the gap where `{"image_url": "..."}` (without a `type` field) would bypass the validation. Signed-off-by: Varun Chawla <varun_6april@hotmail.com>
Address reviewer feedback on PR vllm-project#33981: 1. Merge the separate `_validate_text_only_content()` pre-scan loop into the existing per-part loop inside `_parse_chat_message_content_part()`, eliminating the extra iteration over content parts. 2. Detect multimodal content even when the `type` key is absent by checking for known multimodal dict keys (image_url, audio_url, video_url, input_audio, image_pil, image_embeds, audio_embeds). This closes the gap where `{"image_url": "..."}` (without a `type` field) would bypass the validation. Signed-off-by: Varun Chawla <varun_6april@hotmail.com> Signed-off-by: veeceey <veeceey@users.noreply.github.com>
0616598 to
9d8cabc
Compare
Manual test results for system/developer message validationRan 22 manual tests against the validation logic in What was tested:
Looks good. |
|
Hi @chaunceyjiang, friendly ping — I've addressed your feedback by moving the validation into the existing for-loop (no extra loop introduced) and also handling inferred multimodal types. All CI checks are passing. Would you be able to take another look? Thank you! |
Per the OpenAI API spec, system and developer messages only accept text content. Add validation inside the existing per-part parsing loop to reject multimodal content (image_url, audio_url, video_url, input_audio, etc.) for these roles. Handles both explicit type fields and inferred types from dict keys, preventing bypasses via simplified multimodal formats. Fixes vllm-project#33925 Signed-off-by: Varun Chawla <varun_6april@hotmail.com>
9d8cabc to
d1831c1
Compare
|
I thought we agreed in #34072 that we make this a warning only. |
|
Thanks @DarkLight1337, you're right! I'll update this PR to use a warning instead of raising an error, consistent with what we settled on in #34072. Will push the update shortly. |
Instead of raising a ValueError when system/developer messages contain non-text content, issue a logger.warning and skip the part. This is consistent with the decision in vllm-project#34072. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Good catch — changed the validation to issue a warning instead of raising an error, consistent with the decision in #34072. |
| "for '%s' role messages. Skipping this content part.", | ||
| label, role, role, | ||
| ) | ||
| return None |
There was a problem hiding this comment.
We should still allow it. But we can log a warning that it is outside of OpenAI spec.
There was a problem hiding this comment.
So I think this PR isn't really needed as #34072 does that already
|
Hi @veeceey, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Add missing blank lines after import statements and split long function arguments across multiple lines per project style guide. Signed-off-by: Varun Chawla <varun_6april@hotmail.com>
|
Thanks @DarkLight1337 — you're right that #34072 covers the warning-only approach. I've just pushed a commit that fixes the pre-commit formatting issues (missing blank lines after However, if #34072 already fully handles this, I'm happy to close this PR. Could you confirm whether #34072 covers the same validation paths (both explicit |
|
Thanks @DarkLight1337 for confirming. Since #34072 covers the same validation with the warning-only approach, I'll go ahead and close this PR to avoid duplication. Appreciate the guidance! |
|
Closing as this is superseded by #34072. Thanks @DarkLight1337 for confirming! |
Summary
Fixes #33925
Per the OpenAI API specification,
systemanddeveloperrole messages should only accepttextcontent type. Previously, vLLM allowed multimodal content (e.g.image_url,input_audio,video_url) in system messages without any validation, which diverges from the OpenAI API behavior.Changes
vllm/entrypoints/chat_utils.py: Added a_validate_text_only_content()function that checks content parts for system/developer messages and raises aValueErrorwhen non-text content types (e.g.image_url,input_audio,video_url) are found. The validation runs inside_parse_chat_message_content()before content parts are parsed, ensuring both sync and async code paths are covered. TheValueErroris caught by the serving layer's existing error handling and returned as a proper error response.tests/entrypoints/test_chat_utils.py: Added parametrized tests covering:image_url,input_audio, andvideo_urlcontent in bothsystemanddeveloperrolesTest plan
test_system_message_rejects_non_text_content-- verifiesValueErroris raised forimage_url,input_audio,video_urlin system/developer messagestest_system_message_accepts_text_content-- verifies text content parts are acceptedtest_system_message_accepts_string_content-- verifies plain string content is accepted